[BULK] Re: [WebDNA] set HTTP-Status Code from webdna

This WebDNA talk-list message is from 2016. It keeps the original formatting.
Backstory: the site in question is a replacement-parts business and has hundreds of thousands of pages of cross-reference material, all stored in databases and generated as needed. Competitors, and dealers that carry competitors' brands of parts, seem to think that copying our cross-reference is easier than creating their own (it would be), so code was written to block this.

YES, I KNOW that if they are determined, they will find a way around my blockades (I've seen quite a few variations on this: Tor, AWS, other VPNs…).

Solution: looking at the stats for average use of the website, we found that 95% of site traffic visited 14 pages or fewer. So…
I have a visitors.db. The system logs all page requests, tracked by IP address, and after a set amount (more than 14 pages, but still a pretty low number) it starts showing visitors a nice Page Limit Exceeded page instead of what they were crawling through. After an unreasonable number of pages I just 404 them out to save server time and bandwidth. The count resets at midnight, because I'm far too lazy to track 24 hours since the first or last page request (per IP). In some cases, when I'm feeling particularly mischievous, once a bot is detected I start feeding them fake info :D

Here's the Visitors.db header (not sure if it will help, but it is what it is):
VID  IPadd  ipperm  ipname  visitdate  pagecount  starttime  endtime  domain  firstpage  lastpage  browtype  lastsku  partner  linkin  page9  page8  page7  page6  page5  page4  page3  page2  page1


All the code that does the tracking and counting and map/reduction to store stats and stuff is proprietary (sorry), but I'll see what (if anything) I can share a bit later, and try to write it up as a blog post or something.
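
For illustration only, here is a minimal hypothetical sketch of the per-IP counting idea in WebDNA, using the field names from the Visitors.db header above. It is not the production code, and the date-keyed matching is simplified:

[!] hypothetical sketch only: look up today's record for this visitor [/!]
[search db=visitors.db&eqIPadddatarq=[ipaddress]&eqvisitdatedatarq=[date]]
[showif [numfound]=0]
[!] first request today from this IP: start a fresh record [/!]
[append db=visitors.db]IPadd=[ipaddress]&visitdate=[date]&starttime=[time]&pagecount=1[/append]
[/showif]
[founditems]
[!] otherwise bump today's counter for this IP [/!]
[replace db=visitors.db&eqIPadddatarq=[ipaddress]&eqvisitdatedatarq=[date]]pagecount=[math][pagecount]+1[/math][/replace]
[/founditems]
[/search]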

-Brian B. Burton

On Mar 24, 2016, at 11:41 AM, Jym Duane <jym@purposemedia.com> wrote:

Curious how to determine... non-Google/Bing/Yahoo bots and others attempting to crawl/copy the entire site?



On 3/24/2016 9:28 AM, Brian Burton wrote:
Noah, 

Similar to you, and wanting to use pretty URLs, I built something similar but did it a different way.
_All_ page requests are caught by a URL-rewrite rule and get sent to dispatch.tpl.
Dispatch.tpl has hundreds of rules that decide what page to show, and uses includes to do it.
(This keeps everything in-house to WebDNA so I don't have to go mucking about in WebDNA here, and Apache there, and Linux somewhere else, etc…)

Three special circumstances came up that needed special code to send out proper HTTP status codes:

<!-- for page URLs that have permanently moved (WebDNA sends out a 302 Moved Temporarily code on a redirect) -->
[function name=301public]
[text]eol=[unurl]%0D%0A[/unurl][/text]
[returnraw]HTTP/1.1 301 Moved Permanently[eol]Location: http://www.example.com[link][eol][eol][/returnraw]
[/function]
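
The [text]eol=[unurl]%0D%0A[/unurl][/text] line just stores a carriage-return/line-feed pair in a variable named eol, so the raw HTTP header lines are terminated correctly. A hypothetical call from dispatch.tpl (the target path is only an example) would pass the new location in the link parameter:

[!] example only: an old catalog URL that moved permanently to a new location [/!]
[301public link=/parts/cross-reference/acme-1234.html]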

<!-- I send this to non-Google/Bing/Yahoo bots and others attempting to crawl/copy the entire site -->
[function name=404hard]
[text]eol=[unurl]%0D%0A[/unurl][/text]
[returnraw]HTTP/1.0 404 Not Found[eol]Status: 404 Not Found[eol]Content-type: text/html[eol][eol]<html>[eol]<body>[eol]<h1>404 Not Found</h1>[eol]The page that you have requested ([thisurl]) could not be found.[eol]</body>[eol]</html>[/returnraw]
[/function]

<!-- and finally a pretty 404 page for humans -->
[function name=404soft]
[text]eol=[unurl]%0D%0A[/unurl][/text]
[returnraw]HTTP/1.0 404 Not Found[eol]Status: 404 Not Found[eol]Content-type: text/html[eol][eol][include file=/404pretty.tpl][/returnraw]
[/function]
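
Putting it together, a minimal hypothetical dispatch.tpl excerpt might branch roughly like this. The thresholds, paths, and pagelimit.tpl include are placeholders, it assumes [pagecount] for the current IP has already been looked up from visitors.db, and it assumes [returnraw] inside the functions ends the response:

[!] hypothetical dispatch excerpt, not the production rules [/!]
[showif [pagecount]>200][404hard][/showif]
[showif [pagecount]>20][include file=/pagelimit.tpl][/showif]
[!] example URL rules for moved and current pages [/!]
[showif [thisurl]=/old/pricelist.html][301public link=/parts/pricelist.html][/showif]
[showif [thisurl]=/parts/index.html][include file=/parts/index.tpl][/showif]
[!] ...hundreds more rules; anything left unmatched falls through to the pretty 404 [/!]
[404soft]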

Hope this helps
-Brian B. Burton
