[BULK] Re: [WebDNA] set HTTP-Status Code from webdna

This WebDNA talk-list message is from 2016. It keeps the original formatting.
Backstory: the site in question is a replacement-parts business and has hundreds of thousands of pages of cross-reference material, all stored in databases and generated as needed. Competitors, and dealers that carry competitors' brands of parts, seem to think that copying our cross-reference is easier than creating their own (it would be), so code was written to block this.

YES, I KNOW that if they are determined, they will find a way around my blockades (I've seen quite a few variations on this: Tor, AWS, other VPNs…).

Solution: looking at the stats for average use of the website, we found that 95% of site traffic visited 14 pages or fewer. So…
I have a visitors.db. The system logs all page requests, tracked by IP address, and after a set amount (more than 14 pages, but still a pretty low number) it starts showing visitors a nice Page Limit Exceeded page instead of what they were crawling through. After an unreasonable number of pages I just 404 them out to save server time and bandwidth. The count resets at midnight, because I'm far too lazy to track 24 hours since the first or last page request (per IP). In some cases, when I'm feeling particularly mischievous, once a bot is detected I start feeding them fake info :D

Here's the Visitors.db header (not sure if it will help, but it is what it is):
VID  IPadd  ipperm  ipname  visitdate  pagecount  starttime  endtime  domain  firstpage  lastpage  browtype  lastsku  partner  linkin  page9  page8  page7  page6  page5  page4  page3  page2  page1


All the code that does the tracking and counting and map/reduction to store stats and stuff is proprietary (sorry), but I'll see what (if anything) I can share a bit later, and try to write it up as a blog post or something.
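
For illustration only, here is a minimal hypothetical sketch of the per-IP counting idea in WebDNA, using the field names from the Visitors.db header above. It is not the production code, and the date-keyed matching is simplified:

[!] hypothetical sketch only: look up today's record for this visitor [/!]
[search db=visitors.db&eqIPadddatarq=[ipaddress]&eqvisitdatedatarq=[date]]
[showif [numfound]=0]
[!] first request today from this IP: start a fresh record [/!]
[append db=visitors.db]IPadd=[ipaddress]&visitdate=[date]&starttime=[time]&pagecount=1[/append]
[/showif]
[founditems]
[!] otherwise bump today's counter for this IP [/!]
[replace db=visitors.db&eqIPadddatarq=[ipaddress]&eqvisitdatedatarq=[date]]pagecount=[math][pagecount]+1[/math][/replace]
[/founditems]
[/search]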

-Brian B. Burton

On Mar 24, 2016, at 11:41 AM, Jym Duane <jym@purposemedia.com> wrote:

Curious how to determine... non-Google/Bing/Yahoo bots and others attempting to crawl/copy the entire site?



On 3/24/2016 9:28 AM, Brian Burton wrote:
Noah, 

Similar to you, and wanting to use pretty URLs, I built something similar but did it a different way.
_All_ page requests are caught by a URL-rewrite rule and get sent to dispatch.tpl.
Dispatch.tpl has hundreds of rules that decide what page to show, and uses includes to do it.
(This keeps everything in-house to WebDNA so I don't have to go mucking about in WebDNA here, and Apache there, and Linux somewhere else, etc…)

Three special circumstances came up that needed special code to send out proper HTTP status codes:

<!-- for page URLs that have permanently moved (WebDNA sends out a 302 Moved Temporarily code on a redirect) -->
[function name=301public]
[text]eol=[unurl]%0D%0A[/unurl][/text]
[returnraw]HTTP/1.1 301 Moved Permanently[eol]Location: http://www.example.com[link][eol][eol][/returnraw]
[/function]
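
The [text]eol=[unurl]%0D%0A[/unurl][/text] line just stores a carriage-return/line-feed pair in a variable named eol, so the raw HTTP header lines are terminated correctly. A hypothetical call from dispatch.tpl (the target path is only an example) would pass the new location in the link parameter:

[!] example only: an old catalog URL that moved permanently to a new location [/!]
[301public link=/parts/cross-reference/acme-1234.html]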

<!-- I send this to non-Google/Bing/Yahoo bots and others attempting to crawl/copy the entire site -->
[function name=404hard]
[text]eol=[unurl]%0D%0A[/unurl][/text]
[returnraw]HTTP/1.0 404 Not Found[eol]Status: 404 Not Found[eol]Content-type: text/html[eol][eol]<html>[eol]<body>[eol]<h1>404 Not Found</h1>[eol]The page that you have requested ([thisurl]) could not be found.[eol]</body>[eol]</html>[/returnraw]
[/function]

<!-- and finally a pretty 404 page for humans -->
[function name=404soft]
[text]eol=[unurl]%0D%0A[/unurl][/text]
[returnraw]HTTP/1.0 404 Not Found[eol]Status: 404 Not Found[eol]Content-type: text/html[eol][eol][include file=/404pretty.tpl][/returnraw]
[/function]
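
Putting it together, a minimal hypothetical dispatch.tpl excerpt might branch roughly like this. The thresholds, paths, and pagelimit.tpl include are placeholders, it assumes [pagecount] for the current IP has already been looked up from visitors.db, and it assumes [returnraw] inside the functions ends the response:

[!] hypothetical dispatch excerpt, not the production rules [/!]
[showif [pagecount]>200][404hard][/showif]
[showif [pagecount]>20][include file=/pagelimit.tpl][/showif]
[!] example URL rules for moved and current pages [/!]
[showif [thisurl]=/old/pricelist.html][301public link=/parts/pricelist.html][/showif]
[showif [thisurl]=/parts/index.html][include file=/parts/index.tpl][/showif]
[!] ...hundreds more rules; anything left unmatched falls through to the pretty 404 [/!]
[404soft]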

Hope this helps
-Brian B. Burton
