Re: [BULK] Re: [WebDNA] set HTTP-Status Code from webdna

This WebDNA talk-list message is from 2016. It keeps the original formatting.
numero = 112676
interpreted = N
texte =

Chris,

I'm happy to have people bookmark or link to my websites. Some people, however, use an actual software bot (spider), or a human bot (dozens of people in Russia, China or India), to go page by page and copy/paste info off my webpages into their own product databases. (You can tell the difference by the frequency with which an IP address requests pages: software is consistent, humans aren't.)

That's not to say there aren't some great uses for the session tag, but what I have is working, and I'm not sure how the [session] tag would improve it.

-Brian B. Burton

> On Mar 24, 2016, at 12:50 PM, christophe.billiottet@webdna.us wrote:
>
> What about using [referrer] to allow your customers to navigate your website but disallow bookmarking and outside links? You could also use [session] to limit the navigation to X minutes or Y pages, even for bots, then "kick" the visitor out.
>
> - chris
>
>> On Mar 24, 2016, at 20:30, Brian Burton wrote:
>>
>> Backstory: the site in question is a replacement-parts business and has hundreds of thousands of pages of cross-reference material, all stored in databases and generated as needed. Competitors, and dealers that carry competitors' brands of parts, seem to think that copying our cross-reference is easier than creating their own (it would be), so code was written to block this.
>>
>> YES, I KNOW that if they are determined, they will find a way around my blockades (I've seen quite a few variations on this: Tor, AWS, other VPNs...)
>>
>> Solution: looking at the stats for the average use of the website, we found that 95% of the site traffic visited 14 pages or fewer. So...
>> I have a visitors.db. The system logs all page requests tracked by IP address, and after a set amount (more than 14 pages, but still a pretty low number) it starts showing visitors a nice "Page Limit Exceeded" page instead of what they were crawling through. After an unreasonable number of pages I just 404 them out to save server time and bandwidth. The count resets at midnight, because I'm far too lazy to track 24 hours since the first or last page request (per IP). In some cases, when I'm feeling particularly mischievous, once a bot is detected I start feeding them fake info :D
>>
>> Here's the visitors.db header (not sure if it will help, but it is what it is):
>> VID IPadd ipperm ipname visitdate pagecount starttime endtime domain firstpage lastpage browtype lastsku partner linkin page9 page8 page7 page6 page5 page4 page3 page2 page1
>>
>> All the code that does the tracking and counting and map/reduction to store stats and stuff is proprietary (sorry), but I'll see what (if anything) I can share a bit later, and try to write it up as a blog post or something.
>>
>> -Brian B. Burton
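A minimal sketch of the per-IP page counting Brian describes above, written against the visitors.db fields he lists. The thresholds (20 and 500), the pagelimit.tpl include, and the call to the [404hard] function posted further down the thread are assumptions for illustration; Brian's actual tracking code is proprietary and not shown.

[!] Hypothetical per-request counter; run near the top of the dispatcher. [/!]
[search db=visitors.db&eqIPadddatarq=[ipaddress]]
  [showif [numfound]=0]
    [!] first request from this IP today: create a record [/!]
    [append db=visitors.db]IPadd=[ipaddress]&visitdate=[date]&starttime=[time]&pagecount=1[/append]
  [/showif]
  [showif [numfound]>0]
    [founditems][text]hits=[math][pagecount]+1[/math][/text][/founditems]
    [replace db=visitors.db&eqIPadddatarq=[ipaddress]]pagecount=[hits]&endtime=[time][/replace]
    [!] unreasonable number of pages: 404 them out to save server time and bandwidth [/!]
    [showif [hits]>500][404hard][/showif]
    [!] over the polite limit but not yet absurd: show the "Page Limit Exceeded" page [/!]
    [showif [hits]>20][showif [hits]<501][include file=pagelimit.tpl][/showif][/showif]
  [/showif]
[/search]

A nightly reset of the counters is sketched after the thread.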
>>> On Mar 24, 2016, at 11:41 AM, Jym Duane wrote:
>>>
>>> Curious how to determine... non-Google/Bing/Yahoo bots and others attempting to crawl/copy the entire site?
>>>
>>> On 3/24/2016 9:28 AM, Brian Burton wrote:
>>>> Noah,
>>>>
>>>> Similar to you, and wanting to use pretty URLs, I built something similar, but did it a different way.
>>>> _All_ page requests are caught by a url-rewrite rule and get sent to dispatch.tpl.
>>>> Dispatch.tpl has hundreds of rules that decide what page to show, and uses includes to do it.
>>>> (This keeps everything in-house to WebDNA so I don't have to go mucking about in WebDNA here, and Apache there, and Linux somewhere else, etc...)
>>>>
>>>> Three special circumstances came up that needed special code to send out proper HTTP status codes:
>>>>
>>>> [function name=301public]
>>>> [text]eol=[unurl]%0D%0A[/unurl][/text]
>>>> [returnraw]HTTP/1.1 301 Moved Permanently[eol]Location: http://www.example.com[link][eol][eol][/returnraw]
>>>> [/function]
>>>>
>>>> [function name=404hard]
>>>> [text]eol=[unurl]%0D%0A[/unurl][/text]
>>>> [returnraw]HTTP/1.0 404 Not Found[eol]Status: 404 Not Found[eol]Content-type: text/html[eol][eol]404 Not Found[eol]The page that you have requested ([thisurl]) could not be found.[eol][eol][/returnraw]
>>>> [/function]
>>>>
>>>> [function name=404soft]
>>>> [text]eol=[unurl]%0D%0A[/unurl][/text]
>>>> [returnraw]HTTP/1.0 404 Not Found[eol]Status: 404 Not Found[eol]Content-type: text/html[eol][eol][include file=/404pretty.tpl][/returnraw]
>>>> [/function]
>>>>
>>>> Hope this helps
>>>> -Brian B. Burton
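As a usage illustration only, here is a stripped-down dispatch.tpl fragment showing how functions like the three above might be wired in. The URL patterns, parts.db, the sku variable and partpage.tpl are invented for the example; the real dispatcher has hundreds of rules.

[!] Hypothetical dispatch.tpl fragment; only the function names come from the thread. [/!]
[text]path=[thisurl][/text]

[!] Rule: old catalog URLs moved permanently to /parts/ [/!]
[showif [path]^/oldcatalog/][301public link=/parts/][/showif]

[!] Rule: cross-reference pages, looked up by a sku value passed in by the rewrite rule [/!]
[showif [path]^/parts/]
  [search db=parts.db&eqSKUdatarq=[sku]]
    [showif [numfound]>0][founditems][include file=partpage.tpl][/founditems][/showif]
    [showif [numfound]=0][404soft][/showif]
  [/search]
[/showif]

[!] ...hundreds more rules; a request that matches nothing at all would end in [404hard] [/!]

Because each function hands the complete response, status line and headers included, to [returnraw], whatever rule fires is exactly what the client receives.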

    
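Brian mentions that the per-IP count resets at midnight. One way to do that, assuming a scheduled WebDNA trigger pointed at a small maintenance template around midnight; the field names follow the visitors.db header above, everything else is an assumption.

[!] Hypothetical nightly reset template: zero the counter on every record used today. [/!]
[search db=visitors.db&nepagecountdatarq=0]
  [founditems]
    [replace db=visitors.db&eqVIDdatarq=[VID]]pagecount=0[/replace]
  [/founditems]
[/search]
[flushdatabases]  [!] write the changed database back to disk [/!]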

