Re: [BULK] Re: [WebDNA] set HTTP-Status Code from webdna
This WebDNA talk-list message is from 2016. It keeps the original formatting.
Chris,

I'm happy to have people bookmark or link to my websites. Some people, however, use an actual software bot (spider), or a human bot (dozens of people in Russia, China, or India), to go page by page and copy/paste info off my webpages into their own product databases. (You can tell the difference by the frequency with which an IP address requests pages: software is consistent, humans aren't.)

That's not to say there aren't some great uses for the session tag, but what I have is working, and I'm not sure how the [session] tag would improve it.

-Brian B. Burton

> On Mar 24, 2016, at 12:50 PM, christophe.billiottet@webdna.us wrote:
>
> What about using [referrer] to allow your customers to navigate your website but disallow bookmarking and outside links? You could also use [session] to limit the navigation to X minutes or Y pages, even for bots, then "kick" the visitor out.
>
> - chris
>
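For reference, a minimal sketch of the [referrer] gate chris suggests, assuming the standard [showif]/[hideif], [referrer], and [redirect] tags; example.com is a stand-in for the real domain, and ^ in a comparison means "contains":

[!] Hypothetical referrer gate: serve the page only when the visitor
    arrived from somewhere on example.com, otherwise send them home.
    content.tpl is an invented include name. [/!]
[showif [referrer]^example.com]
[include file=content.tpl]
[/showif]
[hideif [referrer]^example.com]
[redirect http://www.example.com/]
[/hideif]

Note that the referrer header is client-supplied, so a determined scraper can forge it; that is presumably why Brian counts requests per IP instead.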
>> On Mar 24, 2016, at 20:30, Brian Burton wrote:
>>
>> Backstory: the site in question is a replacement-parts business and has hundreds of thousands of pages of cross-reference material, all stored in databases and generated as needed. Competitors, and dealers that carry competitors' brand parts, seem to think that copying our cross-reference is easier than creating their own (it would be), so code was written to block this.
>>
>> YES, I KNOW that if they are determined, they will find a way around my blockades (I've seen quite a few variations on this: Tor, AWS, other VPNs…)
>>
>> Solution: looking at the stats for the average use of the website, we found that 95% of the site traffic visited 14 pages or fewer. So…
>> I have a visitors.db. The system logs all page requests, tracked by IP address, and after a set amount (more than 14 pages, but still a pretty low number) starts showing visitors a nice "Page Limit Exceeded" page instead of what they were crawling through. After an unreasonable number of pages I just 404 them out to save server time and bandwidth. The count resets at midnight, because I'm far too lazy to track 24 hours since the first or last page request (per IP). In some cases, when I'm feeling particularly mischievous, once a bot is detected I start feeding them fake info :D
>>
>> Here's the visitors.db header (not sure if it will help, but it is what it is):
>> VID  IPadd  ipperm  ipname  visitdate  pagecount  starttime  endtime  domain  firstpage  lastpage  browtype  lastsku  partner  linkin  page9  page8  page7  page6  page5  page4  page3  page2  page1
>>
>> All the code that does the tracking and counting and map/reduction to store stats and stuff is proprietary (sorry), but I'll see what (if anything) I can share a bit later, and try to write it up as a blog post or something.
>>
>> -Brian B. Burton
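Brian's actual code is proprietary, but a minimal sketch of the per-IP, per-day counter his description implies could look like the following. Only the IPadd, visitdate, and pagecount field names come from the visitors.db header above; the 20-page threshold and pagelimit.tpl include are invented:

[!] Count today's requests from this IP; keying each record on today's
    date gives the midnight reset for free. Threshold and pagelimit.tpl
    are hypothetical, and the remaining visitors.db fields are omitted. [/!]
[search db=visitors.db&eqIPADDdatarq=[ipaddress]&eqVISITDATEdatarq=[date]]
[showif [numfound]=0]
[append db=visitors.db]IPadd=[ipaddress]&visitdate=[date]&pagecount=1[/append]
[/showif]
[founditems]
[showif [pagecount]>20]
[include file=pagelimit.tpl]
[/showif]
[replace db=visitors.db&eqIPADDdatarq=[ipaddress]&eqVISITDATEdatarq=[date]]pagecount=[math][pagecount]+1[/math][/replace]
[/founditems]
[/search]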
>>
>>> On Mar 24, 2016, at 11:41 AM, Jym Duane wrote:
>>>
>>> Curious how to determine... non-Google/Bing/Yahoo bots and others attempting to crawl/copy the entire site?
>>>
>>> On 3/24/2016 9:28 AM, Brian Burton wrote:
>>>> Noah,
>>>>
>>>> Similar to you, and wanting to use pretty URLs, I built something similar, but did it a different way.
>>>> _All_ page requests are caught by a URL-rewrite rule and get sent to dispatch.tpl.
>>>> dispatch.tpl has hundreds of rules that decide what page to show, and uses includes to do it.
>>>> (This keeps everything in-house to WebDNA, so I don't have to go mucking about in WebDNA here, and Apache there, and Linux somewhere else, etc…)
>>>>
>>>> Three special circumstances came up that needed special code to send out proper HTTP status codes:
>>>>
>>>> [function name=301public]
>>>> [text]eol=[unurl]%0D%0A[/unurl][/text]
>>>> [returnraw]HTTP/1.1 301 Moved Permanently[eol]Location: http://www.example.com[link][eol][eol][/returnraw]
>>>> [/function]
>>>>
>>>> [function name=404hard]
>>>> [text]eol=[unurl]%0D%0A[/unurl][/text]
>>>> [returnraw]HTTP/1.0 404 Not Found[eol]Status: 404 Not Found[eol]Content-type: text/html[eol][eol]<html><body><h1>404 Not Found</h1>[eol]The page that you have requested ([thisurl]) could not be found.[eol]</body></html>[eol][/returnraw]
>>>> [/function]
>>>>
>>>> [function name=404soft]
>>>> [text]eol=[unurl]%0D%0A[/unurl][/text]
>>>> [returnraw]HTTP/1.0 404 Not Found[eol]Status: 404 Not Found[eol]Content-type: text/html[eol][eol][include file=/404pretty.tpl][/returnraw]
>>>> [/function]
>>>>
>>>> Hope this helps
>>>> -Brian B. Burton
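To make the pattern concrete, here is a sketch of how a dispatch.tpl rule chain might invoke these functions. The /oldcatalog and /parts/ paths and the parts.tpl include are invented for illustration; [301public link=...] relies on WebDNA passing named parameters such as [link] into the [function] body:

[!] Hypothetical dispatch fragment: 301 a retired URL to its new home,
    serve a known section, and soft-404 anything unrecognized. [/!]
[showif [thisurl]^/oldcatalog]
[301public link=/catalog.html]
[/showif]
[showif [thisurl]^/parts/]
[include file=parts.tpl]
[/showif]
[hideif [thisurl]^/oldcatalog]
[hideif [thisurl]^/parts/]
[404soft]
[/hideif]
[/hideif]

The hard/soft split mirrors the post: [404hard] returns a bare error page to cut bandwidth for abusive crawlers, while [404soft] spends an include on a friendly page for ordinary visitors.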