What is WebDNA

WebDNA is a scripting language and database system designed for building web applications quickly.

WebDNA and BioType

BioType is a biometric keystroke-dynamics service. It will be part of WebDNA 8.5.

Download WebDNA

Download the WebDNA freeware, try it, and register later if you want.

WebDNA resources

The list of all WebDNA instructions.
Re: [BULK] Re: [WebDNA] set HTTP-Status Code from webdna

This WebDNA talk-list message is from 2016. It keeps the original formatting.
From: Tom Duke
Hi all,

Thought I would add my approach to 'pretty' urls using mod_rewrite rather than routing through an error document.

Basically everything except images and folders/files that I specify are routed to 'parser.tmpl'. That template then parses the URL and you can then search databases, include files etc.

Here's a sample htaccess file with all the mod_rewrite stuff and some other things that people might find useful.

- Tom

PS. This is a great resource on what can be done using the htaccess file:
https://github.com/h5bp/html5-boilerplate/blob/master/dist/.htaccess
# Better website experience for IE
Header set X-UA-Compatible "IE=edge"
<FilesMatch "\.(appcache|crx|css|eot|gif|htc|ico|jpe?g|js|m4a|m4v|manifest|mp4|oex|oga|ogg|ogv|otf|pdf|png|safariextz|svgz?|ttf|vcf|webapp|webm|webp|woff|xml|xpi)$">
  Header unset X-UA-Compatible
</FilesMatch>

DirectoryIndex index.html index.tmpl

# Proper MIME types for all files
AddType application/javascript         js
AddType application/json               json
AddType video/mp4                      mp4 m4v f4v f4p
AddType video/x-flv                    flv
AddType application/font-woff          woff
AddType application/vnd.ms-fontobject  eot
AddType application/x-font-ttf         ttc ttf
AddType font/opentype                  otf
AddType image/svg+xml                  svg svgz
AddEncoding gzip                       svgz
AddType application/x-shockwave-flash  swf
AddType application/xml                atom rdf rss xml
AddType image/x-icon                   ico
AddType text/vtt                       vtt
AddType text/x-component               htc
AddType text/x-vcard                   vcf
AddType text/csv                       csv

# UTF-8 encoding
AddDefaultCharset utf-8
AddCharset utf-8 .atom .css .js .json .rss .vtt .webapp .xml

# Security - Block access to directories without a default document
Options -Indexes

# Block access to backup and source files
<FilesMatch "(^#.*#|\.(bak|config|dist|fla|inc|ini|log|psd|sh|sql|sw[op])|~)$">
  Order allow,deny
  Deny from all
  Satisfy All
</FilesMatch>

# Rewrite engine
RewriteEngine On

# Redirect to Main 'www' Domain
RewriteCond %{HTTP_HOST} ^yourdomain\.com [NC]
RewriteRule ^(.*)$ http://www.yourdomain.com/$1 [R=301,NC,L]

# Exclude these directories and files from rewrite
RewriteRule ^(admin|otherdirectories|parser\.tmpl|robots\.txt)($|/) - [L]

# Exclude images from rewrite
RewriteCond %{REQUEST_URI} !\.(gif|jp?g|png|css|ico) [NC]

# Route everything else through parser.tmpl
RewriteRule . /parser.tmpl?requestedurl=%{REQUEST_URI}&query=%{QUERY_STRING}&serverport=%{SERVER_PORT} [L]
==============================================
Digital Revolutionaries
1st Floor, Castleriver House
14-15 Parliament Street
Temple Bar, Dublin 2
Ireland
----------------------------------------------
[t]: + 353 1 4403907
[e]: <mailto:tom@revolutionaries.ie>
[w]: <http://www.revolutionaries.ie/>
==============================================
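Tom's parser.tmpl itself is not shown in the thread. As a rough, minimal sketch of the idea, a dispatcher along those lines could look like the following, where pages.db (with path and template fields), home.inc and the reuse of 404pretty.tpl are hypothetical names invented for the example; the tag usage follows the WebDNA examples quoted later in this message:

[!] parser.tmpl (sketch): requestedurl, query and serverport arrive from the RewriteRule above [/!]
[showif [requestedurl]=/]
  [include file=home.inc]
[/showif]
[hideif [requestedurl]=/]
  [!] look the pretty URL up in a database that maps paths to templates [/!]
  [search db=pages.db&eqpathdatarq=[requestedurl]]
    [founditems][include file=[template]][/founditems]
    [showif [numfound]=0][include file=404pretty.tpl][/showif]
  [/search]
[/hideif]

Keeping the path-to-template mapping in a .db file means new pretty URLs can be added without touching the htaccess rules.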

On 24 March 2016 at 17:50, <christophe.billiottet@webdna.us> wrote:

What about using [referrer] to allow your customers to navigate your website but disallow bookmarking and outside links? You could also use [session] to limit the navigation to X minutes or Y pages, even for bots, then "kick" the visitor out.

- chris
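A bare-bones sketch of the [referrer] idea might look like this; www.example.com and start.tmpl are placeholders, and the X-minutes/Y-pages limit chris mentions would still need [session] or a database on top of it:

[!] bounce requests whose referrer is not one of our own pages (the entry page itself would need to be excluded from this check) [/!]
[hideif [referrer]^www.example.com]
  [redirect http://www.example.com/start.tmpl]
[/hideif]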




> On Mar 24, 2016, at 20:30, Brian Burton <brian@burtons.com> wrote:
>
> Backstory: the site in question is a replacement-parts business and has hundreds of thousands of pages of cross-reference material, all stored in databases and generated as needed. Competitors and dealers that carry competitors' brand parts seem to think that copying our cross reference is easier than creating their own (it would be), so code was written to block this.
>
> YES, I KNOW that if they are determined, they will find a way around my blockades (I've seen quite a few variations on this: tor, AWS, other VPNs...)
>
> Solution: looking at the stats for the average use of the website, we found that 95% of the site traffic visited 14 pages or less. So...
> I have a visitors.db. The system logs all page requests tracked by IP address, and after a set amount (more than 14 pages, but still a pretty low number) starts showing visitors a nice Page Limit Exceeded page instead of what they were crawling through. After an unreasonable number of pages I just 404 them out to save server time and bandwidth. The count resets at midnight, because I'm far too lazy to track 24 hours since the first or last page request (per IP.) In some cases, when I'm feeling particularly mischievous, once a bot is detected I start feeding them fake info :D
>
> Here's the Visitors.db header: (not sure if it will help, but it is what it is)
> VID  IPadd  ipperm  ipname  visitdate  pagecount  starttime  endtime  domain  firstpage  lastpage  browtype  lastsku  partner  linkin  page9  page8  page7  page6  page5  page4  page3  page2  page1
>
>
> All the code that does the tracking and counting and map/reduction to store stats and stuff is proprietary (sorry) but I'll see what (if anything) I can share a bit later, and try to write it up as a blog post or something.
>
> -Brian B. Burton
>
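Brian's actual tracking code is proprietary, but a stripped-down sketch of the per-IP page counter he describes could look roughly like this, reusing the IPadd, pagecount and visitdate fields from the Visitors.db header above; the threshold of 20 and blocked.inc are invented for the example, and the midnight reset is left out:

[!] count this request against the visitor's IP (sketch) [/!]
[search db=visitors.db&eqIPadddatarq=[ipaddress]]
  [showif [numfound]=0]
    [append db=visitors.db]IPadd=[ipaddress]&pagecount=1&visitdate=[date][/append]
  [/showif]
  [founditems]
    [replace db=visitors.db&eqIPadddatarq=[ipaddress]]pagecount=[math][pagecount]+1[/math][/replace]
    [!] already past the limit before this hit: show the Page Limit Exceeded page instead of the requested one [/!]
    [showif [pagecount]>20][include file=blocked.inc][/showif]
  [/founditems]
[/search]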
>> On Mar 24, 2016, at 11:41 AM, Jym Duane <jym@purposemedia.com> wrote:
>>
>> curious how to determine...non google/bing/yahoo bots and other attempting to crawl/copy the entire site?
>>
>>
>>
>> On 3/24/2016 9:28 AM, Brian Burton wrote:
>>> Noah,
>>>
>>> Similar to you, and wanting to use pretty URLs I built something similar, but did it a different way.
>>> _All_ page requests are caught by a url-rewrite rule and get sent to dispatch.tpl
>>> Dispatch.tpl has hundreds of rules that decide what page to show, and uses includes to do it.
>>> (this keeps everything in-house to webdna so i don't have to go mucking about in webdna here, and apache there, and linux somewhere else, and etc...)
>>>
>>> Three special circumstances came up that needed special code to send out proper HTTP status codes:
>>>
>>> <!-- for page URLs that have permanently moved (webdna sends out a 302 temporarily moved code on a redirect) -->
>>> [function name=301public]
>>> [text]eol=[unurl]%0D%0A[/unurl][/text]
>>> [returnraw]HTTP/1.1 301 Moved Permanently[eol]Location: http://www.example.com[link][eol][eol][/returnraw]
>>> [/function]
>>>
>>> <!-- I send this to non google/bing/yahoo bots and others attempting to crawl/copy the entire site -->
>>> [function name=404hard]
>>> [text]eol=[unurl]%0D%0A[/unurl][/text]
>>> [returnraw]HTTP/1.0 404 Not Found[eol]Status: 404 Not Found[eol]Content-type: text/html[eol][eol]<html>[eol]<body>[eol]<h1>404 Not Found</h1>[eol]The page that you have requested ([thisurl]) could not be found.[eol]</body>[eol]</html>[/returnraw]
>>> [/function]
>>>
>>> <!-- and finally a pretty 404 page for humans -->
>>> [function name=404soft]
>>> [text]eol=[unurl]%0D%0A[/unurl][/text]
>>> [returnraw]HTTP/1.0 404 Not Found[eol]Status: 404 Not Found[eol]Content-type: text/html[eol][eol][include file=/404pretty.tpl][/returnraw]
>>> [/function]
>>>
>>> Hope this helps
>>> -Brian B. Burton
>
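Since these are ordinary [function] definitions, dispatch.tpl can call them like tags and pass parameters by name. A hypothetical pair of rules, with the /old-catalog/ path and the page-count test invented for the example, might be:

[!] dispatch.tpl (sketch): a URL that moved permanently [/!]
[showif [thisurl]^/old-catalog/]
  [301public link=/catalog/]
[/showif]
[!] suspected scraper, as described earlier in the thread: hard 404 [/!]
[showif [pagecount]>200]
  [404hard]
[/showif]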

---------------------------------------------------------
This message is sent to you because you are subscribed to
the mailing list <talk@webdna.us>.
To unsubscribe, E-mail to: <talk-leave@webdna.us>
archives: http://mail.webdna.us/list/talk@webdna.us
Bug Reporting: support@webdna.us
---------------------------------------------------------


Top Articles:

Talk List

The WebDNA community talk-list is the best place to get help: several hundred highly proficient programmers with excellent knowledge of WebDNA and a generous spirit will share all the tips and tricks you can imagine...

Related Readings:

WebCommerce: Folder organization ? (1997)
Help name our technology! I found it (1997)
WebCat2b15MacPlugin - showing [math] (1997)
Emailer (WebCat2) (1997)
[WebDNA] current thinking on architecture of mass email scripts? (2011)
WebCatalog can't find database (1997)
Date/Time format problems (1997)
ooops...WebCatalog [FoundItems] Problem - LONG - (1997)
request for string functions (1998)
Help formatting search results w/ table (1997)
Download URL & access on the fly ? (1997)
[WebDNA] Stupid question about CentOS v4 and WebDNA v6 (2008)
Part Html part WebDNA (1997)
Fun with dates (1997)
Database not found in Include (2002)
[WebDNA] WebDNA7 site randomly dropping tags (2011)
WCS Newbie question (1997)
Orderfile context problem (1998)
[WebDNA] Emailer and Comcast.net (2008)
Country & Ship-to address & other fields ? (1997)