

Re: [LUG] Googlebots and Large PDFs


Tom Potts wrote:
> 
> try a robots.txt to stop them searching the things

robots.txt is rather difficult in these circumstances.
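
For reference, what Tom is suggesting would look roughly like this (the 
/pdf/ path is only an illustrative assumption on my part):

    # Sketch of a robots.txt keeping Googlebot away from the PDFs
    User-agent: Googlebot
    Disallow: /pdf/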

I could use mod_rewrite, but for that I'd want to know when and how the 
Googlebots start messing up (I suspect it is a timeout issue, and so 
depends on free bandwidth, since they manage 3 to 4MB before stopping). 
And since Googlebot has often managed to index 8MB files, blocking large 
PDFs outright would cost functionality (see the sketch below).
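
If I did resort to mod_rewrite, a rough sketch (assuming Apache with 
mod_rewrite enabled, and again an illustrative path) would be to refuse 
PDFs to the Googlebot user agent entirely:

    # Refuse any .pdf request from a Googlebot user agent
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
    RewriteRule \.pdf$ - [F]

But that also throws away the PDFs Googlebot copes with fine, which is 
the functionality cost I mean.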

Hence I hoped to find out whether others are seeing the same, so I can go 
to Google and say "these Googlebots are burning your bandwidth 
unnecessarily...", or, if I can find others who definitely don't see it, 
I might figure out what is different.

> Also consider your data format - is it necessary to have your data encrypted 
> in PDF's? 

Not my data.

I need to make the service robust against anything folks put in a PDF 
file. I don't really care if it is 13MB of random numbers; it should 
still be served over HTTP correctly and efficiently.
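
For what it's worth, a quick way to check the serving side by hand is a 
ranged request with curl (the URL below is made up):

    # Ask for the first 1MB and report the status code and bytes received
    curl -s -o /dev/null -w '%{http_code} %{size_download}\n' \
         -H 'Range: bytes=0-1048575' http://www.example.org/files/big.pdf

A 206 with the expected byte count suggests the server is behaving; it 
still doesn't show what Googlebot itself is doing, of course.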

Of course the Googlebots may be very disappointed when they fetch the 
13MB only to find it is password protected - but that doesn't account for 
what is going wrong in the layers below.


-- 
The Mailing List for the Devon & Cornwall LUG
http://mailman.dclug.org.uk/listinfo/list
FAQ: http://www.dcglug.org.uk/linux_adm/list-faq.html