
Re: [LUG] Googlebots and Large PDFs

 

On Monday 06 October 2008 10:29, Simon Waters wrote:
> Does anyone else here host fairly large (10MB+) PDF files on a web
> server? Know anyone who does?
>
> Do you ever record Googlebots going a bit "mad" for them?
>
> I see requests for the same file every 6 minutes, which are logged in
> Apache as 200. It will happen a few times for the same file; over the
> weekend one Googlebot retrieved one 13MB file 13 times at six-minute
> intervals.
>
> Doesn't happen often, but it seems very wasteful when it does. Also the
> file was retrieved (200) even though there is now a reverse proxy to
> stop things chewing up bandwidth, so it should have been served by the
> reverse proxy.
>
> With the resources it spent trying to download that one PDF, the
> Googlebot could probably have reindexed vast chunks of the Internet.
>
> Any ideas on a good place to ask?
>
> I may just get the reverse proxy to convert such requests into refresh
> requests. Not exactly HTTP compliant, but I don't suppose anyone will
> ever notice.
Try a robots.txt to stop them crawling those files.
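For example (the /pdfs/ path is just a placeholder; point it at wherever
the files actually live, and you can check the result with the robots.txt
tool in Google's Webmaster Tools):

  User-agent: Googlebot
  Disallow: /pdfs/

Google also understands wildcard patterns such as "Disallow: /*.pdf$",
though those are a Google extension rather than part of the original
robots.txt standard.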
Also consider your data format: is it necessary to have your data locked
up in PDFs at all?
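On the reverse-proxy point: the proxy can only answer repeat fetches
itself if the origin tells it the response is cacheable. A minimal
sketch, assuming Apache with mod_headers loaded (the one-day max-age is
an arbitrary choice, tune to taste):

  <FilesMatch "\.pdf$">
    Header set Cache-Control "public, max-age=86400"
  </FilesMatch>

With an explicit lifetime like that, plus the Last-Modified header Apache
sends by default for static files, the proxy should serve those repeated
requests rather than passing them back to the origin each time.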
Tom te tom te tom


-- 
The Mailing List for the Devon & Cornwall LUG
http://mailman.dclug.org.uk/listinfo/list
FAQ: http://www.dcglug.org.uk/linux_adm/list-faq.html