
[LUG] Googlebots and Large PDFs

 

Does anyone else here host fairly large (10MB+) PDF files on a web 
server? Know anyone who does?

Do you ever see Googlebot going a bit "mad" for them in your logs?

I see requests for the same file every six minutes, each logged in 
Apache as a 200. It happens a few times in a row for the same file; over 
the weekend one Googlebot retrieved the same 13MB file 13 times at 
six-minute intervals.
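
For anyone wanting to check their own logs for the same pattern, a rough 
sketch along these lines works (Python; it assumes the usual combined 
log format, and the log path is just an example):

#!/usr/bin/env python3
# Rough sketch: count 200s for .pdf files fetched by Googlebot in an
# Apache combined-format access log. The log path is just an example.
import re
from collections import Counter

LOG = "/var/log/apache2/access.log"   # substitute your own log

# combined format: ... "GET /path HTTP/1.x" 200 bytes "referer" "user-agent"
line_re = re.compile(r'"GET (\S+\.pdf) HTTP/[\d.]+" 200 .* "[^"]*" "([^"]*)"$')

counts = Counter()
with open(LOG) as fh:
    for line in fh:
        m = line_re.search(line.rstrip("\n"))
        if m and "Googlebot" in m.group(2):
            counts[m.group(1)] += 1

for path, hits in counts.most_common(10):
    print(f"{hits:5d}  {path}")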

It doesn't happen often, but it seems very wasteful when it does. The 
file was also retrieved (200) each time, even though there is now a 
reverse proxy in place to stop things chewing up bandwidth, so it should 
have been served by the reverse proxy.
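
If anyone wants to check what their proxy has to work with, the first 
thing I would look at is which caching headers the origin actually sends 
for the PDF. Something like this (Python; the URL is made up) prints the 
headers a cache cares about:

#!/usr/bin/env python3
# Sketch: HEAD the PDF and print the headers a cache would look at,
# to see whether the proxy has anything to validate or expire against.
# The URL is made up.
import urllib.request

URL = "http://www.example.org/files/big-report.pdf"

req = urllib.request.Request(URL, method="HEAD")
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.reason)
    for name in ("Content-Length", "Last-Modified", "ETag",
                 "Cache-Control", "Expires", "Age", "Via"):
        print(f"{name}: {resp.headers.get(name)}")

If Last-Modified/ETag are missing, or an Age or Via header never shows 
up, that would go some way to explaining why the full file keeps being 
passed through.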

With the resources it spent repeatedly downloading that one PDF, the 
Googlebot could probably have reindexed vast chunks of the Internet.

Any ideas on a good place to ask?

I may just get the reverse proxy to convert such requests into refresh 
requests. Not exactly HTTP compliant, but I don't suppose anyone will 
ever notice.
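
To be clearer about what I mean, here is a throwaway sketch of the idea 
(Python; it assumes "refresh" means answering 304 Not Modified when the 
same client asks for the same PDF again within a few minutes, and the 
real thing would live in the proxy rather than a standalone script):

#!/usr/bin/env python3
# Throwaway sketch: if the same client asks for the same PDF again
# within a short window, answer 304 instead of re-sending the file.
# Deliberately not HTTP compliant (the request is not conditional).
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

WINDOW = 600            # ten minutes, enough to cover six-minute repeats
recent = {}             # (client_ip, path) -> time of last full response

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        key = (self.client_address[0], self.path)
        now = time.time()
        if self.path.endswith(".pdf") and now - recent.get(key, 0) < WINDOW:
            # Same request seen very recently: claim nothing has changed.
            self.send_response(304)
            self.end_headers()
            return
        recent[key] = now
        # A real proxy would fetch/serve the file from the backend here.
        self.send_response(200)
        self.send_header("Content-Type", "application/pdf")
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), Handler).serve_forever()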
