Dave Berkeley wrote:
> For simple sites, these tools, and perhaps a bit of regex handling, will
> give you everything you want. But you will have to code it instead of
> training it.

I've done a little of this, and found HTML parsers painful (especially
when folks change how they do things). Although for a simple one-off job
I used the Perl simple HTML parser (HTML::TokeParser::Simple), which
worked okay (rough sketch at the end of this mail).

If it is structured, automatically generated content, I found some XPath
tools excellent. The XPath syntax isn't exactly friendly, so probably not
one for one-off jobs, but once you start getting your head around it, it
is like regex for HTML (again, example at the end).

http://www.stonehenge.com/merlyn/LinuxMag/col92.html

For Tom's job, working on one site, I'd just use wget to pull an entire
copy to local disk (assuming the generated pages aren't going to be too
big).

Spidering in general - to try and answer Dan's question - well, finding
and indexing content is a solved problem; there are lots of tools out
there. But if you want to understand the content, unless the pages are
highly structured and follow some defined standard, it still requires a
human being (for a while).

That said, I hit a problem with a file on Debian last week. The first
thing I did was "dpkg --search filename" to find which package it was in,
and drew a blank. The second thing I did was a Google search, and the
first hit was a search using rpmfind to find which package the file came
from - which also drew a blank. So clearly Google's search engine is
thinking like I do ;)
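In case it helps, here is a minimal, untested sketch of the
HTML::TokeParser::Simple approach - the file name is just a placeholder,
and it only pulls the href out of every link in a saved page:

  #!/usr/bin/perl
  use strict;
  use warnings;
  use HTML::TokeParser::Simple;

  # Parse a locally saved page ('page.html' is just an example name).
  my $parser = HTML::TokeParser::Simple->new( file => 'page.html' );

  while ( my $token = $parser->get_token ) {
      next unless $token->is_start_tag('a');    # only look at <a ...> tags
      my $href = $token->get_attr('href');
      print "$href\n" if defined $href;
  }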
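For the XPath side, one tool that works for me is XML::LibXML's HTML
parser (other XPath tools will do much the same). The file name, the
class="results" marker and the XPath expression below are all made up -
you'd adjust them to whatever the site's markup actually looks like:

  #!/usr/bin/perl
  use strict;
  use warnings;
  use XML::LibXML;

  # recover => 1 lets it cope with the usual tag soup found in the wild.
  my $dom = XML::LibXML->load_html( location => 'page.html', recover => 1 );

  # Print the text of every cell in tables marked class="results".
  for my $cell ( $dom->findnodes('//table[@class="results"]//td') ) {
      print $cell->textContent, "\n";
  }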
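And for the wget approach, something along the lines of (URL obviously
just an example):

  wget --mirror --convert-links --page-requisites --no-parent http://www.example.com/

--mirror does the recursive fetch, --convert-links rewrites the links so
the local copy browses properly, --page-requisites grabs the images and
stylesheets, and --no-parent stops it wandering up out of the directory
you started in.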