Re: [LUG] Trainable web site picker...

 

Dave Berkeley wrote:
>
> For simple sites, these tools, and perhaps a bit of regex handling, will
> give you everything you want. But you will have to code it instead of
> training it.

I've done a little of this and found HTML parsers painful (especially
when sites change how they do things). That said, for a simple one-off
job I used the Perl module HTML::TokeParser::Simple, which worked okay.

For structured, automatically generated content, I found the XPath
tools excellent. The XPath syntax isn't exactly friendly, so it's
probably not the thing for one-off jobs, but once you start getting
your head around it, it is like regex for HTML.

http://www.stonehenge.com/merlyn/LinuxMag/col92.html
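
Something along these lines shows the idea (a sketch only, assuming
XML::LibXML and a made-up "listings" table - the column above may use
different modules):

  #!/usr/bin/perl
  use strict;
  use warnings;
  use XML::LibXML;

  # Parse a saved page in HTML mode and pull the second column out of
  # a hypothetical listings table with a single XPath expression.
  my $parser = XML::LibXML->new();
  $parser->recover(1);    # real-world HTML is rarely clean
  my $doc = $parser->parse_html_file('page.html');

  for my $cell ( $doc->findnodes('//table[@class="listings"]//tr/td[2]') ) {
      print $cell->textContent, "\n";
  }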

For Tom's job, working on a single site, I'd just use wget to pull an
entire copy down to local disk (assuming the generated pages don't add
up to too much).
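
Something like this is what I have in mind (the URL is a placeholder;
check the man page for the exact options you want):

  wget --mirror --convert-links --page-requisites --no-parent \
       http://www.example.com/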

As for spidering in general - to try and answer Dan's question -
finding and indexing content is a solved problem, with lots of tools
out there. But if you want to understand the content, then unless the
pages are highly structured and follow some defined standard, it still
requires a human being (for a while yet).

That said, I hit a problem with a file on Debian last week. The first
thing I did was run "dpkg --search filename" to find which package it
came from, and drew a blank. The second thing I did was search Google,
and the first hit was someone using rpmfind to find which package the
file came from - which also drew a blank. So clearly Google's search
engine thinks like I do ;)
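
(For reference, that first step was just the below, with "filename"
standing in for the real path; apt-file is the usual fallback when the
file doesn't belong to anything installed, though it wasn't the route I
ended up taking:)

  # which installed package owns the file?
  dpkg --search filename

  # search the whole archive, installed or not (needs apt-file update first)
  apt-file search filename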

-- 
The Mailing List for the Devon & Cornwall LUG
http://mailman.dclug.org.uk/listinfo/list
FAQ: http://www.dcglug.org.uk/linux_adm/list-faq.html