D&C GLug - Home Page


Re: [LUG] Trainable web site picker...


I've done a bit of website "screen scraping". It can be difficult, depending on cookies, JavaScript, etc., but for simple sites you can parse and traverse the pages quite quickly.
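
The cookie side of that is mostly plumbing. The post's era would use urllib2; on Python 3 the same pieces live in urllib.request and http.cookiejar. A minimal sketch, with no request actually made (the commented-out URL is a placeholder, not a real endpoint):

```python
import http.cookiejar
import urllib.request

# A CookieJar remembers cookies the server sets, so later requests in the
# same session send them back -- often the difference between getting a
# login/redirect page and the page you actually wanted.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
opener.addheaders = [("User-Agent", "Mozilla/5.0")]  # some sites refuse the default agent

# html = opener.open("http://example.com/fares").read()  # a real fetch would go here
print(len(jar))  # nothing fetched yet, so the jar is empty
```

Every page opened through that one opener shares the jar, which is what makes multi-step sessions (search form, then results page) workable.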

I developed a set of tools for fetching train ticket prices to allow you to break a journey down into single stages. You can often save 40% of the normal price using this technique.

The tools we used were Python with urllib and urllib2; the HTML parser was BeautifulSoup, which is really easy to use.

http://www.crummy.com/software/BeautifulSoup/

Combined with the ElementTree XML library, it gives you ElementSoup:

http://effbot.org/zone/element-soup.htm
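
The ElementTree half of that combination ships with Python, so the traversal style can be shown without any third-party install. The table below is invented, well-formed markup; real pages usually aren't well-formed, which is where BeautifulSoup/ElementSoup earn their keep:

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed page standing in for a real fares site (made up for illustration).
page = """
<html>
  <body>
    <table>
      <tr><td class="fare">London-Exeter</td><td>42.00</td></tr>
      <tr><td class="fare">Exeter-Penzance</td><td>18.50</td></tr>
    </table>
  </body>
</html>
"""

root = ET.fromstring(page)
# iter() walks the whole tree; get() reads attributes off each element.
fares = [td.text for td in root.iter("td") if td.get("class") == "fare"]
print(fares)  # ['London-Exeter', 'Exeter-Penzance']
```

With ElementSoup the fromstring() step is replaced by feeding messy real-world HTML through BeautifulSoup first; the traversal afterwards looks the same.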

For simple sites, these tools, and perhaps a bit of regex handling, will give you everything you want. But you will have to code it instead of training it.
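
As a self-contained sketch of that fetch-and-parse loop: the stdlib html.parser stands in for BeautifulSoup here so it runs without installing anything, and the page and link paths are invented. With BeautifulSoup the whole class below collapses to something like soup.find_all('a'):

```python
from html.parser import HTMLParser

# In the post's setup this HTML would come from urllib2 (urllib.request on
# Python 3), e.g. html = urlopen(url).read(). A canned page stands in here.
html = """
<html><body>
  <a href="/composer/bach">Bach</a>
  <a href="/composer/holst">Holst</a>
  <a href="/about">About this archive</a>
</body></html>
"""

class LinkCollector(HTMLParser):
    """Collect href attributes from <a> tags as the parser streams through the page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

parser = LinkCollector()
parser.feed(html)
# Keep only the links a spider should follow -- here, the composer pages.
composers = [href for href in parser.links if href.startswith("/composer/")]
print(composers)  # ['/composer/bach', '/composer/holst']
```

That startswith() filter is the "coding instead of training" part: each site needs its own rule for which links matter, and it breaks when the site rearranges its URLs.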

D

On Sunday 16 November 2008 17:21:03 Chronoppolis wrote:

> Hello,

>

> This may be my very first post (I don't remember, but yay for me). I have

> not got my Linux projects to the point where I can ask concise questions and

> so have just enjoyed the emails as a source of great interest. At some

> point I will post the various projects I am pursuing and the issues I face, but

> not today.

>

> This last post of yours, Tom, particularly caught my eye, as I have a very

> complicated project and a spider that would hunt through various

> supermarkets' websites for me would be unbelievably helpful; I would

> certainly be very interested in any further information you have about this

> or how one would go about it.

>

> I am a newbie programmer, teaching myself with a couple of

> friends as mentors, so this will be a very newbie question. What are the

> components necessary to create a spider program? Is it something that has

> to be made for each site individually? If the website in question updates,

> will this stop the spider from working? I have other questions, but those

> are the basic ones

>

> Dan

>

> On Sun, Nov 16, 2008 at 9:17 AM, Tom Potts <tompotts@xxxxxxxxxxxxxxxxxxxx>wrote:

> > I've just been playing with Audiveris, which is a well cool (showing my age

> > here)

> > Java app that takes a sheet music image and converts it to MIDI or

> > MusicXML, so someone like me who can't seem to learn to read sheet music

> > can play scores.

> > There are quite a few archives out there with out of copyright material

> > available and I'd like to try converting a lot to MusicXML.

> > I'd like to automate the downloading of the images but get rid of the

> > detritus.

> > I want a trainable spider that I can show the 'root' page of the

> > collection,

> > click on a table or ddl and set that as the repeat action, then go down

> > to another level and get to (say composer) level, make a local directory,

> > then click to a song, make a local directory, drill down and get the

> > associated image(s), return to composer, get next song, back to root, get

> > next part of collection.......

> > It occurred to me something like this might also be useful for pulling

> > prices

> > from supermarket web sites for a comparison site as they seem to change

> > their

> > arrangements to try and make this difficult - 'Competition? We love it, we

> > just

> > do everything we can to stop it...'

> >

> > Tom te tom te tom

> >

> >


-- 
The Mailing List for the Devon & Cornwall LUG
http://mailman.dclug.org.uk/listinfo/list
FAQ: http://www.dcglug.org.uk/linux_adm/list-faq.html