Devon & Cornwall Linux Users' Group


Re: [LUG] Python and remote files



On Fri, Feb 21, 2003 at 10:18:17 +0000, Neil Williams wrote:
> How do I open a remote HTTP file using Perl?
> The ordinary open(HANDLE,"$file"); doesn't work ( -r $file ) reports unable to 
> read.
> I want to load a page from a separate server (actually execute a search and 
> read the results using the query string using a URL from a bookmark) and read 
> the file into the script for analysis. I can't open static pages at the 
> moment, but I would expect that opening a dynamic page wouldn't differ once 
> the process is done.
> Anyone with ideas?

use python; :)

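urllib in Python makes this very easy. urllib.urlretrieve() downloads
the page to a temporary file (under /tmp on most systems) and returns a
(filename, headers) tuple, so you can open the file and play with it.
Here is a little script that demonstrates this. It goes on to search the
page for all full URLs with a regular expression, and it finds: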
http://www.w3.org/TR/REC-html40/loose.dtd
http://devoncornwall.pm.org/
http://www.southwestlug.uklinux.net/
http://www.lug.org.uk

I have been doing this sort of thing very recently: I wrote a blog
plugin that retrieves images from image URLs and generates thumbnails
for them. See:
http://db.cs.helsinki.fi/~hendry/log/


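#!/usr/bin/env python2
import urllib, re

url = 'http://www.dclug.org.uk/'
# matches full http, https, file and ftp URLs, including any query string
fileurlpattern = r'(?:http|https|file|ftp)\:+[\/\-\_\.\w]+[\/\w][\?\&\+\=\%\w\/\-\_\.]*'

# urlretrieve() returns (filename, headers); open the downloaded copy
f = open(urllib.urlretrieve(url)[0])

s = f.read() # read contents of file into string
f.close()

# print every URL found in the page
for i in re.finditer(fileurlpattern, s):
        print i.group()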
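
For anyone on Python 3, where urllib.urlretrieve and the print statement
no longer work as written above, here is a minimal sketch of the same
idea. It is not the original script: find_urls and fetch are names made
up for illustration, urllib.request.urlopen stands in for urlretrieve
(so nothing is written to /tmp), and decoding the page as UTF-8 is an
assumption about the server.

```python
import re
import urllib.request

# Same pattern as the script above, minus the redundant backslashes.
URL_PATTERN = re.compile(
    r'(?:http|https|file|ftp):+[/\-_.\w]+[/\w][?&+=%\w/\-_.]*'
)


def find_urls(text):
    """Return every full URL found in text, in order of appearance."""
    return [match.group() for match in URL_PATTERN.finditer(text)]


def fetch(url):
    """Download url and return its body as a string.

    UTF-8 is assumed; undecodable bytes are replaced rather than raising.
    """
    with urllib.request.urlopen(url) as response:
        return response.read().decode('utf-8', errors='replace')
```

Calling find_urls(fetch('http://www.dclug.org.uk/')) and printing each
item would mirror what the script does. Keeping the search separate from
the download means the pattern can be tried on any string without
touching the network.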


--
The Mailing List for the Devon & Cornwall LUG
Mail majordomo@xxxxxxxxxxxx with "unsubscribe list" in the
message body to unsubscribe.

