
[LUG] bash volunteer

 

I'm looking for a script writer who can organise a cron job on their own 
system to update one or more isbnsearch portals on a regular (twice-monthly) 
basis.
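
For the scheduling side, something like the crontab entry below would cover the twice-monthly part - the times and the script path are only placeholders:

    # min  hour  day-of-month  month  day-of-week  command
    # 03:15 on the 1st and 15th of each month (illustrative only)
    15 3 1,15 * * /home/volunteer/bin/isbnsearch-update.sh >> /home/volunteer/isbnsearch-update.log 2>&1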

The script would start at a portal site by requesting a predictable query 
URL. If that URL produces a result, the resulting page will contain a 
predictable link to a second remote site - the indexserver - which collects 
the results and returns another page. Another predictable link returns 
control to the portal, which looks for the next set of results. If no results 
are found, the portal produces a page that links to the next query.

Most of it can probably be done with wget and some pattern matching, but 
it's not quite that simple.
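
As a very rough sketch of the shape I have in mind (every URL and pattern 
below is made up, and the real link extraction will need tighter pattern 
matching than a bare grep):

    #!/bin/bash
    # Sketch only: start at the portal's first query URL and keep following
    # the predictable link on each returned page until a page has no link.
    url="http://portal.example.org/query?start=1"    # placeholder, not the real URL
    while [ -n "$url" ]; do
        page=$(wget -q -O - "$url")
        # pull out the next predictable link (portal -> indexserver -> portal ...)
        url=$(printf '%s\n' "$page" | grep -o 'href="[^"]*"' | head -n 1 | cut -d'"' -f2)
        sleep 5    # allow for the usual HTTP delay and don't hammer either server
    done

The awkward part is exactly what that grep glosses over: telling a "results 
found" page from a "no results, here's the next query" page and choosing the 
right link on each.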

For each pair of servers, each query handles 1,000 entries (to keep within 
server timeout limits), and there are currently over 37,000 entries on the 
indexserver to try to match - so a full run is roughly 38 queries. Hits vary 
from 0 to 40 per 1,000, with a median of 6.

There is the usual HTTP delay in getting each page.

I won't publish the URLs here; it would only cause unwanted Google search 
results on a conditional query.

I'm currently doing these updates via a browser. The HTML produced for the 
browser is entirely predictable and should be suitable for pattern matching. 
Changes to the pages to get a script to work are not practicable. 

The PHP scripts on the servers will deal with all the actual data; all that's 
needed is an intelligent, remote script to call the correct URLs, depending 
on the results for each set of 1,000, and keep going until the portal returns 
a page without a link (indicating the end of the run).

Each run is different, and calling the URLs *permanently* updates the data at 
both ends of the query, so simply repeating the URL will NOT generate the same 
result page (it's likely to generate no results at all). New URLs or new 
data files cannot be injected into the process without fully configuring both 
ends.

This makes debugging such a script *interesting*, and as such it is NOT a 
job for a bash newbie. I say bash, but another scripting language could be 
used - just don't expect me to help you debug it if you don't use bash or 
perl!

:-)

The nature of each run means that it is possible to resume a run later or 
start a run at a specific point, in case of failure.
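
One simple way to get that (just a sketch, with a made-up state file name) is 
to let the script take an optional starting URL and write out the URL it is 
currently working on:

    # Resume sketch: pass the URL to restart from as $1, otherwise start at the top.
    url="${1:-http://portal.example.org/query?start=1}"    # placeholder default
    while [ -n "$url" ]; do
        echo "$url" > last_url.txt    # record progress so a failed run can be picked up here
        page=$(wget -q -O - "$url")
        url=$(printf '%s\n' "$page" | grep -o 'href="[^"]*"' | head -n 1 | cut -d'"' -f2)
    done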

Once data has been committed, it's usually at least a week before a subsequent 
run is likely to find any result data. The more often the runs happen, the 
fewer results will be found (and the more bandwidth is wasted between the 
servers). Hence twice a month, at most.

I've chosen to stall the current run so that anyone interested can have at 
least some real data files to save to hard disc for testing purposes.
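
So if you want to test the pattern matching offline, just save each page as 
it comes back, e.g.:

    # Save a page from the stalled run, then test the link extraction on the local copy.
    wget -q -O portal-page-1.html "http://portal.example.org/query?start=1"    # placeholder URL
    grep -o 'href="[^"]*"' portal-page-1.html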

The script only deals with URLs and pattern matches on the HTML; it does *not* 
access the data itself. You can choose whatever licence you prefer - the server 
scripts themselves are GPL. If you choose the GPL, your script could be 
included in the main project for others to use, with appropriate credits.

Contact me off-list for the URLs.

-- 

Neil Williams
=============
http://www.data-freedom.org/
http://www.nosoftwarepatents.com/
http://www.linux.codehelp.co.uk/
