I'm looking for a script writer who can organise a cron job on their own system to update one or more isbnsearch portals on a regular (twice-monthly) basis.

The script would start with a portal site by requesting a predictable query URL. If that URL produces a result, the resulting page will contain a predictable link to a second remote site - the indexserver - which will collect the results and return another page. Another predictable link returns control to the portal and looks for the next set of results. If no results are found, the portal produces a page that links to the next query.

Most of it can probably be done with wget and some pattern matching, but it's not quite that simple. For each pair of servers, each query handles 1,000 entries (to keep within server timeout limits) and there are currently over 37,000 entries on the indexserver to try to match. Hits vary from 0 to 40 per 1,000, median 6. There is the usual HTTP delay in getting each page. I won't publish the URLs here; it would only cause unwanted Google search results on a conditional query.

I'm currently doing these updates via a browser. The HTML produced for the browser is entirely predictable and should be suitable for pattern matching. Changes to the pages to get a script to work are not practicable. The PHP scripts on the servers will deal with all the actual data; all that's needed is an intelligent, remote script to call the correct URLs, dependent on the results for each set of 1,000, and keep going until the portal returns a page without a link (indicating the end of the run).

Each run is different, and calling the URLs *permanently* updates the data at both ends of the query, so simply repeating a URL will NOT generate the same result page (it's likely to generate no results at all). New URLs or new data files cannot be injected into the process without fully configuring both ends. This makes debugging such a script *interesting*, and as such it is NOT a job for a bash newbie.
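A minimal sketch of the control loop, assuming wget is available and assuming (since I'm not publishing the real URLs) that the results link and the next-query link can be told apart by a substring in the URL - the 'indexserver' and 'nextquery' patterns below are placeholders, not the real links:

```shell
#!/bin/bash
# Sketch only: the 'indexserver' and 'nextquery' URL substrings are
# placeholders for the real, predictable links on the portal pages.

# Print the first href in $1 whose URL contains the substring $2.
# Prints nothing and returns non-zero if no such link is found.
extract_link() {
    local html="$1" pattern="$2" link
    link=$(printf '%s\n' "$html" \
        | grep -oE "href=\"[^\"]*${pattern}[^\"]*\"" \
        | head -n 1 \
        | sed -e 's/^href="//' -e 's/"$//')
    [ -n "$link" ] && printf '%s\n' "$link"
}

# Follow the portal -> indexserver -> portal chain until the portal
# returns a page with no further link (the end of the run).
run_sync() {
    local url="$1" page link
    while [ -n "$url" ]; do
        page=$(wget -q -O - "$url") \
            || { echo "fetch failed: $url" >&2; return 1; }
        # A results page links to the indexserver; a no-results page
        # links straight to the next query on the portal.
        if link=$(extract_link "$page" 'indexserver'); then
            url="$link"
        elif link=$(extract_link "$page" 'nextquery'); then
            url="$link"
        else
            break   # no link at all: run complete
        fi
        sleep 2     # be polite between HTTP requests
    done
}

# Called from cron, e.g.:
# run_sync 'http://portal.example/query.php?start=0'
```

Because each fetch permanently commits data, any real version of this would need logging of the last URL fetched so a failed run can be resumed at a specific point rather than restarted.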
I say bash, but another scripting language could be used - just don't expect me to help you debug it if you don't use bash or perl!!!! :-)

The nature of each run means that it is possible to resume a run later, or start a run at a specific point, in case of failure. Once data has been committed, it's usually at least a week before a subsequent run is likely to find any result data. The more often the runs, the fewer results will be found (and the more bandwidth is wasted between the servers). Hence twice a month (at most).

I've chosen to stall the current run so that anyone interested can have at least some real data files to save to hard disc for testing purposes. The script only deals with URLs and pattern matches on HTML; it does *not* access the data itself.

You can choose whatever licence you prefer - the server scripts themselves are GPL. If you choose to use the GPL, your script could be included in the main project for others to use, with appropriate credits. Contact me off-list for the URLs.

--
Neil Williams
=============
http://www.data-freedom.org/
http://www.nosoftwarepatents.com/
http://www.linux.codehelp.co.uk/