Devon & Cornwall Linux Users' Group


Re: [LUG] More scripting silliness...



On Friday 04 Jul 2003 4:26 pm, Jonathan Melhuish wrote:
> On Wednesday 02 July 2003 23:50, Neil Williams wrote:
> > Technically wrong? I'd say it was stretching the rules because:
> > 1. It uses a non-existent filesystem: It's pretending (if you read the
> > URL strictly) that there are 8 sub-directories below the .biz domain
> > whereas none probably exist with the names specified (with or without the
> > = ).
>
> Yeah, if you wish to interpret the "/" delimiter as 'directories' then it
> does; but I don't see that that should pose a problem.  The URL is

I just highlighted it as a possible problem for external parsers - you 
mentioned you thought the URL was causing problems for Google. As far as 
scripts go, if you can isolate the search query from the domain URL, you 
could replace every / with, say, # before creating the relative links and 
change them back afterwards. Perl makes this easy, so perhaps you need a 
sequence of scripts - one Perl, then sed, then a reverse of the first Perl 
script - and you could always wrap them in a bash script so it all runs as 
one command. (Perl can do this sort of thing on the command line too.)
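
Something along these lines, perhaps - only a sketch: the URL is cut down 
from your example, and the choice of # as the stand-in character is arbitrary 
(anything that can't appear in the search part would do):

#!/usr/bin/perl
# swap.pl - sketch: hide the / characters in the search part of the URL
# behind # so other tools can treat that part as one flat token, then
# restore them again afterwards.
use strict;

my $url = "http://www.smssat.biz/scan/fi=products/sp=results_big_thumb/st=db/co=yes";

my ($domain, $search) = $url =~ m{^(http://www\.smssat\.biz/)(.*)$};
$search =~ tr{/}{#};                  # forward swap: / becomes #
print "safe:     $domain$search\n";

# ... sed (or anything else) can now work on the URL without tripping over / ...

$search =~ tr{#}{/};                  # reverse swap: # becomes /
print "restored: $domain$search\n";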

> > 2. It uses non-standard repetition: It's imitating a query string and
> > then adding a real one (the xx=xx would appear to some form of
> > variable=value statement) - repetition that is likely to cause many a
> > parser to barph.
>
> Not really.  You interpreted the bit with slashes in above as a file
> location, and that seems like a fair enough conclusion.  Are you telling me
> that "=" isn't a valid character for a filename?  I had suspected that
> myself, but I can't find any evidence to support it.

Neither could I, so although I suspect it, I didn't want to state it. The way 
I see it, [a-zA-Z0-9]=[a-zA-Z0-9] is a regular expression match for the usual 
query string variable=value format, as you conclude. With my devil's advocate 
hat on, I would expect problems when an external parser (whether Google or a 
script) that knows nothing about your server configuration comes across a URL 
which, for fair enough reasons, appears to include TWO query strings - or at 
best one badly formed query string (missing the ?) and one correct one. I 
would expect many standards-compliant programs to barf at the first and 
possibly miss the second.
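
As a rough illustration of what a naive parser sees (just a sketch - the URL 
is from your example, and the pattern is loosened slightly to allow longer 
names and an empty value):

#!/usr/bin/perl
# sketch: pick out the path segments in the example URL that look like
# query variable=value assignments to a parser that knows nothing about
# the server configuration.
use strict;

my $url = "http://www.smssat.biz/scan/fi=products/sp=results_big_thumb/st=db/" .
          "co=yes/sf=category/se=OtherReceivers/va=banner_image=/" .
          "va=banner_text=.html?id=f8YyQGtr";

my ($path, $query) = split /\?/, $url, 2;
print "real query string: $query\n";
foreach my $segment (split /\//, $path) {
    print "looks like a query variable: $segment\n"
        if $segment =~ /^[a-zA-Z0-9_]+=/;
}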

> > 3. Required filedata is absent: There's no 'real' file anywhere for
> > processes like Google to grab onto - I'd presume there's some index.php
> > default.asp or similar behind it but it's not stated and therefore must
> > be assumed, which is often a bad tactic.
>
> No there isn't an "index.php", the page is dynamically generated by a Perl
> server engine and passed via the "sms.ic" linker program.  No-one "assumes"
> it's presence, nor can I see it's relevance.  You got to the URL, you get
> the page.

Are there any static pages? Index page(s)? Catalogues? Dynamic pages that have 
a static URL? URLs that look like a search result aren't going to attract 
attention from Google users when displayed inside Google's own search 
results. I get the feeling from engines like Google that URLs which look like 
search results don't show up as favourably as URLs which look like a static 
page, that's all. It's more useful to me (as a Google user) to find a static 
page (perhaps a department catalogue etc.) in the search results at Google 
and then proceed from there within your site. A URL that already contains a 
search string may look, to Google or to Google users, like "someone else's 
search", and perhaps that's why the pages aren't being indexed. A few static 
URLs may well be all that Google needs.

> > Stretching the letter of the 'rules' but breaking the spirit? Personally,
> > I wouldn't like to use an engine that relied on this type of persistence.
> >
> > I'm not surprised that it doesn't parse well with processes like Google.
> >

> <groans>  I knew I shouldn't have told you the URL ;-)

:-)

> quite a bit, isn't exactly great.  I had mistakenly assumed that any code I
> used that a 'pro' had written would be clean and standards compliant :-(

Ouch. There really isn't much in a name, especially one as over-used and 
over-played as Professional Edition or Pro Edition.

> > It would take some time to bring that page to the intended HTML4
> > Transitional standard proclaimed at the top of the page returned from
> > that
>
> You're damn right...
>
> The "?id=" bit does indeed store a unique customer number, the rest is
> stored in the database.  The rest of the URL is just the search string.  I
> don't see the problem with this approach.

Not the approach, just the way the URL uses / where other stores use 'formal' 
query strings or session ID strings. The problems you have already seen - 
creating relative links and Google not indexing the pages. From the website 
design and programming viewpoint, using / just seems to be asking for all 
sorts of horrible bugs and errors. It's one of those nasties that jumps out 
and shouts "I WILL BITE!" at me. I just know that as soon as I try to do 
something non-standard with it, it'll be right there in my face like a neon 
No Entry sign. I know it's easy in Perl to avoid using / as the delimiter in 
search patterns, but using / in the URL itself is only going to cause any 
other scripting language to descend into chaos. I immediately dislike any 
programming trait that locks me into one particular way of problem-solving - 
whether the trait is there through design or negligence. 
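
For example - just to show what I mean about Perl letting you pick the 
delimiter; the URL is the sms.ic one from your first email:

#!/usr/bin/perl
# sketch: the same substitution written with three different delimiters.
# With a delimiter other than /, a pattern full of / needs no escaping.
use strict;

my $url = "http://www.smssat.biz/sms.ic/index.html";

(my $one   = $url) =~ s/sms\.ic\///;   # classic / delimiter: the / must be escaped
(my $two   = $url) =~ s{sms\.ic/}{};   # {} delimiter: the / can stay as it is
(my $three = $url) =~ s#sms\.ic/##;    # # delimiter: likewise

print "$one\n$two\n$three\n";          # all three print http://www.smssat.biz/index.html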

Hence:
> > Is there a different engine available for the job?
>
> It's actually something I've been considering quite carefully, especially
> after having such serious performance problems.  The Interchange user group
> generally maintain that the performance is "satisfactory", so long as your
> hardware is up to it.  Which perhaps it is, but frankly new hardware is not
> an option at the moment, so I'm stuck with a 300Mhz Celeron that's just
> recently been downgraded to 128Mb.
>
> Mind you, I bet you Apache/MySQL could serve a few pages per second off
> even that lowly spec, so I don't see why there should be any excuse for
> such lame performance (<1 request/sec).

And this is a Pro product!! hehe. Sorry.

> OSCommerce in particular looks quite promising, I would be interested to
> hear if anybody has any experience with it.  It will definately be a
> serious contender if I develop another online store, but I'm not sure if I
> can justify the time and expense of completely ditching Interchange and the
> current SMS product database at this late stage.  But it's certainly
> tempting...
>
> Jon

After hearing so many horror stories of e-commerce tools, I won't be looking 
to gain any experience of them anytime soon!! 

OK. Enough protests. Let's cut to the chase.
From your first email:

Eg. you run a search and get sent to this location:
http://www.smssat.biz/scan/fi=products/sp=results_big_thumb/st=db/co=yes/sf=category/se=
OtherReceivers/va=banner_image=/va=banner_text=.html?id=f8YyQGtr

Where there is a relative link to "./index.html", but of course that now
translates to:
http://www.smssat.biz/scan/fi=products/sp=results_big_thumb/st=db/co=yes/sf=category/se=
OtherReceivers/va=banner_image=/index.html

I've forced a line break to make it readable.

The relative link you want is:
http://www.smssat.biz/index.html
Yes?

The trouble is that it has to have the "sms.ic" bit when wget spiders it, so
that it gets the live (dynamic) version, but NOT have the "sms.ic" in the
mirrored (static) version.

By "sms.ic" do you mean the search string or the actual characters?

The actual characters are easy (probably what Kai was referring to when he 
basically said RTFM):

$ cat test.pl
#!/usr/bin/perl
use strict;
my $url = "http://www.smssat.biz/sms.ic/index.html";
print "old url: $url\n";
$url =~ s/sms\.ic\///g;
print "new Url: $url\n";

$ perl test.pl
old url: http://www.smssat.biz/sms.ic/index.html
new Url: http://www.smssat.biz/index.html

To solve the first problem - getting to http://www.smssat.biz/index.html from 
the relative link ./index.html :

#!/usr/bin/perl
use strict;
################# Variable List ###############
my $url; # the search results + query string
my $match; # the search results split off from the domain
my @matches; # array of each search result element
my $content; # holds each member of @matches in turn.
my $c; # counter
############### End variable list ##############
$url =
"http://www.smssat.biz/scan/fi=products/sp=results_big_thumb/st=db/co=yes/sf=category/se=
OtherReceivers/va=banner_image=/va=banner_text=.html?id=f8YyQGtr";
print "old url: $url\n";
$url =~ s$http://www\.smssat\.biz/(.*)$http://www.smssat.biz/$g;
print "new Url: $url\n";
$match = $1;
print "match $match\n";
$c=0;
@matches = split /\//,$match;
foreach $content (@matches) {
	$c++;
	print "Content $c: $content\n";
}

Again, a line ending has been forced that isn't in the script. (Same place in 
each case.) Note the use of the $ delimiter for the first match - it saves 
escaping all the /. That's what I meant by lock-in - it's something that Perl 
can do easily but which would cause problems in other scripting languages. 
The split function needs the / itself, so the / in its pattern is escaped: \/. 
The . wildcard in the first match needs to be escaped too, to stop Perl 
treating each . as a wildcard that would match any character.
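
As a side note (not something the script above needs): if you want to avoid 
the \/ in the split as well, split will happily take a pre-compiled pattern, 
and qr lets you pick the delimiter just as s/// does:

#!/usr/bin/perl
# sketch: split with a pre-compiled qr{} pattern, so the / needs no escaping.
use strict;

my $match = "scan/fi=products/sp=results_big_thumb/st=db";
my @matches = split qr{/}, $match;
print "$_\n" foreach @matches;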

Output:

$ perl test.pl
old url: 
http://www.smssat.biz/scan/fi=products/sp=results_big_thumb/st=db/co=yes/sf=category/se=
OtherReceivers/va=banner_image=/va=banner_text=.html?id=f8YyQGtr
new Url: http://www.smssat.biz/
match scan/fi=products/sp=results_big_thumb/st=db/co=yes/sf=category/se=
OtherReceivers/va=banner_image=/va=banner_text=.html?id=f8YyQGtr
Content 1: scan
Content 2: fi=products
Content 3: sp=results_big_thumb
Content 4: st=db
Content 5: co=yes
Content 6: sf=category
Content 7: se=OtherReceivers
Content 8: va=banner_image=
Content 9: va=banner_text=.html?id=f8YyQGtr

Is that in the right direction? I'm sure you can process each content value as 
appropriate from here and create the new static URL by a similar process.

You'd then just call the Perl script as part of the copy process - in a pipe. 
The output of the live site would be piped into the input of the Perl script, 
which would transform it and output suitable static links to whatever process 
you want to use to write the files. (Perl could do the whole thing for you.)
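
Something like this, roughly - a sketch only: the regex just strips the 
sms.ic part, as in the first script above, and the file names in the example 
pipe are placeholders for whatever your mirroring process actually produces.

#!/usr/bin/perl
# staticlinks.pl - sketch of a stdin-to-stdout filter for use in the pipe.
# Reads mirrored HTML on stdin, rewrites the dynamic links into static
# ones (here it only strips "sms.ic/", as in the earlier script), and
# prints the result for the next command in the pipe to write out.
use strict;

while (my $line = <STDIN>) {
    $line =~ s{sms\.ic/}{}g;    # .../sms.ic/index.html becomes .../index.html
    print $line;
}

Called as part of the pipe, something like:

$ cat mirrored-page.html | perl staticlinks.pl > static-page.html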

-- 

Neil Williams
=============
http://www.codehelp.co.uk
http://www.dclug.org.uk

http://www.wewantbroadband.co.uk/


