
Re: [LUG] More scripting silliness...




On Friday 04 July 2003 20:52, Neil Williams wrote:
> On Friday 04 Jul 2003 4:26 pm, Jonathan Melhuish wrote:
> > On Wednesday 02 July 2003 23:50, Neil Williams wrote:
> > > Technically wrong? I'd say it was stretching the rules because:
> > > 1. It uses a non-existent filesystem: It's pretending (if you read the
> > > URL strictly) that there are 8 sub-directories below the .biz domain
> > > whereas none probably exist with the names specified (with or without
> > > the = ).
> >
> > Yeah, if you wish to interpret the "/" delimiter as 'directories' then it
> > does; but I don't see that that should pose a problem.  The URL is
>
> I just highlighted it as a possible problem with external parsers - you
> mentioned you thought the URL was causing problems for Google. As far as
> scripts go, if you are able to isolate the search query from the domain URL
> then you could replace all / with say # before creating the relative links.
> You could change back later. Perl makes this easy, so perhaps you need a
> sequence of scripts - one Perl, one sed and back with a reverse of the
> first Perl script - you could always wrap the scripts into a bash script to
> leave one command. (Perl can do things like this on the command line too.)

You may well be right.  It's difficult to find out through trial and error as 
Google doesn't seem to spider my site very regularly. :-(  But in case it 
doesn't like the "=", I've added a link to "List all products", which is a 
plain HTML page linking with reasonably plain links to reasonably plain HTML 
product information pages.  Can you see any reason why it wouldn't index 
that?
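
(On the scripting side, the protect-and-restore round-trip you describe could 
even stay in pure Perl rather than a Perl/sed/Perl chain. A rough sketch, 
assuming '#' never occurs anywhere in the URLs:)

#!/usr/bin/perl
use strict;
my $url = "scan/fi=products/sf=category/se=OtherReceivers";
$url =~ tr{/}{#};    # protect the / delimiters
# ... whatever processing must not see any / goes here ...
$url =~ tr{#}{/};    # restore them afterwards
print "$url\n";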

> > > 2. It uses non-standard repetition: It's imitating a query string and
> > > then adding a real one (the xx=xx would appear to be some form of
> > > variable=value statement) - repetition that is likely to cause many a
> > > parser to barf.
> >
> > Not really.  You interpreted the bit with the slashes in it above as a file
> > location, and that seems like a fair enough conclusion.  Are you telling
> > me that "=" isn't a valid character for a filename?  I had suspected that
> > myself, but I can't find any evidence to support it.
>
> Neither could I, so although I suspect it, I didn't want to state it. The
> way I see it is that [a-zA-Z0-9]=[a-zA-Z0-9] is a regular expression match
> for the usual query string variable=value format, as you conclude. With my
> devil's advocate hat on, I would expect problems when an external parser
> (whether Google or a script) that knows nothing about your server
> configuration comes across a URL that, for fair enough reasons, would
> appear to include TWO query strings - or at best one badly formed query
> string (missing the ?) and one correct one. I would expect many
> standards-compliant programs to barf at the first and possibly miss the second.
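
(For what it's worth, a strict RFC parser does have a well-defined answer 
here: everything after the first "?" is the query, so the /xx=xx segments all 
count as path. CPAN's URI module copes, for instance; it's the heuristic 
parsers that are likely to choke. A sketch, assuming URI is installed:)

#!/usr/bin/perl
use strict;
use URI;   # from CPAN
my $u = URI->new(
    "http://www.smssat.biz/scan/fi=products/se=OtherReceivers/va=banner_text=.html?id=f8YyQGtr");
print "path:  ", $u->path, "\n";   # all the /xx=xx segments land here
print "query: ", $u->query, "\n";  # id=f8YyQGtr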
>
> > > 3. Required filedata is absent: There's no 'real' file anywhere for
> > > processes like Google to grab onto - I'd presume there's some index.php,
> > > default.asp or similar behind it, but it's not stated and therefore must
> > > be assumed, which is often a bad tactic.
> >
> > No, there isn't an "index.php"; the page is dynamically generated by a
> > Perl server engine and passed via the "sms.ic" linker program.  No-one
> > "assumes" its presence, nor can I see its relevance.  You got to the
> > URL, you get the page.
>
> Are there any static pages? Index page(s)? Catalogues? Dynamic pages that
> have a static URL? URLs that look like a search result aren't going to
> attract attention from Google users when displayed inside Google's own
> search results. I get the feeling from engines like Google that URLs that
> look like search results don't show up as favourably as URLs that look
> like a static page, that's all. It's more useful to me (as a Google user)
> to find a static page (perhaps a department catalogue etc.) in the search
> results at Google and then proceed from there within your site. A URL that
> already contains a search string may look, to Google or to Google users,
> like "someone else's search" and perhaps that's why the pages aren't being
> indexed. A few static URLs may well be all that Google needs.

That's an interesting point, and not one that I'd previously considered.  The 
dynamic category-listing pages are indeed at a static location; the URL just 
looks like it's a search.  Which it is, but a static one, if you catch my 
drift!

As I say, I've put an actually-really static page in with links to all of the 
products, and the product information pages appear to be static HTML pages 
anyway.

> > > Stretching the letter of the 'rules' but breaking the spirit?
> > > Personally, I wouldn't like to use an engine that relied on this type
> > > of persistence.
> > >
> > > I'm not surprised that it doesn't parse well with processes like
> > > Google.
> >
> > <groans>  I knew I shouldn't have told you the URL ;-)
> >
> :-)
> > quite a bit, isn't exactly great.  I had mistakenly assumed that any code
> > I used that a 'pro' had written would be clean and standards compliant
> > :-(
>
> Ouch. There really isn't much in a name, especially one as over-used and
> over-played as Professional Edition or Pro Edition.
>
> > > It would take some time to bring that page to the intended HTML4
> > > Transitional standard proclaimed at the top of the page returned from
> > > that URL.
> >
> > You're damn right...
> >
> > The "?id=" bit does indeed store a unique customer number, the rest is
> > stored in the database.  The rest of the URL is just the search string. 
> > I don't see the problem with this approach.
>
> Not the approach, just the way the URL uses / when other stores use
> 'formal' query strings or session ID strings. The problem you have already
> seen - creating relative links and Google non-indexing. From the website
> design and programming viewpoint, using / just seems to be asking for all
> sorts of horrible bugs and errors. It's one of those nasties that just
> jumps out and shouts "I WILL BITE!" at me. I just know that as soon as I
> try and do something non-standard with it, it'll be right there in my face
> like a neon No Entry sign. I know it's easy in Perl to not use / in search
> patterns, but it's obvious that using / is only going to cause any other
> scripting language to descend into chaos. I immediately dislike any
> programming trait that locks me into one particular way of problem-solving
> - whether the trait is present through design or negligence.

Yup, I don't like it either :-(
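
(The "easy in Perl" bit is true enough, though: any punctuation can delimit a 
pattern, so slash-heavy matches need no escaping at all. A quick sketch:)

#!/usr/bin/perl
use strict;
my $path = "scan/fi=products/sf=category";
print "matched\n" if $path =~ m{^scan/fi=};  # braces as delimiters: no \/ needed
$path =~ s{/}{#}g;                           # same trick works for substitutions
print "$path\n";                             # scan#fi=products#sf=category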

> Hence:
> > > Is there a different engine available for the job?
> >
> > It's actually something I've been considering quite carefully, especially
> > after having such serious performance problems.  The Interchange user
> > group generally maintain that the performance is "satisfactory", so long
> > as your hardware is up to it.  Which perhaps it is, but frankly new
> > hardware is not an option at the moment, so I'm stuck with a 300MHz
> > Celeron that's just recently been downgraded to 128MB.
> >
> > Mind you, I bet you Apache/MySQL could serve a few pages per second off
> > even that lowly spec, so I don't see why there should be any excuse for
> > such lame performance (<1 request/sec).
>
> And this is a Pro product!! hehe. Sorry.

I remember being worried about the performance when I began developing with 
Interchange, several laptops ago, on a P166 with 96MB of RAM, but I assured 
myself that it would be much quicker once I put it on a "reasonably specced" 
server, which seems to be what the Interchange user group still maintains.  
And to be fair, it delivered okay-ish performance when it had 256MB of RAM.  
But even then it wasn't exactly serving thousands of hits per minute...

> > OSCommerce in particular looks quite promising, I would be interested to
> > hear if anybody has any experience with it.  It will definitely be a
> > serious contender if I develop another online store, but I'm not sure if
> > I can justify the time and expense of completely ditching Interchange and
> > the current SMS product database at this late stage.  But it's certainly
> > tempting...
> >
> > Jon
>
> After hearing so many horror stories of e-commerce tools, I won't be
> looking to gain any experience of them anytime soon!!

I still think it *can* be a better option than re-inventing the wheel 
yourself, but it's tricky to suss out the pros and cons of different 
ecommerce suites without actually using them.  And by the time you've 
developed your store, tested it a bit, let your client loose on it, accepted 
some orders... it's too late to switch!  :-(

> OK. Enough protests. Let's cut to the chase.
> From your first email:
>
> E.g. you run a search and get sent to this location:
> http://www.smssat.biz/scan/fi=products/sp=results_big_thumb/st=db/co=yes/sf
> =category/se=OtherReceivers/va=banner_image=/va=banner_text=.html?id=f8YyQGtr
>
> Where there is a relative link to "./index.html", but of course that now
> translates to:
> http://www.smssat.biz/scan/fi=products/sp=results_big_thumb/st=db/co=yes/sf
> =category/se=OtherReceivers/va=banner_image=/index.html
>
> I've forced a line break to make it readable.
>
> The relative link you want is:
> http://www.smssat.biz/index.html
> Yes?
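
(That resolution can be checked mechanically, incidentally; a sketch using 
CPAN's URI module, assuming it is installed:)

#!/usr/bin/perl
use strict;
use URI;
my $base = "http://www.smssat.biz/scan/fi=products/sp=results_big_thumb/va=banner_text=.html?id=f8YyQGtr";
# ./index.html resolves under the deep pseudo-directory path, not the site root
print URI->new_abs("./index.html", $base), "\n";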
>
> The trouble is that it has to have the "sms.ic" bit when wget spiders it,
> so that it gets the live (dynamic) version, but NOT have the "sms.ic" in
> the mirrored (static) version.
>
> By "sms.ic" do you mean the search string or the actual characters?
>
> The actual characters are easy: (Probably what Kai was referring to when he
> basically said RTFM.)

Yup, I know it's easy, I should have RTFM.  Sorry.

> $ cat test.pl
> #!/usr/bin/perl
> use strict;
> my $url = "http://www.smssat.biz/sms.ic/index.html";
> print "old url: $url\n";
> $url =~ s/sms\.ic\///g;
> print "new Url: $url\n";
>
> $ perl test.pl
> old url: http://www.smssat.biz/sms.ic/index.html
> new Url: http://www.smssat.biz/index.html

I should have posted this earlier, but I didn't think anybody was actually 
going to help me so I didn't bother ;-)  I know Perl's probably better for 
pattern-matching etc., but as I was developing a bash script I thought I'd 
just stick it in there.  So I just used 'sed' like this:

find . -type f ! -name '*.stripped' | while read -r AFILE
do
 sed 's/sms\.ic\///g' "$AFILE" > "$AFILE.stripped";
 mv "$AFILE.stripped" "$AFILE";
done

Piping the output straight back to the same file turned them all zero-length: 
the shell opens (and truncates) the output file before sed starts reading, so 
sed only ever sees an empty file.  Writing to a temporary file first, as 
above, avoids that.
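
(Alternatively, Perl's -i switch does the same temporary-file shuffle behind 
the scenes, so the whole loop collapses to one line; a sketch, untested on 
the real tree:)

find . -type f -exec perl -i -pe 's/sms\.ic\///g' {} \;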

> To solve the first problem - getting to http://www.smssat.biz/index.html
> from the relative link ./index.html :
>
> #!/usr/bin/perl
> use strict;
> ################# Variable List ###############
> my $url; # the search results + query string
> my $match; # the search results split off from the domain
> my @matches; # array of each search result element
> my $content; # holds each member of @matches in turn.
> my $c; # counter
> ############### End variable list ##############
> $url =
> "http://www.smssat.biz/scan/fi=products/sp=results_big_thumb/st=db/co=yes/sf
> =category/se=OtherReceivers/va=banner_image=/va=banner_text=.html?id=f8YyQGtr";
> print "old url: $url\n";
> $url =~ s$http://www\.smssat\.biz/(.*)$http://www.smssat.biz/$g;
> print "new Url: $url\n";
> $match = $1;
> print "match $match\n";
> $c=0;
> @matches = split /\//,$match;
> foreach $content (@matches) {
> 	$c++;
> 	print "Content $c: $content\n";
> }
>
> Again, a line ending has been forced that isn't in the script. (Same place
> in each case.) Note the use of the $ delimiter for the first match - it
> saves escaping all the /. That's what I meant by lock-in - it's something
> that Perl can do easily but which would cause problems in other scripting
> languages. The split function needs the / itself, so the / in the pattern is
> escaped: \/. The . characters in the first match need to be escaped to stop
> Perl treating each one as a wildcard that would match any character.
>
> Output:
>
> perl test.pl
> old url:
> http://www.smssat.biz/scan/fi=products/sp=results_big_thumb/st=db/co=yes/sf
> =category/se=OtherReceivers/va=banner_image=/va=banner_text=.html?id=f8YyQGtr
> new Url: http://www.smssat.biz/
> match scan/fi=products/sp=results_big_thumb/st=db/co=yes/sf=category/se=
> OtherReceivers/va=banner_image=/va=banner_text=.html?id=f8YyQGtr
> Content 1: scan
> Content 2: fi=products
> Content 3: sp=results_big_thumb
> Content 4: st=db
> Content 5: co=yes
> Content 6: sf=category
> Content 7: se=OtherReceivers
> Content 8: va=banner_image=
> Content 9: va=banner_text=.html?id=f8YyQGtr
>
> Is that in the right direction? I'm sure you can process each content value
> as appropriate from here and create the new static URL by a similar
> process.
>
> You'd then just call the perl script as part of the copy process - in a
> pipe. The output of the live site would be piped into the input of the perl
> script which would transform it and output suitable static links to
> whatever process you want to use to write the output to files. (Perl could
> do the whole thing for you).

Sorry, it's getting late, I'll read this bit in the morning coz my head's 
b0rked!  What were we trying to do again?  Change the delimiter to something 
more sensible than "/"?
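
(Once my head recovers, I imagine the filter would look something like this: 
read the mirrored HTML on stdin, write rewritten links to stdout. Just a 
sketch; the second substitution assumes the session ids stay alphanumeric:)

#!/usr/bin/perl
# fixlinks.pl - a stdin-to-stdout filter for use in a pipe, e.g.
#   wget -q -O - http://www.smssat.biz/sms.ic/index.html | ./fixlinks.pl > index.html
use strict;
while (my $line = <STDIN>) {
    $line =~ s{sms\.ic/}{}g;            # drop the "sms.ic" linker program
    $line =~ s{\?id=[A-Za-z0-9]+}{}g;   # strip the session/customer id (assumed format)
    print $line;
}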

The only trouble is that I'm not quite sure how to surgically insert this into 
Interchange.  I think a more likely solution is that I get this Interchange 
store to a situation where it delivers reasonable performance (from a cache) 
and where the bulk of the pages are indexed by Google (which it will 
hopefully do now, fingers crossed) and then I'll start looking at 
alternatives.

I think I've pretty much had enough with Interchange; although I've persevered 
with trying to get to grips with its internals over the last year or two, 
I'm growing increasingly impatient with it.  It's too big, it's too 
complicated and it's too goddam slow.  I only really chose it because I 
thought it was the only powerful, free, open source ecommerce suite available 
at the time (although I might have been wrong).  Certainly now something like 
OSCommerce, which uses PHP, Apache and MySQL rather than trying to do it all 
itself like Interchange, seems like a much better solution.  I'll have to 
have a play with it before I can decide, but at first glance it looks far 
more elegant.

Thanks very much for your help!

Cheers,

Jon

