
Re: [LUG] More scripting silliness...



On Wednesday 02 Jul 2003 9:19 pm, Jonathan Melhuish wrote:
>
> Sorry, that was probably a bit unclear in my original email.  Although it
> is indeed a "query string", it isn't passed in the normal "?variable=value"
> way, it's passed as a supposedly 'normal-looking' URL, eg.
>
> http://www.smssat.biz/scan/fi=products/sp=results_big_thumb/st=db/co=yes/sf=category/se=OtherReceivers/va=banner_image=/va=banner_text=.html?id=f8YyQGtr
>
> I'm not sure why they decided to do it like that.  I dunno, I didn't design
> it guv ;-)  But is there anything technically wrong with an URL like that?

Technically wrong? I'd say it is stretching the rules, because:
1. It implies a non-existent filesystem: read strictly, the URL claims there 
are 8 sub-directories below the .biz domain, whereas none of them probably 
exists with the names specified (with or without the = ).
2. It uses non-standard repetition: it imitates a query string (each xx=yy 
segment would appear to be some form of variable=value statement) and then 
appends a real one - repetition that is likely to make many a parser barf 
(see the sketch after this list).
3. Required file data is absent: there's no 'real' file anywhere for 
processes like Google to grab onto. I'd presume there's some index.php, 
default.asp or similar behind it, but it's not stated and therefore must be 
assumed, which is often a bad tactic.
(Assume = an ass out of u and an ass out of me.)
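
To make point 2 concrete, here is a minimal sketch (in Python - my choice of 
language, not anything the engine itself is likely to use) of the sort of 
splitting the server presumably performs on those segments; every name in it 
is illustrative:

  from urllib.parse import urlsplit

  url = ("http://www.smssat.biz/scan/fi=products/sp=results_big_thumb"
         "/st=db/co=yes/sf=category/se=OtherReceivers"
         "/va=banner_image=/va=banner_text=.html?id=f8YyQGtr")

  parts = urlsplit(url)
  pairs = []
  for segment in parts.path.strip("/").split("/"):
      # treat each xx=yy path segment as an imitation variable=value pair
      if "=" in segment:
          key, _, value = segment.partition("=")
          pairs.append((key, value))

  print(pairs)        # note the repeated 'va' key - the repetition from point 2
  print(parts.query)  # plus the 'real' query string on top: id=f8YyQGtr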

Stretching the letter of the 'rules' but breaking the spirit? Personally, I 
wouldn't like to use an engine that relied on this type of persistence.

I'm not surprised that it doesn't parse well with processes like Google.

Incidentally, the W3C validator site can parse the URL, but the engine itself 
responds with some very bad HTML. It uses an HTML4 Transitional DOCTYPE 
(which would usually mean that someone cares about producing valid code, 
since a DOCTYPE isn't any use to a browser, only to a validator engine like 
the one at W3C), yet it uses tag attributes removed from HTML4 
(marginheight), omits required attributes (img alt=""), fails to properly 
nest tags, fails to escape entities (& should be replaced with &amp;) and 
puts settings in HTML that should be in CSS (img border=0). The validator URL 
is far too long to post here, as it includes the whole URL you quoted plus an 
extra query string for the W3C settings. As I mentioned last time, the 
validator also turns the whole URL into hexadecimal escape characters. Here's 
the first bit:
http://validator.w3.org/check?uri=http%3A%2F%2Fwww.smssat.biz%2Fscan%2Ffi%3Dproducts%2Fsp

(In that encoding, %2F stands for / and %3D stands for =.)
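
For the record, the translation is easy to reproduce with Python's standard 
library (the string is the start of the validator URL quoted above):

  from urllib.parse import quote, unquote

  encoded = "http%3A%2F%2Fwww.smssat.biz%2Fscan%2Ffi%3Dproducts"
  print(unquote(encoded))                  # http://www.smssat.biz/scan/fi=products
  print(quote(unquote(encoded), safe=""))  # and back to the %xx form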

Here are 5 of the 40 errors reported by the validator:
Line 18, column 19: there is no attribute "MARGINHEIGHT".
Line 51, column 61: there is no attribute "BORDER".
Line 22, column 154: required attribute "ALT" not specified.
Line 73, column 143: cannot generate system identifier for general entity 
"mv_pc"
  ...k/sms.ic/ord/basket.html?id=DabMJzjp&mv_pc=14" class="menubarlink">Your baske
Line 795, column 5: end tag for "TABLE" omitted, but its declaration does not 
permit this.
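
The "mv_pc" error is the unescaped-entity problem in action: a bare & inside 
an href has to be written as &amp; in the HTML source. A quick Python 
illustration (the path and link text are lifted from the validator output 
above; the rest is mine):

  from html import escape

  href = "basket.html?id=DabMJzjp&mv_pc=14"
  print('<a href="%s">Your basket</a>' % escape(href))
  # <a href="basket.html?id=DabMJzjp&amp;mv_pc=14">Your basket</a>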

It would take some time to bring that page to the intended HTML4 Transitional 
standard proclaimed at the top of the page returned from that URL:

  <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
  <html lang="en">
  <head>
  <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">

It would take somewhat longer again to get the engine itself to consistently 
produce valid HTML4 Transitional code - and that would need a decent 
understanding of the engine itself, too.

Perhaps this URL format is what we have to put up with if cookies get such a 
bad press. Essentially, the URL appears to be trying to track the current 
transaction(s) and results - exactly what a cookie should do. If a cookie 
were properly designed and used, the entire construct could be replaced and 
you'd have a normal directory and filename after the .biz/, which Google 
would be only too happy to parse. Other engines like this use a server-side 
database to store all this info and a normal query string with the id= 
setting to retrieve the rest of the data from the server database (see the 
DCLUG wiki as an example of database-driven persistence). That requires an 
extra step in installation and an extra layer to debug - not always 
appealing, but not actually that hard to implement, because so many 
components fit neatly within the appropriate public standards.
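
As a sketch of that last approach (illustrative Python again - the dict 
merely stands in for the server-side database, and the URL is a hypothetical 
tidied-up form):

  from urllib.parse import urlsplit, parse_qs

  # hypothetical server-side store, keyed by the id= value
  sessions = {"f8YyQGtr": {"fi": "products", "sp": "results_big_thumb",
                           "st": "db", "se": "OtherReceivers"}}

  url = "http://www.smssat.biz/scan/results.html?id=f8YyQGtr"
  session_id = parse_qs(urlsplit(url).query)["id"][0]
  print(sessions[session_id])
  # All the fi=/sp=/st= noise comes back from storage; the URL itself
  # stays a normal path plus one short query string.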

Is there a different engine available for the job?

-- 

Neil Williams
=============
http://www.codehelp.co.uk
http://www.dclug.org.uk

http://www.wewantbroadband.co.uk/
