SLUG Mailing List Archives
Re: [SLUG] Re: Why XML bites and why it is NOT a markup language
- To: slug@xxxxxxxxxxx
- Subject: Re: [SLUG] Re: Why XML bites and why it is NOT a markup language
- From: telford@xxxxxxxxxxxxxxxxxxxxx
- Date: Fri, 10 Jun 2005 14:36:13 +1000
- User-agent: Mutt/1.5.6i
On Fri, Jun 10, 2005 at 11:04:07AM +1000, Matthew Palmer wrote:
> The problem with XML isn't that it's a crap language, it's that people are
> very poor at following instructions. When a spec says "thou MUST do it this
> way", instead of doing it this way, people think "that's not important" and
> don't do it.
People are the constant part of the equation here. What you say is correct
and also immutable.
People might happen to mostly follow a spec if that spec is dead simple and
following it is also easy. The XML specification is overly complex and
difficult to understand, and there are lots of subtle ways to get things
wrong. The end result is that it is a reasonable expectation that most of
the things you get with a "*.xml" filename will not exactly conform to the
specification. If they do conform to the specification, then most likely at
some random future date someone will press the single-quote key and what
you thought was working will fall in a heap.
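As a quick illustration (using Python's standard xml.etree parser; the
document below is made up for the example), one stray single-quote is
enough to make a strict XML parser reject the whole document:

```python
import xml.etree.ElementTree as ET

# A well-formed document parses fine...
ok = ET.fromstring("<msg subject='all good'>hello</msg>")
print(ok.get("subject"))  # all good

# ...but a stray single-quote inside the attribute value closes the
# attribute early, and the strict parser rejects the entire document.
try:
    ET.fromstring("<msg subject='it's broken'>hello</msg>")
except ET.ParseError as e:
    print("ParseError:", e)
```

There is no "mostly valid" outcome: the parser either accepts the whole
document or throws it away.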
> I'm not sure whether the problem is basic human nature,
No, the problem is a refusal to work within the confines of basic human nature.
> or because we've
> been conditioned by so many really bong specs to ignore anything that
> doesn't make immediate sense to us...
And that too.
> As for the comparison with HTML, web browsers have been written to accept
> random garbage and try and make something useful out of it because that's
> what the web consists of.
Correct... and that's what makes HTML successful. The whole "world wide web"
thing simply would not have happened if we started out with something as
strict and breakable as XML.
> While it would be theoretically possible to do a
> similar thing with XML, it's a lot harder because you can "guess" what to do
> with bad HTML because of the limited use-case of HTML -- describing a web
> page. For XML it's a lot harder, because you can't make any assumptions
> about what the meaning of the data is that you're parsing.
Then we need to accept that XML is not particularly useful and we need to
start looking for something better. I'd like to coin the name "RML", which
stands for "Robust Markup Language", and which should have the following
properties:
* stream-oriented construction
* byte-oriented construction (no 16 bit encodings at all)
* supports arbitrary tags
* supports parametric tags
* never allow tags inside a tag definition
* NO guarantee of tags making a perfect tree (but parser can provide
information about tree or partial-tree structures if they exist)
* when tags are all next to one another, ordering is NOT important
(thus italic/bold is the same as bold/italic)
* at most one parameter per tag and no named parameters
(because named parameters bend your head and get very complex and
require special syntax and further because it is always better to
introduce a new tag than introduce a new named parameter)
* supports guarantee of resynchronisation to tag boundary after an
arbitrary seek into the file (scanning forwards or backwards) and
something that "seems to be" a tag boundary always IS a tag boundary
* case insensitive tag matching (for English at least plus any other
language that sensibly defines mixed case)
* damaged files can be recovered by an automatic process at least to
the extent that lost data is proportional to the amount of damage
* don't use closing tags at all, instead use the single parameter of
the parametric tags to "update" that type of tag. e.g.:
<bold> blah </bold> blurg
is not good because knowing the font of "blurg" requires information
going back an arbitrary number of tags earlier.
<font-weight=bold> blah <font-weight=normal> blurg
is much better because scanning backwards until you hit a <font-weight>
tag will guarantee you have a full understanding of this parameter.
In other words, you don't need to parse every document all the way
from the beginning, and thus large documents become manageable
* non-ascii encodings are passed cleanly up to the application
level which can apply whatever translation it feels like doing
(translation libraries might be an optional overlay after
parsing is complete)
* non-ascii encodings can never break the basic tag structure,
so the parser can detect interesting encoding anomalies but can
still continue to scan the file
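None of this exists yet, but the parametric-tag idea is concrete enough to
sketch. Here is a minimal hypothetical implementation in Python (the tag
syntax, the TAG regex and the state_at function are all inventions for this
example, not any real library): the current value of a parameter is simply
the nearest earlier tag that sets it, so you never need to parse from the
start of the document.

```python
import re

# Hypothetical RML parametric tags: <name=value> or bare <name>,
# with no closing tags -- a new tag just updates that parameter.
TAG = re.compile(r"<([a-z-]+)(?:=([^>]*))?>")

def state_at(text, offset, param):
    """Return the value of `param` in effect at `offset`.

    Scans the tags that appear before `offset` and keeps the most
    recent setting -- equivalent to scanning backwards until the
    first matching tag is found.
    """
    last = None
    for m in TAG.finditer(text, 0, offset):
        if m.group(1) == param:
            last = m.group(2)
    return last

doc = "<font-weight=bold> blah <font-weight=normal> blurg"
print(state_at(doc, doc.index("blah"), "font-weight"))   # bold
print(state_at(doc, doc.index("blurg"), "font-weight"))  # normal
```

Because each tag carries its full parameter value, a parser that seeks into
the middle of a large file only has to scan back to one tag boundary, not
replay the whole document.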
That's my wishlist... probably won't get done this afternoon but at least
it is down on record so that when everyone is old and grey and some young
guy says "I've invented this new tagging system that fixes the XML
nightmare that has plagued the world for so long" I can give him a link
to the Slug archives and say "told you so".
- Tel ( http://bespoke.homelinux.net/ )