- To: slug@xxxxxxxxxxx
- Subject: [SLUG] Why XML bites and why it is NOT a markup language
- From: telford@xxxxxxxxxxxxxxxxxxxxx
- Date: Thu, 9 Jun 2005 20:10:02 +1000
- User-agent: Mutt/1.5.6i
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
So JJJ is offering an RSS newsfeed for the "Hack" program and
the URL is here:
http://www.abc.net.au/triplej/hack/podcast/podcast.xml
So here we have data made available to the whole world and put
into XML format: perfect for compatibility, interoperability,
etc. etc. So let's look at a practical example of this stuff
really working, shall we? If you look at the header of the
file you can see:
<?xml version="1.0"?><rss version="2.0">
So no problem, I should be able to use all the existing technology
to work with this stuff and gain the benefit of code reuse and
all that good stuff. So I use perl and libxml -- proven technology.
I can fetch the file easy enough using wget, no problem so far.
OK, here's my microscopic perl program that reads the file, I'm
including the entire listing just to prove I'm not doing anything
other than the obvious:
- ---------------------------------8<-------------------------------------
#!/usr/bin/perl -w
use XML::LibXML;
my $parser = XML::LibXML->new();
$parser->recover(1);
$parser->pedantic_parser(0);
$parser->validation(0);
my $doc = $parser->parse_file( "podcast.xml" );
my $root = $doc->documentElement();
- --------------------------------------->8-------------------------------
You would think it would work real smooth but what result do you get?
podcast.xml:39: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0x92 0x72 0x65 0x20
<description>All this week we re looking at the unofficial mental health system;
^
Sure enough there is a high-ascii item in there and some Mac user has
no doubt used a proprietary bingle-bongle encoding for a single quote
even though there is a perfectly good ASCII encoding for the same.
Never mind blaming the Mac user... they do that sort of thing, it's a
fact of the universe, nothing will ever change a Mac user. However,
what really shits me is that the XML parser dies totally and completely
when it hits a single high-ascii character. This is with the "recover"
flag set, and both "pedantic" and "validation" switched off. Basically
it is running in the most lenient possible mode that it can possibly
operate in and a single bad character still nails it.
Yes, I realise that in an ideal world the <?xml?> declaration would
contain encoding information, and yes, I realise that correct UTF-8
must encode characters above 127 as multi-byte sequences and this
byte does not conform. OK, we don't live in an ideal world, the document
has a problem. Having established that, how am I supposed to read it?
man XML::LibXML::Parser
search for "encod" -- nothing.
Looks like the error message wants you to "indicate encoding" but the
man page does not tell you how to achieve that.
It explains that you can catch the errors using an eval block, but having
caught the error there is no way to follow through and finish parsing
the file. So one single byte has rendered an entire file unreadable...
what a fantastic protocol, so good for inter-operability, so widely
compatible.
And here comes the gist of this rant...
A markup language MUST be robust. Anything that claims to be portable
and all-purpose and the document processing format of the future simply
cannot be destroyed by a single bit-flip on a single character.
The fundamental difference between a programming language and a markup
language is that a programming language can have parser errors and
syntax errors whereas a markup language cannot (by definition) have
any errors at all under any circumstances. The parser for a markup
language must be fully robust to all possible inputs: it can
certainly emit WARNINGs of various severity, but nothing must
stop the parser.
Fundamentally, XML is crap as a markup language because it simply
isn't possible to build a fully robust parser. Worse than that, you
can't recover state (even approximately) in the presence of a damaged
document. XML is brittle, as brittle as any programming language.
Let's make a simple comparison... suppose I do all my data transfer
by simple tab delimited ASCII files with one record per line.
If a line gets damaged, I might lose that line, I might even lose
the line after the damaged line but at least I have the rest of the
document. If I jump into a plain ASCII file at a random location then
I can scan around the local area until I find the end of a line and
I can resynchronise to the local records. This technique can be used
to perform a fast binary search directly into an ASCII file that is
sorted by line -- can you do this with XML? Of course not... your
basic parse-state is broken the moment you seek to anywhere at all,
and that state is perpetually unrecoverable because something that
looks like a tag can exist within a string or you can have a CDATA
or some other stupid thing.
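The seek-and-resynchronise trick for line-oriented files can be sketched
roughly like this. It is only a sketch under assumptions: the file is
tab delimited, sorted in plain byte order on its first field (e.g. with
LC_ALL=C sort), and the names find_record, line_at_or_after and key_of
are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Key of a record: everything before the first TAB (or the whole line).
sub key_of {
    my ($line) = @_;
    chomp $line;
    my $tab = index($line, "\t");
    return $tab < 0 ? $line : substr($line, 0, $tab);
}

# Return the first complete line that starts at or after $offset,
# or undef if there is none. For a non-zero offset we step back one
# byte and throw away everything up to the next newline: that is the
# resynchronisation step.
sub line_at_or_after {
    my ($fh, $offset) = @_;
    if ($offset > 0) {
        seek($fh, $offset - 1, 0) or die "seek failed: $!";
        my $discard = <$fh>;
    } else {
        seek($fh, 0, 0) or die "seek failed: $!";
    }
    return scalar <$fh>;
}

# Binary search over byte offsets. Returns the matching line without
# its newline, or undef if no record has that key.
sub find_record {
    my ($path, $want) = @_;
    open(my $fh, '<:raw', $path) or die "cannot open $path: $!";
    my ($lo, $hi) = (0, (-s $fh) || 0);
    while ($lo < $hi) {
        my $mid  = int(($lo + $hi) / 2);
        my $line = line_at_or_after($fh, $mid);
        if (!defined $line || key_of($line) ge $want) {
            $hi = $mid;          # answer starts at or before $mid
        } else {
            $lo = $mid + 1;      # answer starts after $mid
        }
    }
    my $line = line_at_or_after($fh, $lo);
    close($fh);
    return undef unless defined $line && key_of($line) eq $want;
    chomp $line;
    return $line;
}

# Tiny demonstration on a throwaway file.
my ($tmp, $path) = tempfile(UNLINK => 1);
print {$tmp} "apple\t1\nbanana\t2\ncherry\t3\n";
close($tmp) or die "close failed: $!";
my $hit = find_record($path, 'banana');
print defined $hit ? "found: $hit\n" : "not found\n";
```

Every probe lands at an arbitrary byte and still recovers a whole record
by scanning to the next newline, which is exactly the property the
paragraph above relies on.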
There's a philosophical argument about whether brittle languages
or robust languages are more useful for data transfer. Some would
argue that if you are sending commands to a Mars rover then all
your command transmissions should be brittle because if the
communication stream gets damaged then better for the rover to
ignore the command rather than try and have a go and possibly
screw up something important.
I've got three answers to the above. Most importantly, you don't
use a markup language for talking to a Mars rover... you use a
programming language and we all agree that programming languages
are brittle and always will be. Another (still significant)
point is that you can always take a robust language (e.g. simple
TAB delimited text file) and make it brittle by adding a CRC or
some sort of signature system... you cannot take a brittle
language and make it robust. Finally, your brittle language
still isn't full protection against a comms failure because
sometimes a single bit flip (like turning "1" into "3") will
have disastrous consequences to the command but will look fine to
the parser. So the parser isn't real safety, it is at best a
false sense of safety. You still need CRCs and the like.
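To make the "add a CRC" point concrete, here is a small sketch that
bolts a per-line checksum onto a tab delimited record. Assumptions: I
use MD5 from Digest::MD5 as a stand-in for a CRC simply because it ships
with perl, and seal_record / open_record are hypothetical names, not
anything from an existing module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Append a checksum of the record as one extra tab-delimited field.
sub seal_record {
    my ($record) = @_;
    return $record . "\t" . md5_hex($record);
}

# Return the original record if the trailing checksum matches,
# undef if the line is damaged or has no checksum field.
sub open_record {
    my ($line) = @_;
    chomp $line;
    my $cut = rindex($line, "\t");
    return undef if $cut < 0;
    my $record = substr($line, 0, $cut);
    my $sum    = substr($line, $cut + 1);
    return md5_hex($record) eq $sum ? $record : undef;
}

my $sealed = seal_record("drive\tforward\t1");

# Simulate line noise: "1" (0x31) becomes "3" (0x33), a single bit flip.
(my $damaged = $sealed) =~ s/\t1\t/\t3\t/;

for my $line ($sealed, $damaged) {
    my $record = open_record($line);
    print defined $record ? "accepted: $record\n" : "rejected a damaged line\n";
}
```

The damaged command still looks perfectly well formed to anything that
only checks syntax; it is the checksum that catches it. A reader can
reject that one line and carry on with the next, so the strictness is
opt-in and local to the record.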
Thus if anyone is going to design a communications language it
should be robust, and that means it can recover from problems
and can guarantee resynchronisation from an arbitrary seek.
XML doesn't live up to the promise of being a universal markup
language because it is too annoying and too brittle.
By the way, how DO I get perl to read such a file?
Do I have to write my own parser?
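Probably not. One workaround, sketched under assumptions rather than
tested against the real feed: treat the file as Windows-1252 (where
byte 0x92 is the curly right single quote, U+2019), transcode it to
UTF-8, and feed the result to the same XML::LibXML script. The cp1252
guess, the to_utf8 helper and the command-line arguments are all mine;
only decode and encode from the Encode module are real calls.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Re-encode raw bytes from a legacy single-byte encoding to UTF-8.
# The source encoding is a guess; the feed itself declares nothing.
sub to_utf8 {
    my ($bytes, $from) = @_;
    return encode('UTF-8', decode($from, $bytes));
}

# Usage (hypothetical file names): fixenc.pl podcast.xml podcast-utf8.xml
my ($in, $out) = @ARGV;
if (defined $in && defined $out) {
    open(my $src, '<:raw', $in) or die "cannot read $in: $!";
    my $bytes = do { local $/; <$src> };   # slurp the whole file as bytes
    close($src);

    open(my $dst, '>:raw', $out) or die "cannot write $out: $!";
    print {$dst} to_utf8($bytes, 'cp1252');
    close($dst) or die "cannot finish writing $out: $!";
}
```

After that, parse_file() on the output file should no longer trip over
the 0x92 byte, because it has become the three-byte UTF-8 sequence
0xE2 0x80 0x99. This only makes sense if the whole file really is
cp1252 or plain ASCII; running it over a file that already contains
valid multi-byte UTF-8 would mangle those characters.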
- Tel ( http://bespoke.homelinux.net/ )
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.6 (GNU/Linux)
iQIVAwUBQqgVecfOVl0KFTApAQKqOw/7BDUW3e8dATSN89m1lRTxuoAcbjrVb+qj
V56atuz7J1hhUIK2MHTACnpq0GzW4Yf/2yg30G7bXNlThGLBgZFX+eLtwkiIfoRY
NVE14JQuNeWyfHfQpuIR/2PwbbsuqJ0wPhV1BqEZE8MCx8+kVY/vI02sxgb5XgfW
h/WVuXMjKGU/ZBsKG1xLL54NXCTCyp3opPYhTX4GAD0vXAE27CAi2Z+ihzMBvXPg
sh7tPXrzSqj+sCz9GGcW/aUif/r0hdnTVykhEkB64CUsRH1moLFeG027GuPDKe5f
swoAv88ad2noL58RMqOy+43zw08TL85kZ+nGtd+Nywwtx8yF0gwyFqBNIzt3puWE
I2DOfWN/+zbwSITEQWfGq0kRekMY2zimruNcL0sirZSoK2IJby0GfBwhvuN5twQc
dN5szgooIGUtLLn9R9ehh5dGZfCKoknSFNkQIBnlVWwRwC4P6jEs5Fjgs/hSaeZf
UnLLTVPjZtwCWV4Pu/YGg3yEvBCDNytjbUTclgzOrPQwfUDiaWGpoMeWsSWzKgOu
6aYzHPXc9RFoYNw7KGtzTzBKMOxC00KqafkbDb0lYMpeK0RbRUq6Ta1Tojuk0KaD
kYqYgrJptytWUioZxO8RWY3SgMNSoHqo98/6nA0LjyfU+ayKpwLMbJF2yTmvVzMe
wYa/C6G9ywU=
=sNKY
-----END PGP SIGNATURE-----