04 January 2007

XML is not for documents

i've been reading
goldfarb's xml handbook
(4th edition, 2002)

and i'm appalled to see
that even goldfarb
makes device-independent display
a cornerstone of his rationalisations

which i still find
disingenuously inside-out
so i'll retell the xml-story here
outside out, imho:

every computer user
maintains dozens of databases
(buddy lists, email archives, etc)
and sometimes chooses to share
some of that data
with others

or may want to merge
others' data
into their own

databases traditionally demand
rectangular arrays of labeled rows and columns

and the traditional way to share data
was to declare in advance
(as metadata)
how many rows and how many columns
the message contains
and what each row and column means

and then send all the cells in sequence
separated by tabs or commas
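
(a minimal python sketch
of that traditional scheme,
where the column names and the sample rows
are invented for illustration)

```python
import csv
import io

# hypothetical metadata, agreed in advance: what each column means
COLUMNS = ["name", "email", "birthday"]

# the message itself: just cells in sequence, separated by commas
message = (
    "alice,alice@example.org,1970-01-01\n"
    "bob,bob@example.org,1980-02-02\n"
)

def read_rows(text, columns):
    """pair each cell with its pre-agreed column label"""
    rows = []
    for cells in csv.reader(io.StringIO(text)):
        if len(cells) != len(columns):
            # the rectangular demand: every row must have the same shape
            raise ValueError("row does not fit the declared columns")
        rows.append(dict(zip(columns, cells)))
    return rows

buddies = read_rows(message, COLUMNS)
```

the receiver can only make sense of the cells
because it already holds the column list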




xml instead
uses labeled tags as separators
eliminating the requirement for rectangular arrays
(a 'row' can contain
any number of cells
of any type)

and permitting some of the required metadata
to be included in the message itself

thus inching towards
full automation
of data sharing
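
(a rough sketch of the difference,
using python's standard xml.etree module;
the tag names and the sample buddies
are assumptions of mine, not any real schema)

```python
import xml.etree.ElementTree as ET

# hypothetical buddy-list message: the labels travel with the cells
message = (
    "<buddies>"
    "<buddy><name>alice</name><email>alice@example.org</email></buddy>"
    "<buddy><name>bob</name><email>bob@example.org</email>"
    "<email>bob@example.net</email><birthday>1980-02-02</birthday></buddy>"
    "</buddies>"
)

def read_records(text):
    """collect each record's cells as (label, value) pairs, in order"""
    root = ET.fromstring(text)
    return [[(cell.tag, cell.text) for cell in record] for record in root]

records = read_records(message)

# the two 'rows' need not have the same shape
sizes = [len(record) for record in records]
```

no column list has to be agreed beforehand
and the second 'row' carries two emails
without breaking the first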




since human-to-human messages
are conventionally
prose documents (not database dumps)
yet often contain
the identical text-strings
we'd want in the databases
(calendar events, book citations)

the possibility was mooted
of embedding tags
within ordinary prose formatted as xml
so that comparably automatic merging
of these prose substrings
into the relevant databases
might be possible
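
(a small sketch of what that could look like,
again in python;
the 'event' and 'citation' tags
and their attributes
are made up for the example)

```python
import xml.etree.ElementTree as ET

# hypothetical prose message with database-worthy substrings tagged inline
prose = (
    "<note>see you at <event date='2007-01-12'>the reading group</event>, "
    "and bring <citation author='goldfarb'>the xml handbook</citation> "
    "if you can.</note>"
)

def harvest(text, tag):
    """pull out every substring wrapped in the given tag, with its attributes"""
    root = ET.fromstring(text)
    return [(el.text, dict(el.attrib)) for el in root.iter(tag)]

def plain(text):
    """what the human reader sees: the same prose with the tags ignored"""
    return "".join(ET.fromstring(text).itertext())

events = harvest(prose, "event")
citations = harvest(prose, "citation")
readable = plain(prose)
```

the calendar takes the events
the bibliography takes the citations
and the human still gets a sentence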




in effect
document authors would now be addressing
not just humans
but also their machines
and tags would be added
(and phrasings tweaked)
so the machines could understand too

(if AI were far enough advanced
this would be unnecessary
because the machines could equally understand
the raw, untagged prose)

now some prose substrings
like titles and subheadings
will have characteristic formatting
(larger/smaller, bold, italic)
chosen by the author

and the designers of xml
(and before xml, goldfarb's 1969-1986 sgml)
suggested that every such formatting change
should be signalled with an xml tag
even if its likelihood
of ever being useful in data exchange
was doubtful
(simple italics, paragraph breaks)




somehow, horribly
this
gamble
was elevated to the status of revelation
(call it goldfarb's conjecture)

and the insupportable additional claim was made
that 'cleansing' documents
of all formatting markup
and replacing it with 'structural' tags
would be a significant advance
towards an entirely different goal:
creating a single master file
that can be efficiently viewed
on any random device
(big monitor, small monitor,
printer, speech-synthesizer)

with the reductio-ad-absurdum
that 'EM' and 'STRONG'
were better than 'I' or 'B'
(for italic and bold)
because speech synthesizers would find
'I' more baffling than 'EM'

and taken now to the absurder extreme
that before
any
device
can render an xml document
it has to refer to a
custom-built stylesheet
that translates the xml tags
back into comprehensible styling instructions

thus making more work
both for the machine
and for the author
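
(to make the extra step concrete,
a toy python sketch;
the stylesheet here is just a dict
and the tag names and bracket notation
are my own inventions,
not xslt or css)

```python
import xml.etree.ElementTree as ET

# hypothetical custom-built stylesheet: structural tag -> styling instruction
STYLESHEET = {"title": "bold", "emph": "italic"}

def render(el, stylesheet):
    """walk the tree, wrapping each styled span in [style]...[/style]"""
    inner = (el.text or "") + "".join(
        render(child, stylesheet) + (child.tail or "") for child in el
    )
    style = stylesheet.get(el.tag)
    return f"[{style}]{inner}[/{style}]" if style else inner

doc = ET.fromstring(
    "<para><title>xml</title> is <emph>not</emph> for documents</para>"
)

styled = render(doc, STYLESHEET)   # with the stylesheet
bare = render(doc, {})             # without it, the tags mean nothing
```

without the lookup table
the device has the words
but no idea how to show them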

if the goal is device-independence
what's needed instead
is a vast shared 'namespace'
of document structures
each with a pre-agreed default rendering
on every class of device

that authors can override
(if they care to be bothered)
but that default to
best-possible-under-the-circs
styling

these structures to include
plain old styles like italic
for unregenerate reactionaries (like me)

and the fascinating web-page structures
xmlers have never gotten around to
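
(a sketch of such a shared namespace
as a python lookup table;
the structures, the device classes
and the renderings themselves
are all placeholders i've made up)

```python
# hypothetical shared registry:
# structure -> pre-agreed default rendering for each class of device
DEFAULTS = {
    "title": {
        "screen": "large bold",
        "print": "large bold",
        "speech": "pause, then slower",
    },
    "italic": {
        "screen": "italic",
        "print": "italic",
        "speech": "slight stress",
    },
}

def rendering(structure, device, overrides=None):
    """the author's override if there is one, else the shared default"""
    overrides = overrides or {}
    if (structure, device) in overrides:
        return overrides[(structure, device)]
    return DEFAULTS[structure][device]

# an author who can't be bothered gets the default
lazy = rendering("italic", "speech")

# an author who can be bothered overrides one case
fussy = rendering("title", "screen", {("title", "screen"): "small caps"})
```

no per-document stylesheet is needed
unless the author wants one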




but the ideal of mixing these
'style structures'
with semantic tags
seems unpromising to me

(if you mention the same person
ten times in an article
do you embed that person's metadata
ten separate times?)

i'd rather see
machine-readable footnotes
than monstrous human-machine hybrids
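
(one way such footnotes might work,
sketched in python;
the [1]-style markers
and the footnote fields
are assumptions for illustration only)

```python
import re

# hypothetical format: the prose carries only short markers,
# and each piece of metadata lives once, outside the prose
prose = "goldfarb[1] wrote the handbook[2]. i disagree with goldfarb[1]."
footnotes = {
    "1": {"type": "person", "name": "charles goldfarb"},
    "2": {"type": "book", "title": "xml handbook", "year": "2002"},
}

def for_humans(text):
    """the prose with the markers stripped out"""
    return re.sub(r"\[\d+\]", "", text)

def for_machines(text, notes):
    """each cited footnote once, however often the prose mentions it"""
    cited = []
    for ref in re.findall(r"\[(\d+)\]", text):
        if ref not in cited:
            cited.append(ref)
    return [notes[ref] for ref in cited]

readable = for_humans(prose)
records = for_machines(prose, footnotes)
```

the person is mentioned twice
but described once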

xml was never a good fit for documents




