The World Wide Web's extraordinary reach is based in part on its open assimilation of document formats. Although Web transfer protocols and addressing can accommodate any kinds of resources, the unique application context of a truly global hypermedia system favors the adoption of certain Web-adapted formats. In this paper we consider the evolutionary record that has led to the ascent of the eXtensible Markup Language (XML).
We present a taxonomy of document species in the Web according to their syntax, style, structure, and semantics. We observe the preferential adoption of SGML, CSS, HTML, and XML, respectively, which leverage a parsimonious evolutionary strategy favoring declarative encodings over Turing-complete languages; separable styles over inline formatting; declarative markup over presentational markup; and well-defined semantics over operational behavior.
The paper concludes with an evolutionary walkthrough of citation formats. Ultimately, combined with the self-referential power of the Web to document itself, we believe XML can catalyze a critical shift of the Web from a global information space into a universal knowledge network.
Keywords: Markup Languages, Metadata Systems, Information Retrieval and Modeling, HTML, XML
The World Wide Web is defined as the "universe of network-accessible information" [Berners-Lee, 1996]. Such audacity may be a hallmark of hypermedia systems development, but the Web has delivered on this promise in spectacular fashion, for two reasons. The first reason is the openness and content-neutrality of the HyperText Transfer Protocol [Fielding et al., 1997], which can adapt to exchange any document format, and Universal Resource Locators [Berners-Lee et al., 1994], which can represent links to any document format, from within many document formats.
That explanation alone would predict the proliferation of divergent document species in the Web's docuverse -- fragmenting the market among many competing word-processing formats, spreadsheet formats, image formats, and so on.
Instead, we observe the second reason for the Web's success: "natural selection" appears to have favored a few formats that have been explicitly adapted to the Web. Among the profuse variation of document syntax, style, structure, and semantics, we observe the preferential adoption of SGML, CSS, HTML, and XML. Each leverages a parsimonious evolutionary strategy favoring declarative encodings over Turing-complete languages; separable styles over inline formatting; declarative markup over presentational markup; and well-defined semantics over operational behavior.
Whereas the first explanation implies a passive Web that accommodates all document formats equally, the second argues that the medium itself favors evolution from information capture towards knowledge representation. The key is that the Web can be leveraged reflexively to capture a document's structure and semantics -- that any community can define its own ontology, or adopt, extend, and combine others. In this context, we argue that the emergence of XML-based formats does not merely represent a slew of new competitors, but an ecosystem of interdependent document species.
In spirit of Darwin's own investigation, let us "acknowledge plainly our ignorance of the cause of each particular variation" [Darwin, 1872] and study some of the alternative designs for representing a chunk of knowledge on the Web. Specifically, consider business cards -- metadata about people.
In the abstract, business cards are consistent enough: names, titles, addresses, phone and fax numbers, corporate insignia, and so on. In reality, there are innumerable variations of physical form, language, and function. This applies to personal identification on the Web as well.
A card can be represented with natural language in a text file, in any layout. Email .signatures are slightly more regularized, and some programs even attempt to automatically extract personal data. Other text formats such as vCard [Versit, 1997] are explicitly designed for that. Bitmaps or drawings can represent it visually, at the cost of machine-readability of the data. An interactive applet could even represent it as an executable animation.
Web authors have several more prosaic choices as well. HTML formatting can capture its appearance as well as the textual data. Structured HTML, such as the <ADDRESS> tag, can indicate the role of the data and embed it in other documents.
Even the HTML 4.0 tagset does not include specific enough structural markup to represent the logical components of a business card [Raggett, Le Hors, and Jacobs, 1997]. HTML's centralized evolution caters to the common denominator of document markup. XML, by contrast, is designed for decentralized development of extensible tagsets [Bray, Paoli, and Sperberg-McQueen, 1997; Bray and DeRose, 1997]. Using XML, an author can create a business card Document Type Definition to define specific tags such as <TITLE> and <EMPLOYER> [Khare and Rifkin, 1997].
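Such a business-card vocabulary might look like the following sketch. The tag names and field values here are illustrative assumptions, not the DTD from the cited work; the snippet uses Python's standard ElementTree parser to show that each field becomes machine-extractable once it is marked up declaratively:

```python
# Hypothetical business-card markup: CARD, NAME, TITLE, and EMPLOYER
# are assumed element names, chosen for illustration only.
import xml.etree.ElementTree as ET

card = ET.fromstring("""
<CARD>
  <NAME>Jane Doe</NAME>
  <TITLE>Staff Engineer</TITLE>
  <EMPLOYER>Example Corp</EMPLOYER>
</CARD>
""")

# Because the markup is declarative, every field can be pulled out
# by name, with no layout heuristics or screen-scraping.
fields = {child.tag: child.text for child in card}
print(fields["EMPLOYER"])  # Example Corp
```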
There is as much variability in the electronic representation of cards as in the cards themselves. In this paper, we explore some of the aspects of variation in syntax, style, structure, and semantics and discuss which alternatives seem better-adapted to the Web. Each of these issues can be identified in our evolutionary walkthrough of citation formats. Finally, we conclude with a taxonomy of several popular document formats and some thoughts about other evolutionary strategies behind the Web's success.
Consider several alternative encodings to represent a public key/name binding: X.509 certificates in binary Abstract Syntax Notation (ASN.1) [ISO 8825:1987] format; a PGP key information block using readable text and hexadecimal digits; or even an executable program that generates the key on demand. The choice depends on several tradeoffs between the cost of reading, writing, editing, and maintaining each syntax. We characterize document formats based on their choice of binary or text encoding; declarative or Turing-complete grammar; and mission specificity or generality.
Figure 1. Examples of document syntax according to encoding, grammar, and mission. (MIF stands for FrameMaker Interchange Format; COFF is the Common Object File Format.)
The first tradeoff is along the spectrum from binary machine language to textual natural language. Initially, machine-specific coding seems less expensive because it directly mirrors data structures in memory. However, that mapping can be too brittle to use across multiple platforms (for example, endianness) and multiple software versions. Binary coding may also take greater space and time to pack and unpack (a criticism leveled against ASN.1). Text files, especially as S-expressions or other context-free grammars, can be as efficient. In return, even partially human-readable forms are easier to edit, repair, and extend.
The second tradeoff is along the spectrum from declarative formats to Turing-complete programs. It can be easier to send a program that produces the data than to transmit the data itself: it is certainly easier to send a program that calculates pi than to transmit a billion digits of it. PostScript and TeX are powerful examples of programming languages for drawing and typesetting documents. On the other hand, it is formally intractable to manipulate or convert such documents in general. Trying to extract the third word from a PostScript program is equivalent to the Halting Problem [Connolly, 1994]. Declarative formats, such as context-free grammars, are formally tractable, allowing reliable document interchange and maintenance.
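The tractability gap can be made concrete. Against a declarative encoding, a query like "extract the third word" is a trivial operation that always terminates; against a document-as-program, the same query requires executing the program, which may never halt. A minimal sketch:

```python
# With a declarative encoding, "extract the third word" is a one-line,
# guaranteed-to-terminate operation. Extracting the same word from a
# PostScript or TeX *program* would require running it to completion.
declarative = "XML, Java, and the Future of the Web"
third_word = declarative.split()[2]
print(third_word)  # and
```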
Finally, there is a tradeoff between using mission-specific or generic formats. To the degree that information reuse is a critical concern, there is further value in leveraging a family of related grammars. SGML, the Standard Generalized Markup Language [ISO 8879:1986] for defining and using portable document formats, was designed to be formal enough to allow proofs of document validity, structured enough to handle complex documents, and extensible enough to support management of large information repositories. The power of SGML is reflected in its flexibility in managing documents of all types, from manuals and press releases to legal contracts and project specifications; and in its reusability for batch processors to produce books, reports, and electronic editions from the same source file(s).
There is a fundamental tension between the performance, cost, and usability of machine-readable and human-readable encoding strategies. The SGML approach strikes a reasonable balance between the two. Human readability implies robustness, and machine-readability implies validity; both qualities add value to information and ease the evolution of documents over time.
As long as there have been documents, there have been authors and designers agonizing over each stroke of the pen, each piece of type, each picture placement. To the degree that we can capture the abstract style of a layout and reuse it, the more value the design and the documents themselves have. The evolutionary history of Web document formats favors externalized formatting over embedded directives precisely so information can be represented independently of style, and vice versa [Nelson, 1997].
There have been many approaches to inline formatting-oriented representations, from troff commands to Rich Text Format (RTF) directive stacks to HTML's font tags. Almost inevitably, these are complemented with reusable formatting shortcuts: macro packages, rulers, and browser appearance parameters, in these examples. Building on this experience, the Web's premier document formats, HTML and XML, allow formatting to be externalized with style sheets. Cascading Style Sheets (CSS) goes further by allowing the composition of separate styles, encouraging separation of concerns such as font and color properties, character sets, graphic flows, and layout [Lie and Bos, 1997]. Furthermore, formatting control is shared between the author and reader, who can interpose his or her own chain of style sheets.
Styles are not limited to visual treatments -- they can control rendering for displays, paper, audio, Braille, and many other media. This is especially valuable for adapting content to accommodate physical disabilities, dyslexia, and illiteracy -- as well as situational constraints such as talking on the phone or working in a noisy environment [WAI, 1997]. Audio streams can be transcribed to text over the Web for deaf users; audio browsers such as pwWebSpeak can dictate pages to users unable to read them according to Aural CSS [Paciello, 1997; Lilley and Raman, 1997].
Though Web technology supports either inline formatting or style sheets, managing large hypertext information systems almost requires externalized formatting to remain navigable and usable.
The anatomy of a newspaper article includes a headline, byline, body, and footer. Various competing document formats attempt to capture these structures in their representations. Some describe chunks of the document in presentational terms: bold, italic, indented, and so on. At the other end of the spectrum, some use declarative terms: title, address, keyboard-input, and so on. Many other formats can be found somewhere in between on this scale, such as the HTML tag <EM> compared to <I> or <ADDRESS>.
Another kind of declarative structure for SGML (and XML) applications is the Document Type Definition (DTD), which can require valid documents to include elements in a specified order ("every occurrence of <STATE> must be prefixed by the <CITY> tag and followed by <ZIP>").
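A DTD would express that constraint as a content model such as <!ELEMENT LOCATION (CITY, STATE, ZIP)> (an illustrative element declaration, not from any published DTD). Python's standard library does not validate against DTDs, so this sketch checks the ordering by hand:

```python
# The ordering constraint from the text, checked manually. A validating
# SGML/XML parser would enforce the equivalent DTD content model:
#   <!ELEMENT LOCATION (CITY, STATE, ZIP)>
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<LOCATION><CITY>Irvine</CITY><STATE>CA</STATE><ZIP>92697</ZIP></LOCATION>"
)
# A valid instance must list its children in exactly this order.
assert [child.tag for child in doc] == ["CITY", "STATE", "ZIP"]
print("valid")
```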
Choosing along this axis entails tradeoffs between accuracy and comprehensibility: weaker presentational semantics are better understood universally than narrowly useful declarative types. <CITE> became part of the original HTML repertoire, but <ABSTRACT> did not, so the only alternative for WWW7 Conference authors is the presentational <I> tag. Declarative markup that clearly indicates the role of various document parts finds its value in later reuse. For example, indexing engines could assign greater weight to terms appearing in an abstract; or, they could automatically extract the reporters from a set of newspaper clippings using an information-capture tool such as webMethods' Web Interface Definition Language (WIDL) [Allen, 1997].
Choosing to describe document structure by its function rather than its form calls for extensible tag sets. Centralized evolution of HTML precludes adding an exhaustive list of all possible document idioms. A new tag potentially has ambiguous grammar (is it an empty element, or does it pair with an end-tag?), ambiguous semantics (no metadata about the ontology it is based on), and ambiguous presentation (especially without stylesheet hooks). SGML definitively addresses these issues for new DTDs, but the engineering costs are compounded because the SGML specification does not follow accepted computer-science conventions for the description of languages [Kaelbling, 1990], a lesson relearned during the protracted effort to standardize HTML 2.0 as an SGML application [Connolly, 1994].
Communities of interest need to publish their own definitions easily, a process facilitated by using XML. These new definitions can even reach past specifying roles to include interpretations and behaviors; that is, they support new semantics.
The ultimate test of a document format is how well (or poorly) it supports the uses of its contents. Documents exist as artifacts of larger processes like purchasing, reporting, or software development, and these uses bind semantic meaning to parts of a document.
Format support for semantics falls along a spectrum of disclosure: from undocumented, through operational behavior hard-coded in the processes that manipulate the contents, to well-defined, openly available, and documented definitions of the contents.
Consider a "to do" list in a text file; it is used as a natural language tickler by an end-user. An HTML version with deadlines for each entry might be parsed by a script that reads the contents and alerts the user. Yet a third variant, in XML, might declare a ToDo DTD that explicitly defines a <DEADLINE> container for ISO quoted dates. Only the latter format can claim to have well-defined semantics, bound to the document itself rather than through any one application. It can be embedded within other XML documents and exchanged with other communities while retaining its unambiguous definition.
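A sketch of such a ToDo document follows; the element names (TODO, ITEM, TASK, DEADLINE) and the sample dates are assumptions for illustration, not the cited DTD. Because <DEADLINE> is declared to hold an ISO-format date, any script can interpret it without application-specific knowledge:

```python
# Hypothetical ToDo vocabulary. The semantics of <DEADLINE> -- an ISO
# date -- travel with the document, so this script needs no private
# agreement with whatever application wrote the file.
import xml.etree.ElementTree as ET
from datetime import date

todo = ET.fromstring("""
<TODO>
  <ITEM><TASK>Submit camera-ready paper</TASK><DEADLINE>1998-04-14</DEADLINE></ITEM>
  <ITEM><TASK>Book conference travel</TASK><DEADLINE>1998-03-01</DEADLINE></ITEM>
</TODO>
""")

# Parse each deadline as a real date, then find the most urgent one.
deadlines = [date.fromisoformat(item.findtext("DEADLINE"))
             for item in todo.findall("ITEM")]
print(min(deadlines))  # 1998-03-01
```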
XML supports casual ontologies cost-effectively through dynamically composable DTDs and validatable instances -- addressing the twin failings of SGML: static composition and DTD-dependent validation. Furthermore, XML DTDs are named by URLs, thereby decentralizing, and thus accelerating, the cycle of publishing and adopting new document formats.
Ontologies ultimately embody the survival principle that self-representative or self-describing systems reduce the cost of entry. By developing a Web address for some fragment of knowledge containing programs and/or information, and then sharing the address with someone else, authors can allow the democratic process to play out: anyone in the Web community can indicate that the ideas are good by linking to the handle for future usage. They neither copy the document as is, nor copy the document and make minor changes, because to do so would be too expensive; the cheapest way to propagate ideas is by address. This feeds the cycle of natural selection in knowledge representation: usage determines community, which in turn refines the common ontology.
Suppose while reading this paper you encounter the citation:
"XML, Java, and the Future of the Web," Jon Bosak, World Wide Web Journal, 2(4):219-228, (1997).
Using your intuition, you could ascertain this reference's meaning, whereas a digital parser might be unable to do so. Worse, because of different publishing conventions, another publisher, say the Association for Computing Machinery, might format the reference differently:
J. Bosak, World Wide Web Journal, "XML, Java, and the Future of the Web," 1997, Vol. 2, No. 4, pp. 219-228.
Even minor differences in punctuation and notation can disrupt a computer trying to parse that reference; as a result, automating the conversions between different formats representing the same kind of knowledge is challenging. The use of brittle syntax, with formatting-only and operational semantics, although meaningful enough for human readability, provides little information for machine readability.
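This brittleness can be demonstrated directly: a parser tuned to one publisher's convention fails silently on another publisher's rendering of the same facts. The regular expression below is an illustrative assumption, not any publisher's actual grammar:

```python
# A regex tuned to the first citation style above:
#   "title," author, journal, vol(no):pages, (year).
# The same facts in the second (ACM-like) style do not match at all.
import re

w3j = ('"XML, Java, and the Future of the Web," Jon Bosak, '
       'World Wide Web Journal, 2(4):219-228, (1997).')
acm = ('J. Bosak, World Wide Web Journal, '
       '"XML, Java, and the Future of the Web," 1997, Vol. 2, No. 4, pp. 219-228.')

pattern = re.compile(
    r'^"(?P<title>[^"]+)," (?P<author>[^,]+), (?P<journal>[^,]+), '
    r'(?P<vol>\d+)\((?P<no>\d+)\):(?P<pages>[\d-]+), \((?P<year>\d{4})\)\.$'
)

print(bool(pattern.match(w3j)))  # True
print(bool(pattern.match(acm)))  # False: same knowledge, different punctuation
```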
By reformatting with presentational HTML markup, the citation becomes more accessible, even if the actual rules ("the second italic phrase is the publication") remain invisible:
<I>XML, Java, and the Future of the Web</I>, <I>World Wide Web Journal</I>, <TT>Jon Bosak</TT>, <B>2(4):219-228</B>, <I>1997</I>.
Although HTML lets an author provide some structuring, it requires authors and readers to agree on a convention for the meanings of attributes and values and how they are marked up. This (ambiguous) presentational structuring allows authors and readers to highlight what is important, but it really just represents an evolutionary waystation towards more meaningful structural markup.
By reformatting the citation using structural HTML markup, the actual interpretation rules become somewhat more distinct ("anything in a CITE is a citation title"):
<CITE>XML, Java, and the Future of the Web</CITE>
<H2>World Wide Web Journal</H2>
<H3>Jon Bosak</H3>
<UL>
  <LI> 2(4):219-228
  <LI> 1997
</UL>
The use of structure in citations -- the headers and list elements as well as the more specific use of CITE -- can drive the formatting as well, since style rules can be attached to these structural elements.
By reformatting the citation using a customized XML citation DTD, the actual interpretation rules become very precise:
<BIB>
  <TITLE>XML, Java, and the Future of the Web</TITLE>
  <JOURNAL>World Wide Web Journal</JOURNAL>
  <AUTHOR>
    <FIRSTNAME>Jon</FIRSTNAME>
    <LASTNAME>Bosak</LASTNAME>
  </AUTHOR>
  <VOLUME>2</VOLUME>
  <NUMBER>4</NUMBER>
  <YEAR>1997</YEAR>
  <PAGES>219-228</PAGES>
</BIB>
While there is an XML DTD for citations behind this example, it is still well-formed even without it. Most readers have enough context to understand its semantics, while machines can still manipulate it reasonably well ("list all the AUTHORs").
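Running the "list all the AUTHORs" query against such a citation takes only a few lines of standard-library Python; the XML is reproduced inline so the sketch is self-contained:

```python
# The "list all the AUTHORs" query from the text, run against the
# BIB citation markup shown above.
import xml.etree.ElementTree as ET

bib = ET.fromstring("""
<BIB>
  <TITLE>XML, Java, and the Future of the Web</TITLE>
  <JOURNAL>World Wide Web Journal</JOURNAL>
  <AUTHOR><FIRSTNAME>Jon</FIRSTNAME><LASTNAME>Bosak</LASTNAME></AUTHOR>
  <VOLUME>2</VOLUME><NUMBER>4</NUMBER><YEAR>1997</YEAR><PAGES>219-228</PAGES>
</BIB>
""")

# Every AUTHOR element is unambiguous, so extraction is exact -- no
# guessing which italic phrase was the author's name.
authors = [f'{a.findtext("FIRSTNAME")} {a.findtext("LASTNAME")}'
           for a in bib.findall("AUTHOR")]
print(authors)  # ['Jon Bosak']
```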
This kind of semantic markup allows the information model to be more descriptive, so a machine can capture things a community takes for granted. Syntactic problems such as character encodings and punctuation are defined using structured annotations; document manipulations such as restructuring and filtering can be automated; and each component of a document can be precisely identified.
By reformatting the citation using a combination of public XML citation-related DTDs, the actual interpretation rules remain precise, as well as being accessible on demand from the Web itself:
<?namespace href="http://library.org/bibliography-info" as="BIB"?>
<?namespace href="http://www.w3.org/schemas/rdf-schema" as="RDF"?>
<?namespace href="http://oclc.org/DublinCore/RDFschema" as="DC"?>
<RDF:serialization>
  <RDF:assertions href="http://assertions.org/bib-doc">
    <BIB:TITLE href="http://sunsite.unc.edu/pub/sun-info/standards/xml/why/xmlapps.htm">
      XML, Java, and the Future of the Web
    </BIB:TITLE>
    <BIB:JOURNAL href="http://www.w3j.com/">World Wide Web Journal</BIB:JOURNAL>
    <DC:Creator>Jon Bosak</DC:Creator>
    <BIB:VOLUME>2</BIB:VOLUME>
    <BIB:NUMBER>4</BIB:NUMBER>
    <BIB:YEAR>1997</BIB:YEAR>
    <BIB:PAGES>219-228</BIB:PAGES>
  </RDF:assertions>
</RDF:serialization>
These tags are themselves defined elsewhere on the Web. This reflexivity enables knowledge representation to be published, shared, and extended by the communities that use it.
Efforts such as the Platform for Internet Content Selection (PICS) [Resnick and Miller, 1996] and the Resource Description Format (RDF) [Lassila and Swick, 1997] provide mechanisms for transferring machine-readable metadata describing resources among communities. PICS attaches labels to Web resources, using a URL to identify the rating service and rating scheme.
RDF combines the PICS extensions with the metadata model in Netscape's Meta Content Framework (MCF), yielding both a metadata representation model and an XML-based syntax for metadata capture and transfer. An RDF schema, named using a URL, gives a human- and machine-readable set of assertions of attribute-value pairs. The application of technologies such as PICS and RDF in community ontologies helps determine commonly understood meanings for those tags within any given community that will be using them.
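The <?namespace?> processing instruction shown earlier reflects a 1997 draft; the syntax that was eventually standardized declares namespaces with xmlns attributes instead. This sketch shows how namespace-qualified tags keep two vocabularies distinct in a single document, reusing the hypothetical schema URLs from the example above:

```python
# Two vocabularies mixed in one document, kept distinct by namespace.
# The schema URLs are the hypothetical ones from the citation example;
# the xmlns syntax is the form later standardized, not the 1997 draft.
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<record xmlns:BIB="http://library.org/bibliography-info"
        xmlns:DC="http://oclc.org/DublinCore/RDFschema">
  <BIB:TITLE>XML, Java, and the Future of the Web</BIB:TITLE>
  <DC:Creator>Jon Bosak</DC:Creator>
</record>
""")

# ElementTree exposes qualified names as {namespace-uri}localname, so a
# query can say exactly which community's "Creator" it means.
creator = doc.findtext("{http://oclc.org/DublinCore/RDFschema}Creator")
print(creator)  # Jon Bosak
```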
The World Wide Web Consortium, the driving force behind XML, sees its mission as leading the evolution of the Web. In the competitive market of Internet technologies, it is instructive to consider how the Web trounced competing species of protocols. Though it shared several adaptations common to Internet protocols, such as "free software spreads faster," "ASCII systems spread faster than binary ones," and "bad protocols imitate; great protocols steal," it leveraged one unique strategy: "self-description." The Web can be built upon itself. Universal Resource Identifiers, machine-readable data formats, and machine-readable specifications can be knit together into an extensible system that assimilates any competitors. In essence, the emergence of XML on the spectrum of Web data formats caps the struggle toward realizing the original vision of the Web by its creators.
In fact, the Web appropriated the philosophy of content-neutrality
from MIME types: it learned how to adapt to any document type, new or
established, equally well. On the other hand, some types were more equal
than others: the Web prefers HTML over PostScript, Microsoft Word, and
many others. This preference indicates a general trend over the seven
years of Web history from stylistic formatting to structural markup to
semantic markup. Each step in the ascent of XML adds momentum to Web
evolution, as the following taxonomy of formats summarizes:
| Format           | Syntax           | Style                 | Structure               | Semantics               |
|------------------|------------------|-----------------------|--------------------------|-------------------------|
| Text             | ASCII, Unicode...| --                    | Lines                    | Natural Language        |
| troff            | Readable Text    | Inline Directives     | Sections, Pages          | Typesetting             |
| TeX              | Readable Program | LaTeX                 | Sections, Pages          | Typesetting             |
| Rich Text Format | Opaque Text      | Extensible Directives | Characters, Paragraphs   | --                      |
| HTML formatting  | Readable Text    | Nested Directives     | Presentational           | --                      |
| HTML structure   | Readable Text    | CSS                   | Declarative              | Fixed (e.g., <ADDRESS>) |
| XML              | Readable Text    | CSS, XSL              | Declarative              | Extensible              |
| RDF              | XML Text         | --                    | Declarative              | Metadata Schema         |
This evolution toward declarative formats exists not only in the realm of documents, but in the programming community as well. The same forces appear to be at work in the successive development of more declarative, less operational programming languages: for example, from machine code to assembly to BCPL to C to C++ to Java [Lea, 1998].
The Web itself is becoming a kind of cyborg intelligence: human and machine, harnessed together to generate and manipulate information. If automatability is to be a human right, then machine assistance must eliminate the drudge work involved in exchanging and manipulating knowledge, as MIT Laboratory for Computer Science Director Michael Dertouzos has argued [Dertouzos, 1997]. The shift from structural HTML markup to semantic XML markup is a critical phase in the struggle to transform the Web from a global information space into a universal knowledge network.
We thank the reviewers, Dan Connolly, Tim Berners-Lee, and Doug Lea for their comments and recommendations.
Mr. Khare's work was sponsored by the Defense Advanced Research Projects Agency and Air Force Research Laboratory, Air Force Materiel Command, USAF, under agreement number F30602-97-2-0021. He would also like to thank MCI Internet Architecture for its support in this research.
Mr. Rifkin's work was supported under the Caltech Infospheres Project, sponsored by the CISE directorate of the National Science Foundation under Problem Solving Environments grant CCR-9527130 and by the NSF Center for Research on Parallel Computation under Cooperative Agreement Number CCR-9120008.
Rohit Khare, firstname.lastname@example.org
Rohit Khare joined the
Ph.D. program in computer science at the University of California,
Irvine in Fall 1997, after serving as a member of the MCI Internet
Architecture staff. He was previously on the technical staff of the
World Wide Web Consortium at MIT, where he focused on security and
electronic commerce issues. He has been involved in the development of
cryptographic software tools and Web-related standards development.
Rohit received a B.S. in Engineering and Applied Science and in
Economics from California Institute of Technology in 1995.
Adam Rifkin, adam at xent dot com
Adam Rifkin received his B.S. and M.S. in Computer Science from the College of William and Mary. He is presently pursuing a Ph.D. in computer science at the California Institute of Technology, where he works with the Caltech Infospheres Project on the composition of active distributed objects. He has done Internet consulting and performed research with several organizations, including Canon, Hewlett-Packard, Griffiss Air Force Base, and the NASA-Langley Research Center.