An Initial Investigation of XML, by Adam Rifkin

In which the author attempts to explain how XML mechanisms will pave the path for revolutionary new applications in the World Wide Web in the areas of markup, transfer, archiving and persistence, automation, distributed transactions, metadata, data exchange, database interchange, Web-based sharing, distributed objects, and distributed computations.

Disclaimer: This page is for my own personal reference only; it is not intended to be an authoritative word on ANYTHING. Email me with suggestions on how to make this writeup better.

Table of Contents

  1. What is XML?
  2. How Can XML Be Used?
  3. Application: XML and Archiving
  4. What is a Document?
  5. XML and Metadata
  6. Web Links to XML-Related Information
    1. XML for Beginners
    2. Advanced XML
    3. XML-Related Issues and Papers

What is XML?

HTML is the HyperText Markup Language standardized by W3C for storing and exchanging documents on the World Wide Web. HTML was designed to be simple enough to support ease of authoring Web pages, rich enough to support multimedia embedding in documents, and flexible enough to support hypertext linking.

HTML is based on SGML, the Standard Generalized Markup Language standardized by ISO for defining and using portable document formats. SGML was designed to be formal enough to allow proofs of document validity, structured enough to handle complex documents, and extensible enough to support management of large information repositories.

W3C's SGML working group, with present efforts given in their activity page, is attempting to standardize the delivery (in Web documents) of self-describing data structures with arbitrary depth and complexity. To that end, they are simplifying SGML for use with the Web (and Web technologies such as Java).

XML, Extensible Markup Language, is a simplified (but strict) subset of SGML that maintains the SGML features of validation, structure, and extensibility. XML is a standardized text format designed specifically for transmitting structured data to Web applications. In addition, XML's goals of being easier to learn, use, and implement than full SGML will have clear benefits for World Wide Web users, making it easier to define and validate document types, to author and manage SGML-defined documents, and to transmit and share them across the Web.

The Extensible Markup Language specification describes XML documents, a class of data objects stored on computers, and partially describes the behavior of XML processor programs used to read XML documents and provide access to their content and structure. XML allows generic SGML to be served, received, and processed on the Web in a manner similar to what is done with HTML today. XML has been designed for ease of implementation and for interoperability with both SGML and HTML.

XML documents are composed of entities, which are storage units containing text and/or binary data. Text is composed of character streams that form both the document character data and the document markup. Markup describes the document's storage layout and logical structure. XML also provides a markup mechanism to impose constraints on the storage layout and logical structure of documents.

XML and SGML

XML, like SGML, is a meta-language for describing the markup of different types of documents. However, its specification is 26 pages (versus 500 for SGML!). The W3C hopes that offering a simplified version of SGML will make implementing SGML much more palatable to vendors of Web authoring and browsing tools.

XML is not a replacement for SGML. Many features of SGML were left out to keep XML simple. Current SGML users may choose XML for network delivery, and since XML is a valid subset of SGML, the translation from SGML to XML is straightforward. XML was developed as an easy on-ramp to SGML for people who are not yet using it.

To simplify SGML, the W3C working group dropped support for certain features that put a heavy processing burden on SGML client software. For example, a well-formed XML document is unambiguous, so a browser or editor can read the tags and create a tree of the hierarchical structure without having to read its document type definition. XML also does not allow markup minimization, requires that empty elements be self-identifying, and does not support several other complex SGML standard features.
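
As a rough illustration of that point, here is a minimal Python sketch that reads a small well-formed document and builds its tree without consulting any document type definition. The tag names (MEMO, TO, BODY, EMPH, ATTACHMENT) and the helper outline are made up for this example; xml.etree.ElementTree is the non-validating parser from Python's standard library, not anything prescribed by the XML draft.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: no DTD is declared or consulted anywhere.
# Note the self-identifying empty element <ATTACHMENT ... />.
DOC = """<MEMO>
  <TO>Rohit</TO>
  <BODY>See the <EMPH>new</EMPH> draft.<ATTACHMENT NAME="draft.xml"/></BODY>
</MEMO>"""

def outline(element, depth=0):
    """Return one indented line per element, in document order."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

root = ET.fromstring(DOC)
print("\n".join(outline(root)))
```

The parser needs only the tags themselves to recover the hierarchy, which is the property that lets a generic browser or editor handle tagsets it has never seen before.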

XML and HTML

XML is not a replacement for HTML, either: HTML is a useful tool for storing and exchanging small hypermedia documents across the Internet. Furthermore, it is easy to generate HTML documents on the fly from XML (or SGML) documents. XML is designed to complement HTML by enabling different kinds of data to be exchanged over the Web.

For example, current limitations in World Wide Web technologies do not allow the extensibility, structure, and data checking necessary for large-scale commercial Web publishing. Jon Bosak's excellent paper "XML, Java and the Future of the Web" explains how XML can enable advanced Web applications, allowing Java applets to embed powerful, automatable data manipulation facilities directly into Web clients.

Unlike HTML, which has a fixed (though ever-changing) set of tags, XML lets you define your own tags and attributes. Support for XML by the Internet community would open up vast new possibilities for Internet publishing: instead of shoehorning all documents into HTML, or having to invent a browser to handle non-HTML documents, XML would enable a wide array of documents with user-defined tagsets to be handled by generic Web application software. As Tim Bray pointed out, "[XML allows us to] finally get off the HTML treadmill."
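
The earlier remark that HTML is easy to generate on the fly from XML can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: the RECIPE tagset, the TAG_MAP table, and the choice to fall back to a span for unknown tags. A real system would more likely use a stylesheet mechanism than a hand-written mapping.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical user-defined tagset mapped onto fixed HTML tags.
TAG_MAP = {"RECIPE": "div", "TITLE": "h1", "INGREDIENTS": "ul", "INGREDIENT": "li"}

def to_html(element):
    """Recursively render an XML element as an HTML string."""
    tag = TAG_MAP.get(element.tag, "span")  # unknown tags fall back to <span>
    parts = [f"<{tag}>", escape(element.text or "")]
    for child in element:
        parts.append(to_html(child))
        parts.append(escape(child.tail or ""))  # text that follows the child
    parts.append(f"</{tag}>")
    return "".join(parts)

DOC = (
    "<RECIPE><TITLE>Tea &amp; Toast</TITLE>"
    "<INGREDIENTS><INGREDIENT>tea</INGREDIENT><INGREDIENT>bread</INGREDIENT></INGREDIENTS>"
    "</RECIPE>"
)
html_text = to_html(ET.fromstring(DOC))
print(html_text)
```

The conversion only goes one way: the HTML output has lost the information that one list item was an INGREDIENT, which is exactly the structure XML is meant to preserve.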

XML and Java

Presently, an author can create rich documents with an application, and then use a Java applet viewer to attach those documents to Web pages. As long as the browsers continue to provide only crude formatting, such measures are unfortunate but inevitable, much in the same way people use desktop publishing applications to get better typography than can be done with off-the-shelf word processors.

But there is no reason why the concept of a "basic Web page" needs to be limited to a single tag set! The appeal of the Web is its simple hypertext scheme, which provides a simple, unambiguous method of pointing to files with unique names. Although it is handy that HTML is also simple, the success of word processors has demonstrated that consumers can cope with multiple document types.

When XML becomes more widespread, Web authoring tools will become much more flexible in handling basic document constructs. WordPerfect and Word will export directly into XML, using the style names as tags instead of filtering everything into 90 (or however many currently exist) predefined tags.

In such a brave new World Wide Web, Java's role will be to do interesting things with the content, such as mediation between formats, computation and event handling, automation of tasks and dynamic content, presentation of different views to different viewers, and even intelligent filtering of content. As XML specification co-editor Tim Bray succinctly put it, "XML gives Java something to chew on."

Learning More about XML

If you are a complete novice, start learning what XML provides (that HTML does not) by reading Dave Siegel's article, "The Web is Ruined and I Ruined It". I have also collected several other links for XML beginners.

Bert Bos' simple XML is a good place to find a few examples of XML files. He also describes the XML data model, which represents information content of an XML document as the linearization of a tree structure with several character strings at each node of the tree.

For answers to frequently asked questions about XML, see the overviews at Textuality and University College Cork. I have also collected several advanced XML links and XML-related issues and papers.


How Can XML Be Used?

XML provides a standard way for information providers to add custom markup to information-rich documents, so that complex documents can be rendered (and published) in a dynamic way. Compare this with HTML: although HTML is easy to use, its simplicity significantly constrains how publishers and users can represent and use documents and databases. XML provides the means to publish and receive any information, regardless of format or origin, in any way desired.

With XML, users will be able to manage documents dynamically in a Web browser, allowing the presentation of personalized Web-based information. Content providers will be able to distribute structured databases that users can manipulate at will, due to several features XML (like SGML) provides that are not available in HTML:

  1. Extensibility: Authors can define new tag names and attribute names for documents by specifying their syntax and semantics.
  2. Structure: Documents can be containers for other documents, with arbitrary nesting. This allows complex documents to be constructed from simpler documents. Please see the section What is a Document? to understand how rich this feature is.
  3. Validation: If desired, any XML document can contain a description of its grammar so applications can validate that the document conforms to its specified structure. Furthermore, this process of validation can be automated.
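
To make the validation feature a little more concrete, here is a minimal Python sketch of automated structural checking. It rests on a loud assumption: Python's standard-library parser does not validate against a DTD, so a hand-written table of allowed child tags (GRAMMAR) stands in for a real grammar. The ORDER/CUSTOMER/ITEM tagset and the violations helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical grammar: each tag maps to the set of child tags it may contain.
# This is a stand-in for a DTD, not an implementation of DTD validation.
GRAMMAR = {
    "ORDER": {"CUSTOMER", "ITEM"},
    "CUSTOMER": set(),
    "ITEM": set(),
}

def violations(element, grammar):
    """Return a list of structural errors found at or below `element`."""
    errors = []
    if element.tag not in grammar:
        errors.append(f"undeclared element {element.tag}")
    allowed = grammar.get(element.tag, set())
    for child in element:
        # Undeclared children are reported once, when we recurse into them.
        if child.tag in grammar and child.tag not in allowed:
            errors.append(f"{child.tag} not allowed inside {element.tag}")
        errors.extend(violations(child, grammar))
    return errors

good = ET.fromstring("<ORDER><CUSTOMER>Adam</CUSTOMER><ITEM>book</ITEM></ORDER>")
bad = ET.fromstring("<ORDER><ITEM><CUSTOMER>Adam</CUSTOMER></ITEM><NOTE/></ORDER>")
print(violations(good, GRAMMAR))
print(violations(bad, GRAMMAR))
```

Both documents are well-formed, so both parse; only the second one breaks the declared structure. That distinction between well-formedness and validity is what lets validation remain optional.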

The working draft for XML 1.0 provides a complete specification in two parts: the extensible markup language itself, and methods for associating hypertext linking and stylesheet mechanisms with XML. From this specification, we observe that expressive power, teachability, and ease of implementation were all major design considerations. And although XML is not backward-compatible with existing HTML documents, we note that documents that are HTML 3.2-compliant can easily be converted to XML, as can documents conforming to ISO 8879 (SGML) and documents generated from databases.

As far as XML acceptance is concerned, Microsoft has embraced XML for future releases of its Internet Explorer, and Netscape is considering doing the same for its browser. XML will most likely gain widespread acceptance once the power of XML-enabled applications is realized. Please see Jon Bosak's paper for detailed descriptions of some XML-enabling applications; according to Bosak, these driving applications will include:

  1. Applications that require a Web client to mediate between two or more heterogeneous databases.
  2. Applications that attempt to distribute a significant portion of the processing load from a Web server to a Web client.
  3. Applications that require a Web client to present different views of the same data to different users.
  4. Applications in which intelligent Web agents attempt to tailor information discovery to the needs of individual users.
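
The third kind of application in the list above (different views of the same data) can be sketched on the client side as follows. This is only an illustration in Python rather than a Java applet, and the CATALOG/BOOK/TITLE tagset, the PRICE attribute, and both view functions are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical structured data delivered once to the client.
DOC = """<CATALOG>
  <BOOK PRICE="30"><TITLE>Alpha</TITLE></BOOK>
  <BOOK PRICE="12"><TITLE>Beta</TITLE></BOOK>
  <BOOK PRICE="45"><TITLE>Gamma</TITLE></BOOK>
</CATALOG>"""

catalog = ET.fromstring(DOC)

def titles_by_price(root):
    """View 1: every title, cheapest first."""
    books = sorted(root.findall("BOOK"), key=lambda b: float(b.get("PRICE")))
    return [b.findtext("TITLE") for b in books]

def titles_under(root, limit):
    """View 2: only titles priced below `limit`, in document order."""
    return [b.findtext("TITLE") for b in root.findall("BOOK")
            if float(b.get("PRICE")) < limit]

print(titles_by_price(catalog))
print(titles_under(catalog, 40))
```

Because the prices arrive as labeled data rather than as formatted text, both views are computed locally, with no second trip to the server.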

Although scripts embedded in HTML documents may provide some solutions to these problems, XML does not commit authors to individual script languages, authoring tools, and delivery engines. Instead, XML offers an "open standards" solution that is vendor-independent, giving information providers greater control over their content and context.

We note that the current XML draft only addresses syntax, and consequently XML alone can at present only be used for interprocess communication and for the delivery of documents to specialized applications (or plug-ins) that have been configured to interpret a particular XML grammar. However, much more work on XML is on the way.


Application: XML and Archiving

After writing "Webs of Archived Distributed Computations for Asynchronous Collaboration" for the Journal of Supercomputing with my coauthors in the Caltech Infospheres Project, my discussions with Rohit Khare led to my understanding that XML can provide useful mechanisms for archiving information, denoting machine-readable and person-readable descriptions of document contents, automating services, and mechanizing persistent transactions.

After perusing Rubinsky and Maloney's SGML and the Web, my hunch is even stronger that XML is the ultimate persistence format, simply by tagging everything properly so it fits into the larger schema of the Web. I may decide to use this idea in conjunction with infospheres sessions or elsewhere in the infospheres framework for my PhD work here at Caltech.

As discussed with Mark Baker in an epiphany or maybe two on the dist-obj list, XML can also be used for automation and self-description. Instead of just having Java view objects, business components can negotiate with an XML container for presentation and data exchange. Since any XML document can contain a description of its grammar, applications can use this description to perform structural validation. Plus, XML includes the ability to define new tags so that XML documents can become self-describing containers (so I can tag the data "Adam" as the value of an attribute FIRSTNAME for the tag <PERSONNAME>). Can you see how this is quite similar to WIDL both in design philosophy and actual implementation?
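
The PERSONNAME example can be written out as a short Python sketch. The element and attribute names come from the paragraph above; the round trip through a string and the variable names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Build the self-describing container from the text: the data "Adam" is the
# value of the FIRSTNAME attribute on a PERSONNAME element.
person = ET.Element("PERSONNAME", {"FIRSTNAME": "Adam"})
serialized = ET.tostring(person, encoding="unicode")
print(serialized)

# A receiver with no prior agreement on the format can still discover
# what each piece of data is called.
received = ET.fromstring(serialized)
fields = dict(received.attrib)
print(received.tag, fields)
```

The data travels with its own labels, which is what lets a generic container hand it to a business component without a private, pre-arranged record layout.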

In short, XML provides a mechanism for streaming self-defining data containers into a browser, from an object... all using a simple document model! Furthermore, this simple document model is useful both to users of the Web, and to users of CORBA-compliant objects... which begs the question, what IS the difference between an "object" and a "document"? Turns out, not a whole lot...


What is a Document?

Although most people still do not realize it, both the "Objects" community and the "Web" community share the same vision, including a somewhat similar architecture: Web servers are document factories, and likewise, object-request brokers are object factories. This was the very premise of the joint W3C/OMG workshop on distributed objects and mobile code in 1996, for which we provided a position paper on infospheres. Our paper, like many other papers for that workshop, held the firm belief that objects and Web documents are two sides of the same coin.

Combine this observation with the fact that A DOCUMENT CAN BE ANYTHING -- any bag of bits on the client-side and/or server-side -- from

  1. a static bunch of text (with or without semantic markup given by XML), to
  2. image maps, to
  3. embeddable multimedia types, to
  4. a semantic and/or syntactic description of interfaces (see WIDL), to
  5. an object (in the Object community's sense -- see the OBJECT tag), to
  6. a script (in the scripting community's sense -- see the SCRIPT tag), to
  7. actual embeddable running executable code (from Java applets to Infospheres, and everything in between) running on the client and/or server side (actually, it's more than just code: it is code, state, and IDENTITY together in a nice package... identity is probably the most important item that can be shipped...), to
  8. various container mechanisms for combining/composing other documents within a document (from Java Beans to Active X containers to Infosphere Djinns, and everything in between), to
  9. linking mechanisms tying a document to any other document,
and you really begin to see the possibilities. Add in a Protocol Extension Protocol for use with the next generation of the hypertext transport protocol HTTP, and Web peers can do content negotiation automatically, on the fly. To that, add an economic model such as kudos, and suddenly technologies like munchkins are feasible!

Let's go a little further in exploring documents and objects. A document is a bag of bits and an explanation of how those bits work. So is an object. A document can have a specified interface. So can an object. A document can have specified behavior for itself and/or its components. So can an object. A document can provide semantic clues to what it does. So can an object. A document can be distributed. So can an object. A document can encapsulate other documents. An object can encapsulate other objects. A document can inherit its type from another document, as can an object. Document types can be polymorphic, as can object types.

A document can send a message to another document. A document can have a notion of an "active thread of control." A document can invoke a method on another document.


XML and Metadata

Sometimes browsers and other users of documents need to know about those documents to check the meaning of things, parse the contents, and use the documents effectively. Metadata is the solution to this problem, and I believe XML is a fine way to express metadata.

Jim Whitehead's paper "A Proposal for Web Metadata Operations" defines metadata to be "information about information... Information on the Web, known as Web resources, have many pieces of associated descriptive information which is often not explicitly represented in the resource itself. Examples of metadata include the creator of a resource, its subject, length, publisher, creation date, etc. Such descriptive metadata can be used to make information easier to locate by improving Web searches, rate information to protect children from indecent content (e.g. the Platform for Internet Content Selection (PICS)), capture copyright information, contain a digital signature, or store cataloging data. Many other uses are also possible."

Jim Whitehead goes on to declare that "Another type of metadata is the relationship. A relationship captures an association between two or more resources, and can be one to one, one to many, or many to many. Relationships can be used to capture navigational relationships, such as "go to this resource next," or a table of content, and can also express hierarchies (parent/child, successor/predecessor) Relationships have many domain-specific uses, such as a piece of software which has many "implements" relationships with a requirements document. Annotations are another use of relationships in which the relationship points to commentary material on the resource. The use of relationships to capture associations between data items is an old idea, stemming from semantic data modeling, and early hypertext work on the NLS and Xanadu systems."

Metadata on the Web

Now, let's take this a level further. Metadata on the Web should be applicable to resources of any media type. Note that on today's Web, many resources are not HTML (to name a few examples, Java applets, Adobe Portable Document Format, and Adobe Postscript). Metadata on the Web should provide descriptive information about Web resources of any media type, including those that have no built-in provision for storing general purpose metadata (and never will).

Rohit Khare identified some overlap between some current metadata proposals:

  1. PICS. They realized that URLs are not enough for secure pointers, so they had to figure out how to differentiate the different versions, media types, languages, etc. that could be behind one location. Led to...
  2. PICS-NG. collided with a separate intra-PICS movement to have more structured rating values (strings, structs, pathnames, set inclusion/exclusion, etc). Ora Lassila from Nokia is working on that draft at W3C.
  3. WebDAV. Jim Whitehead's team effort with respect to metadata is described in his "Proposal for Web Metadata Operations".
  4. Dublin Core, et al. Actual, concrete metadata schemas were banging on our door for acceptance too. Who defines "author", "publisher", etc? The usual digital library community suspects.
  5. SiteMap. Microsoft originally proposed a stylized use of HTML to outline a site, for use in collapsible 'remote controls' and printing. Used nested ULs to indicate hierarchy, etc -- too much tacit knowledge.
  6. Digital signature manifests. It becomes evident almost instantly that one needs to sign packages, not atomic blobs, so we needed a DSIG Common Manifest Format for enumerating bills of materials.
  7. Email to HTML. Qualcomm wanted to use HTML as the native UI format for mail, but needed a way to structurally mark up quoted regions, etc. Drove an ABOUT tag proposal Dave Raggett made earlier, in order to associate metadata about one quotee in several quotes.
  8. XML. Of course, at the same time as wars are being fought between ()s (PICS) and {}s (PEP), <> has emerged as the industry standard for "open dust" (e.g. Open Financial Exchange, most amusingly, HDML, most corrosively). So SiteMaps morphed into XML-syntax-based proposals. Hence the CDF submission, metadata about push channels rendered in XML.

While these metadata proposals are in some sense roughly related (they all use the word metadata), most of them are complementary technologies. For example, Dublin Core, MARC, and RFC 1807 (the Dienst bibliography format) are all bibliographic record formats, created (as you would hope) by researchers from the digital libraries community. These formats were NOT intended to solve the general-purpose Web metadata problem -- for example, none of these bibliographic record formats can effectively convey PICS-like rating information. On the other hand, PICS is not a good bibliographic record format. Thus, PICS and Dublin Core/MARC/RFC 1807 can peacefully coexist.

The WebDAV Proposal

Jim Whitehead's "Proposal for Web Metadata Operations" gives a framework that explains the relationship between "large chunk" metadata proposals like PICS, PICS-NG, Dublin Core, MARC, Web Data, etc. and the "small chunk" HTTP extensions proposal. His proposal contains an extensive hyperlinked reference section, which makes it easy to track down the source material being described. The remainder of this section is quoted from Jim Whitehead himself off the FoRK mailing list.

Basically, the proposal extends the HTTP object model to create a new area for state storage within a resource, to be used for the storage of name/value metadata pairs. While there is no effective upper bound on the length of a metadata item (and hence you could make a name/value pair like "PICS-label", "{an instance of a PICS label}"), typically you'd want to create a link on the resource which points to the PICS label, which is itself stored as a separate resource. Methods are introduced to create name/value pairs (ADDMETA), delete name/value pairs (DELMETA), and to access name/value pairs (GETMETA). The GETMETA method is bundled with a simple s-expression-like search syntax, so if you want to get a listing of all the attributes on a resource you'd pass a search specification of (OR (AND (name "*")(value "*"))). Hypertext links are defined as a special type of metadata with some constraints on the format and semantics of the value of the link name/value pair (e.g., name="DAV:/Link", value="Type = {token} Source = {URI} Dest = {URI}").
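
To get a feel for the object model described above, here is a toy in-memory sketch in Python. It is emphatically not the proposal's wire protocol: there is no HTTP involved, the Resource class and its method signatures are invented, overwriting an existing name on ADDMETA is my assumption, and shell-style wildcard matching (fnmatch) only approximates a search like (AND (name "*")(value "*")). The link value fills the Type/Source/Dest template with hypothetical tokens and example.org URIs.

```python
from fnmatch import fnmatchcase

class Resource:
    """Toy stand-in for a Web resource with a name/value metadata area."""

    def __init__(self):
        self._meta = {}

    def addmeta(self, name, value):
        # Models ADDMETA: create a pair (overwriting is an assumption here).
        self._meta[name] = value

    def delmeta(self, name):
        # Models DELMETA: remove the pair if it exists.
        self._meta.pop(name, None)

    def getmeta(self, name_pattern="*", value_pattern="*"):
        # Models GETMETA: return pairs whose name AND value match the patterns.
        return {
            name: value
            for name, value in self._meta.items()
            if fnmatchcase(name, name_pattern) and fnmatchcase(value, value_pattern)
        }

page = Resource()
page.addmeta("creator", "Jim")
# Hypothetical link from a page to a separately stored rating label.
page.addmeta(
    "DAV:/Link",
    "Type = rating Source = http://example.org/a Dest = http://example.org/a.label",
)
print(page.getmeta())        # every pair on the resource
print(page.getmeta("DAV:*")) # only the link metadata
page.delmeta("creator")
print(page.getmeta())
```

Keeping the label itself in a separate resource and storing only a link, as the sketch does, matches the advice quoted above about large metadata items.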

The WebDAV proposal supports small chunk metadata and large chunk metadata. It doesn't address packaging issues, because there are already many proposals for how to package metadata. Far fewer proposals actually address how this metadata is stored and associated with the resources it describes. Because the proposal is implemented via HTTP, it also provides the ability to store metadata on resources of *any* content type, not just HTML. The WebDAV proposal describes "how" metadata is stored and associated, while efforts like Dublin Core, PICS-NG, Web Data, etc., describe "what" metadata is stored, and its packaging.

Thus the WebDAV proposal is complementary to packaging efforts such as Dublin Core, PICS-NG, Web Data, XML, Digital Signature manifests, and so on.

Regarding the Protocol Extension Protocol, PEP only describes extensions to HTTP that involve adding new headers to modify the semantics of existing methods. WebDAV is proposing to add several new methods to HTTP, and hence is outside the scope of PEP. This applies equally to methods like COPY and MOVE as well as to methods like GETMETA. As for HTTP purity, Roy Fielding was present at the meeting where they crafted the GETMETA method (he helped write the BNF for the search syntax), and there are few others who can claim the mantle of "HTTP purist" more effectively than he.

And Yet, I Still Have XML on my Mind

That aside, I still think XML has an excellent role to play in the evolution of metadata for the Web. Then again, I think trust on the Web is really important, too. It's definitely something to think about. Meanwhile, enjoy some of the links I've collected pointing to some of the best XML information I could find on the Web.


Links: XML for Beginners

  1. The Web is Ruined and I Ruined It by Dave Siegel (mirror on FoRK / mirror at SIL)
  2. What is XML?
  3. Building Blocks - the relationship between WIDL and XML
  4. Thoughts and Software on XML by Bert Bos
  5. XML at Textuality
  6. XML at W3C article
  7. Design Principles for XML
  8. webMethods' technology
  9. SGML: A Gentle Introduction from Text Encoding Initiative

Links: Advanced XML

  1. Archives of the W3C-SGML-WG Mailing List
  2. Archives of the XML Developers Mailing List
  3. Commonly Asked Questions about XML
  4. DSSSL Online Application Profile / JADE / SP (a C++ parser for SGML and XML, by James Clark)
  5. Extended Pointers in TEI
  6. FAQ about XML
  7. Inserting Objects into HTML with the OBJECT tag (client side includes)
  8. Inserting Scripts into HTML with the SCRIPT tag (client side scripting)
  9. JML, a Java Extension to HTML
  10. Lark, the full XML parser in Java, by Tim Bray
  11. Linguistics: Representation and Exchange of Knowledge
  12. news:comp.text.sgml
  13. NXP, a validating XML parser in Java, by Norbert Mikula
  14. PEP, the Protocol Extension Protocol
  15. QWeb, a Web browser that displays SGML documents (including HTML) according to style sheets. Features: progressive rendering of _everything_; parses SGML documents and displays them according to a style sheet; dynamically builds an SGML document parser from a DTD; normal and preformatted text, as well as tables and images; entire document appearance is customizable through the style sheet; uses non-blocking I/O for throughput and responsiveness; server-side image maps; multi-level bookmarks; caches Web pages in memory; "file" and "http" URL methods; multiple open browser windows; history of visited URLs; follows HTTP redirects automatically; SGML Catalog editor; Style Sheet editor; supports HTML forms.
  16. SGML Web page by Robin Cover, THE page for SGML
  17. SGML: Academic Projects by Robin Cover
  18. SGML, DSSSL, and HyTime
  19. Simple XML, a simplified XML with Java software, by Bert Bos
  20. Specialists in Generalized Markup Language
  21. TEI Extended Pointers, a brief tutorial
  22. URNs // URNs
  23. VTML - Versioned Text Markup Language
  24. Warwick Framework // Active Annotations by Ron Daniel
  25. WIDL, the Web Interface Definition Language from webMethods to automate the Web
  26. W3C Activity in XML
  27. XML, Java, and the future of the Web by Jon Bosak
  28. XML 1.0 Specification working draft (mirror)
  29. XML: Summary Descriptions and News Articles by Robin Cover

Links: XML-Related Issues and Papers

  1. Architecture of the World Wide Web
    1. An Evaluation of the World Wide Web with respect to Engelbart's Requirements, by Dan Connolly.
    2. Addressing and Naming (URLs, URIs, ...)
    3. Distributed Objects (and mobile code)
    4. Exchange (HTTP)
    5. Interfaces in Web Objects
    6. MarkUp (SGML/XML, Style Sheets, I18N, Math)
    7. Propagation, Replication and Caching
    8. Real Time Multimedia
    9. Web Architecture Index of Terms
  2. About Document Formats by Tim Berners-Lee
  3. A Proposal for Web Metadata Operations by Jim Whitehead
  4. Distributed Objects Mailing List
  5. Infospheres, a mechanism for distributed computations on the Internet using Java
  6. Inserting Objects into HTML with the OBJECT tag (client side includes)
  7. Inserting Scripts into HTML with the SCRIPT tag (client side scripting)
  8. Link Reliability - Why URNs are Not the Answer by Dan Connolly
  9. PEP, the Protocol Extension Protocol
  10. Scripting: Higher Level Programming for the 21st Century by John K. Ousterhout
  11. Toward a Formalism for Communication on the Web by Dan Connolly
  12. W3C Technical Reports
  13. Weaving a Web of Trust by Rohit Khare and Adam Rifkin
  14. WIDL, the Web Interface Definition Language from webMethods to automate the Web
  15. XML, Java, and the future of the Web by Jon Bosak

From Rifkin's List of Links
by adam at xent dot com (last modified June 22, 1997)