In which the author attempts to explain how XML mechanisms will pave the path for revolutionary new applications in the World Wide Web in the areas of markup, transfer, archiving and persistence, automation, distributed transactions, metadata, data exchange, database interchange, Web-based sharing, distributed objects, and distributed computations.
Disclaimer: This page is for my own personal reference only; it is not intended to be an authoritative word on ANYTHING. Email me with suggestions on how to make this writeup better.
HTML is the HyperText Markup Language standardized by W3C for storing and exchanging documents on the World Wide Web. HTML was designed to be simple enough to support ease of authoring Web pages, rich enough to support multimedia embedding in documents, and flexible enough to support hypertext linking.
HTML is based on SGML, the Standard Generalized Markup Language standardized by ISO for defining and using portable document formats. SGML was designed to be formal enough to allow proofs of document validity, structured enough to handle complex documents, and extensible enough to support management of large information repositories.
W3C's SGML working group, with present efforts given in their activity page, is attempting to standardize the delivery (in Web documents) of self-describing data structures with arbitrary depth and complexity. To that end, they are simplifying SGML for use with the Web (and Web technologies such as Java).
XML, Extensible Markup Language, is a simplified (but strict) subset of SGML that maintains the SGML features of validation, structure, and extensibility. XML is a standardized text format designed specifically for transmitting structured data to Web applications. In addition, XML's goals of being easier to learn, use, and implement than full SGML will have clear benefits for World Wide Web users, making it easier to define and validate document types, to author and manage SGML-defined documents, and to transmit and share them across the Web.
The Extensible Markup Language specification describes XML documents, a class of data objects stored on computers, and partially describes the behavior of XML processor programs used to read XML documents and provide access to their content and structure. XML allows generic SGML to be served, received, and processed on the Web in a manner similar to what is done with HTML today. XML has been designed for ease of implementation and for interoperability with both SGML and HTML.
XML documents are composed of entities, which are storage units containing text and/or binary data. Text is composed of character streams that form both the document character data and the document markup. Markup describes the document's storage layout and logical structure. XML also provides a markup mechanism to impose constraints on the storage layout and logical structure of documents.
XML, like SGML, is a meta-language for describing the markup of different types of documents. However, its specification is 26 pages (versus 500 for SGML!). The W3C hopes that offering a simplified version of SGML will make implementing SGML much more palatable to vendors of Web authoring and browsing tools.
XML is not a replacement for SGML. Many features of SGML were left out to keep XML simple. Current SGML users may choose XML for network delivery, and since XML is a valid subset of SGML, the translation from SGML to XML is straightforward. XML was developed as an easy on-ramp to SGML for people who are not yet using it.
To simplify SGML, the W3C working group dropped support for certain features that put a heavy processing burden on SGML client software. For example, a well-formed XML document is unambiguous, so a browser or editor can read the tags and create a tree of the hierarchical structure without having to read its document type definition. XML also does not allow markup minimization, requires that empty elements be self-identifying, and does not support several other complex SGML standard features.
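To make the point about well-formedness concrete, here is a minimal sketch (mine, not from the XML draft) using Python's standard xml.etree.ElementTree parser, which never consults a document type definition. The memo document and the helper names outline and is_well_formed are made up for illustration.

```python
import xml.etree.ElementTree as ET

# An invented document: no DTD is supplied, yet the structure is unambiguous.
# Note the self-identifying empty element <br/>.
DOC = """<memo priority="high">
  <to>Web team</to>
  <body>Ship the <em>XML</em> parser.<br/></body>
</memo>"""


def outline(element, depth=0):
    """Return one indented line per element, in document order."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


def is_well_formed(text):
    """True if the text parses as XML; no DTD is consulted."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


root = ET.fromstring(DOC)
tree_lines = outline(root)
print("\n".join(tree_lines))

# SGML-style tag omission is rejected: the <li> elements are never closed.
print(is_well_formed("<ul><li>one<li>two</ul>"))
```

The second call shows the flip side of dropping markup minimization: a fragment that relies on omitted end tags simply fails to parse, so the reader never has to guess at the structure.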
XML is not a replacement for HTML, either: HTML is a useful tool for storing and exchanging small hypermedia documents across the Internet. Furthermore, it is easy to generate HTML documents on the fly from XML (or SGML) documents. XML is designed to complement HTML by enabling different kinds of data to be exchanged over the Web.
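As a rough illustration of generating HTML on the fly from XML, the sketch below turns a small catalog into an HTML table using only Python's standard library. The catalog vocabulary (catalog, book, title, price) and the function name catalog_to_html are assumptions of mine; a real site would more likely use a stylesheet mechanism.

```python
import xml.etree.ElementTree as ET
from html import escape

# Invented sample data in an invented tag set.
CATALOG = """<catalog>
  <book><title>Tags &amp; Trees</title><price>12.50</price></book>
  <book><title>Markup for Everyone</title><price>8.00</price></book>
</catalog>"""


def catalog_to_html(xml_text):
    """Render each <book> as one row of an HTML table."""
    root = ET.fromstring(xml_text)
    rows = []
    for book in root.findall("book"):
        # The parser has already decoded &amp; to "&", so re-escape for HTML.
        title = escape(book.findtext("title", default=""))
        price = escape(book.findtext("price", default=""))
        rows.append(f"<tr><td>{title}</td><td>{price}</td></tr>")
    return "<table>\n" + "\n".join(rows) + "\n</table>"


html_page = catalog_to_html(CATALOG)
print(html_page)
```

The structured source stays in XML; HTML is just one of many views that can be derived from it.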
For example, current limitations in World Wide Web technologies do not allow the extensibility, structure, and data checking necessary for large-scale commercial Web publishing. Jon Bosak's excellent paper "XML, Java and the Future of the Web" explains how XML can enable advanced Web applications, allowing Java applets to embed powerful, automatable data manipulation facilities directly into Web clients.
Unlike HTML, which has a fixed (though ever-changing) set of tags, XML lets you define your own tags and attributes. Support for XML by the Internet community would open up vast new possibilities for Internet publishing: instead of shoehorning all documents into HTML, or having to invent a browser to handle non-HTML documents, XML would enable a wide array of documents with user-defined tagsets to be handled by generic Web application software. As Tim Bray pointed out, "[XML allows us to] finally get off the HTML treadmill."
Presently, an author can create rich documents with an application, and then use a Java applet viewer to attach those documents to Web pages. As long as the browsers continue to provide only crude formatting, such measures are unfortunate but inevitable, much in the same way people use desktop publishing applications to get better typography than can be done with off-the-shelf word processors.
But there is no reason why the concept of a "basic Web page" needs to be limited to a single tag set! The appeal of the Web is its simple hypertext scheme, which provides a simple, unambiguous method of pointing to files with unique names. Although it is handy that HTML is also simple, the success of word processors has demonstrated that consumers can cope with multiple document types.
When XML becomes more widespread, Web authoring tools will become much more flexible in handling basic document constructs. WordPerfect and Word will export directly into XML, using the style names as tags instead of filtering everything into 90 (or however many currently exist) predefined tags.
In such a brave new World Wide Web, Java's role will be to do interesting things with the content, such as mediation between formats, computation and event handling, automation of tasks and dynamic content, presentation of different views to different viewers, and even intelligent filtering of content. As XML specification co-editor Tim Bray succinctly put it, "XML gives Java something to chew on."
If you are a complete novice and want to learn what XML provides that HTML does not, start with Dave Siegel's article, "The Web is Ruined and I Ruined It". I have also collected several other links for XML beginners.
Bert Bos' simple XML is a good place to find a few examples of XML files. He also describes the XML data model, which represents information content of an XML document as the linearization of a tree structure with several character strings at each node of the tree.
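One way to picture "linearization of a tree" is to build a tree in memory, write it out as text, and read it back. The sketch below does that with Python's xml.etree.ElementTree; I am reading "several character strings at each node" as the tag name, the attribute values, and the text, which is ElementTree's view and only my approximation of the model Bos describes. The address data is invented.

```python
import xml.etree.ElementTree as ET

# Build the tree node by node; every name and value here is invented.
root = ET.Element("address")
street = ET.SubElement(root, "street")
street.text = "1200 Example Blvd"
city = ET.SubElement(root, "city", {"state": "CA"})
city.text = "Pasadena"

# Linearize: the tree becomes a single stream of characters.
linear = ET.tostring(root, encoding="unicode")
print(linear)

# Parse the stream back and collect the strings attached to each node.
reparsed = ET.fromstring(linear)
strings_at_nodes = [
    (node.tag, node.text, dict(node.attrib)) for node in reparsed.iter()
]
print(strings_at_nodes)
```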
For answers to frequently asked questions about XML, see the overviews at Textuality and University College Cork. I have also collected several advanced XML links and XML-related issues and papers.
XML provides a standard way for information providers to add custom markup to information-rich documents, so that complex documents can be rendered (and published) in a dynamic way. Compare this with HTML: although HTML is easy to use, its simplicity significantly constrains how publishers and users can represent and use documents and databases. XML provides the means to publish and receive any information, regardless of format or origin, in any way desired.
With XML, users will be able to manage documents dynamically in a Web browser, allowing the presentation of personalized Web-based information. Content providers will be able to distribute structured databases that users can manipulate at will, due to several features XML (like SGML) provides that are not available in HTML:
The working draft for XML 1.0 provides a complete specification in two parts: the extensible markup language itself, and methods for associating hypertext linking and stylesheet mechanisms with XML. From this specification, we observe that expressive power, teachability, and ease of implementation were all major design considerations. And although XML is not backward-compatible with existing HTML documents, we note that documents that are HTML 3.2-compliant can easily be converted to XML, as can documents conforming to ISO 8879 (SGML) and documents generated from databases.
As far as XML acceptance is concerned, Microsoft has embraced XML for future releases of its Internet Explorer, and Netscape is considering doing the same for its browser. XML will most likely gain widespread acceptance once the power of XML-enabled applications is realized. Please see Jon Bosak's paper for detailed descriptions of some XML-enabling applications; according to Bosak, these driving applications will include:
Although scripts embedded in HTML documents may provide some solutions to these problems, XML does not commit authors to individual script languages, authoring tools, and delivery engines. Instead, XML offers an "open standards" solution that is vendor-independent, giving information providers greater control over their content and context.
We note that the current XML draft only addresses syntax, and consequently XML alone can at present only be used for interprocess communication and for the delivery of documents to specialized applications (or plug-ins) that have been configured to interpret a particular XML grammar. However, much more work on XML is on the way.
After writing "Webs of Archived Distributed Computations for Asynchronous Collaboration" for the Journal of Supercomputing with my coauthors in the Caltech Infospheres Project, my discussions with Rohit Khare led to my understanding that XML can provide useful mechanisms for archiving information, denoting machine-readable and person-readable descriptions of document contents, automating services, and mechanizing persistent transactions.
After perusing Rubinsky and Maloney's SGML and the Web, my hunch is even stronger that XML is the ultimate persistence format, simply by tagging everything properly so it fits into the larger schema of the Web. I may decide to use this idea in conjunction with infospheres sessions or elsewhere in the infospheres framework for my PhD work here at Caltech.
As discussed with Mark Baker in an epiphany or maybe two on the dist-obj list, XML can also be used for automation and self-description. Instead of just having Java view objects, business components can negotiate with an XML container for presentation and data exchange. Since any XML document can contain a description of its grammar, applications can use this description to perform structural validation. Plus, XML includes the ability to define new tags so that XML documents can become self-describing containers (so I can tag the data "Adam" as the value of an attribute FIRSTNAME for the tag <PERSONNAME>). Can you see how this is quite similar to WIDL both in design philosophy and actual implementation?
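Here is a small sketch of the PERSONNAME example as a self-describing container. The document carries its own grammar in an internal DTD subset that I wrote for illustration. Python's standard-library parser reads those declarations but does not validate against them, so the required-attribute check below is a hand-written stand-in (the REQUIRED_ATTRIBUTES table and missing_attributes helper are my own assumptions), not real DTD validation.

```python
import xml.etree.ElementTree as ET

# The document describes its own grammar in the internal DTD subset.
DOC = """<!DOCTYPE PERSONNAME [
  <!ELEMENT PERSONNAME EMPTY>
  <!ATTLIST PERSONNAME FIRSTNAME CDATA #REQUIRED>
]>
<PERSONNAME FIRSTNAME="Adam"/>"""

# Hand-written mirror of the ATTLIST rule above; an assumption for this
# sketch, since the standard-library parser is non-validating.
REQUIRED_ATTRIBUTES = {"PERSONNAME": ["FIRSTNAME"]}


def missing_attributes(element):
    """List 'TAG/@ATTR' for every required attribute that is absent."""
    problems = []
    for node in element.iter():
        for name in REQUIRED_ATTRIBUTES.get(node.tag, []):
            if name not in node.attrib:
                problems.append(f"{node.tag}/@{name}")
    return problems


person = ET.fromstring(DOC)
first_name = person.get("FIRSTNAME")
print(first_name, missing_attributes(person))

# A document that breaks the rule is still well-formed, but fails the check.
print(missing_attributes(ET.fromstring("<PERSONNAME/>")))
```

A validating parser would derive the rule from the DTD itself instead of from a separate table; the point here is only that the data and the description of its structure travel together.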
In short, XML provides a mechanism for streaming self-defining data containers into a browser, from an object... all using a simple document model! Furthermore, this simple document model is useful both to users of the Web, and to users of CORBA-compliant objects... which raises the question, what IS the difference between an "object" and a "document"? Turns out, not a whole lot...
Although most people still do not realize it, both the "Objects" community and the "Web" community share the same vision, including a somewhat similar architecture: Web servers are document factories, and likewise, object-request brokers are object factories. This was the very premise of the joint W3C/OMG workshop on distributed objects and mobile code in 1996, for which we provided a position paper on infospheres. Our paper, like many other papers for that workshop, held the firm belief that objects and Web documents are two sides of the same coin.
Combine this observation with the fact that A DOCUMENT CAN BE ANYTHING -- any bag of bits on the client-side and/or server-side -- from
Let's go a little further in exploring documents and objects. A document is a bag of bits and an explanation of how those bits work. So is an object. A document can have a specified interface. So can an object. A document can have specified behavior for itself and/or its components. So can an object. A document can provide semantic clues to what it does. So can an object. A document can be distributed. So can an object. A document can encapsulate other documents. An object can encapsulate other objects. A document can inherit its type from another document, as can an object. Document types can be polymorphic, as can object types.
A document can send a message to another document. A document can have a notion of an "active thread of control." A document can invoke a method on another document.
Sometimes browsers and other users of documents need to know about those documents to check the meaning of things, parse the contents, and use the documents effectively. Metadata is the solution to this problem, and I believe XML is a fine solution for metadata.
Jim Whitehead's paper "A Proposal for Web Metadata Operations" defines metadata to be "information about information... Information on the Web, known as Web resources, have many pieces of associated descriptive information which is often not explicitly represented in the resource itself. Examples of metadata include the creator of a resource, its subject, length, publisher, creation date, etc. Such descriptive metadata can be used to make information easier to locate by improving Web searches, rate information to protect children from indecent content (e.g. the Platform for Internet Content Selection (PICS)), capture copyright information, contain a digital signature, or store cataloging data. Many other uses are also possible."
Jim Whitehead goes on to declare that "Another type of metadata is the relationship. A relationship captures an association between two or more resources, and can be one to one, one to many, or many to many. Relationships can be used to capture navigational relationships, such as "go to this resource next," or a table of content, and can also express hierarchies (parent/child, successor/predecessor) Relationships have many domain-specific uses, such as a piece of software which has many "implements" relationships with a requirements document. Annotations are another use of relationships in which the relationship points to commentary material on the resource. The use of relationships to capture associations between data items is an old idea, stemming from semantic data modeling, and early hypertext work on the NLS and Xanadu systems."
Now, let's take this a level further. Metadata on the Web should be applicable to resources of any media type. Note that on today's Web, many resources are not HTML (to name a few examples, Java applets, Adobe Portable Document Format, and Adobe PostScript). Metadata on the Web should provide descriptive information about Web resources of any media type, including those that have no built-in provision for storing general-purpose metadata (and never will).
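As a sketch of what such media-independent metadata might look like when packaged in XML, the code below builds a record that lives outside the resource it describes. The element and attribute names (metadata, about, type, creator, subject, date), the URL, and the function name metadata_record are all invented for illustration; they do not come from any of the metadata proposals discussed here.

```python
import xml.etree.ElementTree as ET


def metadata_record(resource_uri, media_type, fields):
    """Build an external metadata record for a resource of any media type."""
    record = ET.Element("metadata", {"about": resource_uri, "type": media_type})
    for name, value in fields.items():
        ET.SubElement(record, name).text = value
    return record


# The resource is a PDF file, which the record never needs to open or parse.
record = metadata_record(
    "http://example.org/papers/report.pdf",
    "application/pdf",
    {"creator": "A. Author", "subject": "Web metadata", "date": "1997-06-01"},
)
record_text = ET.tostring(record, encoding="unicode")
print(record_text)
```

Because the record only points at the resource, the same approach works for an applet, a PostScript file, or anything else with a URI.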
Rohit Khare identified some overlap between some current metadata proposals:
While these metadata proposals are in some sense roughly related (they all use the word metadata), most of them are complementary technologies. For example, Dublin Core, MARC, and RFC 1807 (the Dienst bibliography format) are all bibliographic record formats, created (as you would hope) by researchers from the digital libraries community. These formats were NOT intended to solve the general-purpose Web metadata problem -- for example, none of these bibliographic record formats can effectively convey PICS-like rating information. On the other hand, PICS is not a good bibliographic record format. Thus, PICS and Dublin Core/MARC/RFC 1807 can peacefully coexist.
Jim Whitehead's "Proposal for Web Metadata Operations" gives a framework that explains the relationship between "large chunk" metadata proposals like PICS, PICS-NG, Dublin Core, MARC, Web Data, etc. and the "small chunk" HTTP extensions proposal. His proposal contains an extensive hyperlinked reference section, which makes it easy to track down the source material being described. The remainder of this section is quoted from Jim Whitehead himself off the FoRK mailing list.
Basically, the proposal extends the HTTP object model to create a new area for state storage within a resource, to be used for the storage of name/value metadata pairs. While there is no effective upper bound on the length of a metadata item (and hence you could make a name/value pair like "PICS-label", "{an instance of a PICS label}"), typically you'd want to create a link on the resource which points to the PICS label, which is itself stored as a separate resource. Methods are introduced to create name/value pairs (ADDMETA), delete name/value pairs (DELMETA), and to access name/value pairs (GETMETA). The GETMETA method is bundled with a simple s-expression-like search syntax, so if you want to get a listing of all the attributes on a resource you'd pass a search specification of (OR (AND (name "*")(value "*"))). Hypertext links are defined as a special type of metadata with some constraints on the format and semantics of the value of the link name/value pair (e.g., name="DAV:/Link", value="Type = {token} Source = {URI} Dest = {URI}").
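To get a feel for the name/value model, here is a toy in-memory sketch of my own; it is not part of the quoted proposal and says nothing about the actual HTTP wire format. The Resource class and its method names merely echo ADDMETA, DELMETA, and GETMETA, and shell-style glob matching stands in for the s-expression search syntax. The URLs and values are invented.

```python
from fnmatch import fnmatchcase


class Resource:
    """Hypothetical stand-in for a Web resource with a metadata area."""

    def __init__(self, uri):
        self.uri = uri
        self.meta = []  # list of (name, value) pairs

    def addmeta(self, name, value):
        """Attach one name/value pair to the resource."""
        self.meta.append((name, value))

    def delmeta(self, name, value):
        """Remove one pair; raises ValueError if it is not present."""
        self.meta.remove((name, value))

    def getmeta(self, name_pattern="*", value_pattern="*"):
        """Return pairs whose name AND value both match the glob patterns."""
        return [
            (n, v)
            for n, v in self.meta
            if fnmatchcase(n, name_pattern) and fnmatchcase(v, value_pattern)
        ]


def link_value(link_type, source, dest):
    """Format a link value following the template quoted in the text."""
    return f"Type = {link_type} Source = {source} Dest = {dest}"


page = Resource("http://example.org/index.html")
page.addmeta("Author", "A. Author")
page.addmeta(
    "DAV:/Link",
    link_value("PICS-label", page.uri, "http://example.org/index.label"),
)

all_pairs = page.getmeta("*", "*")  # rough analogue of name "*" AND value "*"
links = page.getmeta("DAV:/Link")   # only the link pairs
page.delmeta("Author", "A. Author")
print(all_pairs, links, page.getmeta())
```

Storing the PICS label as a separate resource and keeping only a link pair on the page follows the suggestion in the paragraph above: the metadata area stays small, and the label can be fetched like any other resource.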
The WebDAV proposal supports small chunk metadata and large chunk metadata. It doesn't address packaging issues, because there are already many proposals for how to package metadata. Far fewer proposals actually address how this metadata is stored and associated with the resources they describe. Because the proposal is implemented via HTTP, it also provides the ability to store metadata on resources of *any* content type, not just HTML. The WebDAV proposal describes "how" metadata is stored and associated, while efforts like Dublin Core, PICS-NG, Web Data, etc., describe "what" metadata is stored, and its packaging.
Thus the WebDAV proposal is complementary to packaging efforts such as Dublin Core, PICS-NG, Web Data, XML, Digital Signature manifests, and so on.
Regarding the Protocol Extension Protocol, PEP only describes extensions to HTTP that involve adding new headers to modify the semantics of existing methods. WebDAV is proposing to add several new methods to HTTP, and hence is outside the scope of PEP. This applies equally to methods like COPY and MOVE as well as to methods like GETMETA. As for HTTP purity, Roy Fielding was present at the meeting where they crafted the GETMETA method (he helped write the BNF for the search syntax), and there are few others who can claim the mantle of "HTTP purist" more effectively than he.
That aside, I still think XML has an excellent role to play in the evolution of metadata for the Web. Then again, I think trust on the Web is really important, too. It's definitely something to think about. Meanwhile, enjoy some of the links I've collected pointing to some of the best XML information I could find on the Web.