eXtensible Markup Language:
The Least You Need To Know
Rohit Khare
Adam Rifkin
4K Consulting
Who We Are
Rohit Khare
UC Irvine, MCI Telecommunication,
World Wide Web Consortium/MIT, Caltech
rohit@uci.edu
Adam Rifkin
Caltech, Hewlett-Packard, Rome Labs (USAF), NASA/Langley, College of William and Mary
adam at xent dot com
eXtensible Markup Language:
The Least You Need To Know
The Origin of Document Species
The Ascent of XML
XML: Specifications
XML-LANG, XML-LINK, CSS & XS Styles
XML: Applications
Why Every Bit You Write Should Be XML
Examples...
The Origin of (Document) Species
To be presented at
WWW7, Brisbane, Australia
The Origin of (Document) Species
The Document Ecology
Evolutionary Adaptations of:
Syntax SGML
Style CSS/XSL
Structure HTML
Semantics XML
The Fossil Record
The Document Ecology
The World Wide Web is the "universe of network-accessible information" [Tim Berners-Lee, 1996]
Openness and Content-Neutrality of Documents
HTTP can adapt to any document format
URL can represent links to any document format, from within many
"Natural Selection" Favors a Few Document Formats
Preferential adoption of SGML, CSS, HTML, and now XML
Each embodies the evolutionary strategy of parsimony
Evolution: Capture Info --> Represent Knowledge
Can leverage Web reflexively to capture structure and semantics
XML-based document formats represent an ecosystem of interdependent (rather than competing) document "species"
Evolution: Syntax
Issues of Concrete Representation
Binary (machine) vs. Text (human) formats
Mission-specific vs. Generic formats
Context-free vs. Turing-complete formats
From Turing-Complete to Declarative
Evolution: Style
Externalized Formatting over Embedded Directives
{\keepn\par\pard\sb240\sl-264 \b1\hyppar0 Warning: do not …}
<P CLASS="WARNING"> …
WARNING { font : bold } ...
<?XML-stylesheet TYPE="text/css" HREF="warning.css"?> <WARNING> …
WARNING { font: bold } ...
Cascading to eXtensible Style Sheets
Rendering: displays, Braille, audio, ...
From Inline Formats to Style Sheets
Evolution: Structure
Anatomy of a Newspaper Article
Logical: headline, byline, body, footer, …
Descriptive: bold, italic, indented, …
Declarative: title, address, keyboard-input, …
Custom declarative: <dateline>, <byline>, ...
Automatic Information Collection
Resource Discovery: For providing useful hints to search engines
Classifying: For cataloguing information content and relationships
Content Rating: For aggregation and filtering
Knowledge Codifying, Sharing, and Exchanging: For processing
From Presentational to Declarative
Evolution: Semantics
How well does the document support the potential uses of its contents?
Scenario: To-Do List Manager
Natural-language Scratchpad
HTML Definition List (Datebook)
XML <DEADLINE AT="iso-date"> Element
Composable, Networked DTDs
Reuse by Linking
From Operational to Well-defined
Citation Encodings (1/2)
Presentational Text:
"XML, Java, and the Future of the Web", Jon Bosak, World Wide Web Journal, 2(4):219--228, (1997)
J. Bosak, World Wide Web Journal, "XML, Java, and the Future of the Web", Autumn 1997, Vol. 2, No. 4, pp. 219--228.
Presentational HTML:
<I>XML, Java, and the Future of the Web</I>,<I>World Wide Web Journal</I>,<TT>Jon Bosak</TT>,<B>2(4):219-228</B>, <I>1997</I>.
Structural HTML:
<CITE>XML, Java, and the Future of the Web</CITE>
<H3>World Wide Web Journal</H3>
<H4>Jon Bosak</H4>
<UL> <LI> 2(4):219-228
<LI> 1997
</UL>
Citation Encodings (2/2)
Customized XML:
<BIB><TITLE>XML, Java, and the Future of the Web</TITLE>
<JOURNAL>World Wide Web Journal</JOURNAL>
<AUTHOR> <FIRSTNAME>Jon</FIRSTNAME> <LASTNAME>Bosak</LASTNAME></AUTHOR>
<VOLUME>2</VOLUME> <NUMBER>4</NUMBER> <YEAR>1997</YEAR> <PAGES>219-228</PAGES> </BIB>
XML Metadata Format, using DTDs from the Web:
<?namespace href="http://library.org/bibliography-info" as="BIB"?>
<?namespace href="http://w3.org/schemas/rdf-schema" as="RDF"?>
<?namespace href="http://oclc.org/DublinCore/RDFschema" as="DC"?>
<RDF:serialization> <RDF:assertions href="http://assertions.org/bib-doc">
<BIB:TITLE href="http://sunsite.unc.edu/pub/sun-info/standards/xml/why/xmlapps.htm">
XML, Java, and the Future of the Web </BIB:TITLE>
<BIB:JOURNAL href="http://www.w3j.com/">World Wide Web Journal </BIB:JOURNAL>
<DC:Creator>Jon Bosak</DC:Creator>
<BIB:VOLUME>2</BIB:VOLUME> <BIB:NUMBER>4</BIB:NUMBER>
<BIB:YEAR>1997</BIB:YEAR> <BIB:PAGES>219-228</BIB:PAGES>
</RDF:assertions></RDF:serialization>
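A minimal sketch of why the customized XML encoding pays off: once title, author, and volume are tagged as such, a program can regenerate any presentational form on demand. The example below uses Python's standard xml.etree.ElementTree on the <BIB> record shown above; the function name format_citation and the output layout are illustrative assumptions, not part of the original slides.

```python
import xml.etree.ElementTree as ET

# The customized <BIB> record from the slide, copied as-is.
BIB_XML = """<BIB><TITLE>XML, Java, and the Future of the Web</TITLE>
<JOURNAL>World Wide Web Journal</JOURNAL>
<AUTHOR> <FIRSTNAME>Jon</FIRSTNAME> <LASTNAME>Bosak</LASTNAME></AUTHOR>
<VOLUME>2</VOLUME> <NUMBER>4</NUMBER> <YEAR>1997</YEAR> <PAGES>219-228</PAGES> </BIB>"""


def format_citation(bib_xml):
    """Rebuild one presentational citation style from the semantic markup."""
    bib = ET.fromstring(bib_xml)
    title = bib.findtext("TITLE")
    author = " ".join([bib.findtext("AUTHOR/FIRSTNAME"),
                       bib.findtext("AUTHOR/LASTNAME")])
    journal = bib.findtext("JOURNAL")
    issue = "{}({}):{}".format(bib.findtext("VOLUME"),
                               bib.findtext("NUMBER"),
                               bib.findtext("PAGES"))
    year = bib.findtext("YEAR")
    return '"{}", {}, {}, {}, ({})'.format(title, author, journal, issue, year)


print(format_citation(BIB_XML))
```

Going the other way, from either presentational text form back to these fields, would need guesswork about punctuation and ordering.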
The Fossil Record
The Ascent of XML
Joint with Dan Connolly
Fall 97 Web Journal
World-Wide Markup Language
"The HyperText Markup Language is an SGML format." [Tim Berners-Lee, 1991, "About HTML"]
Standard, Open Document Formats since the 1960s (GCA’s GenCode, IBM’s GML)
SGML (ISO 8879:1986) standardized DTDs, doc validity proofs, complex structuring, extensions
1990: Berners-Lee picked tags from a sample SGML DTD and added "killer feature": links
World-Wide HTML tagset embodies 80/20 rule
XML allows Community-Wide Markup Languages,
combining SGML’s power with HTML’s simplicity
Mission of XML
XML was designed to provide an easy-to-write, easy-to-interpret, and easy-to-implement subset of SGML. It was not designed to provide a "one Markup Language fits all" DTD, or a separate DTD for every tag. It was designed so that certain groups could create their own particular markup languages that meet their needs more quickly, efficiently, and (IMO) logically. It was designed to put an end once and for all to the tag-soup wars propagated by Microsoft and Netscape.
---- Jim Cape, June 3, 1997, in a post to
comp.infosystems.www.authoring.html
Community-Wide Markup Languages
Community size is inversely proportional to shared context
Millions agree that <B> means bold, but 2/6/98 reflects local culture
XML decentralizes control of specialized markup languages,
making it cost-effective to capture community ontologies
HTML is not unilaterally extensible (new tags potentially have ambiguous syntax, style, and semantics)
XML is a strict (but simplified) subset of SGML, offering:
Extensibility -- can define new elements, containers, attribute names
Structure -- a DTD can constrain the information model of a document
Validation -- every document can be validated against its DTD; well-formedness alone only establishes that the markup is structurally sound
XML includes extensible linking and style formatting also
"Node content must be left free to evolve." [Tim Berners-Lee, 1991, "About Document Formats"]
Evolution of a Standard
W3C Role has Evolved
First guerillas (Generic SGML), then group
Now a movement (GCA, XML ‘98 in March)
Fall ‘97 WWWJ - http://www.w3j.com/xml/
W3C Site is EXTENSIVE
Timeline, Publications, WG Notes, Bookmarks
http://www.w3.org/XML/
List of All Decisions Ever Made, c. Dec ‘97
http://www.w3.org/XML/9712-reports.html
Other Useful XML Links
Our XML Links and Papers Page
http://www.ifindkarma.com/attic/local/xml.html
Robin Cover's Extensive XML Page
http://www.sil.org/sgml/xml.html
Peter Flynn's XML FAQ
http://www.ucc.ie/xml/
XML-Dev Jewels
http://www.vsms.nottingham.ac.uk/vsms/xml/jewels.html
Future Evolution of XML
Coevolution of HTML and XML
All XML is SGML
All HTML is SGML
But not all HTML is XML...
XML profiles SGML:
No Markup Minimization
No Optional Features
Technical Corrigendum harmonizes SGML86
XML: Specifications
Details, Details, Details…
XML-LANG
Self-description with Document Type Definition (DTD):
Structural rules of the document’s markup
External entities, internal entities, non-XML resource entities
Non-Minimization (All Markup is Explicit)
XML Processor - customizable info structures
Most XML users will not know they are using a DTD!
XML-aware software works with custom XML apps without help
Document Validity = Conforms to DTD
Document well-formedness = structurally sound, but no DTD
Required Markup Declaration (RMD) indicates need for DTD
Well-formedness for delivery format, validity for authoring
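A small sketch of the well-formedness half of that distinction, assuming Python's standard xml.etree.ElementTree as the XML processor. That parser is non-validating: it reports structural errors but never checks a document against a DTD, so this shows only the "delivery format" test. The helper name is_well_formed is an assumption for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the text parses as well-formed XML (no DTD check)."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Properly nested, with an explicit empty tag: well-formed.
print(is_well_formed("<joke>Take my XML. Please.<applause/></joke>"))
# <applause> is never closed, so the end-tags do not match: not well-formed.
print(is_well_formed("<joke>Take my XML. Please.<applause></joke>"))
# Attribute values must be quoted: not well-formed.
print(is_well_formed("<warning class=emergency/>"))
```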
XML-LANG: Markup Types
Elements - start-tags, end-tags, empty tags
<joke>Take my XML. Please.<applause/></joke>
Attributes - name-value pairs in tags
<warning class="emergency">
Entity References - spec chars, macros, external
Comments - not passed along to application
<!-- Not allowed two dashes before end of comment. -->
Processing Instructions - passed along to app
<?pi-target-name pi-data?>
CDATA Sections - parser ignores markup
<![CDATA[ *p = &q; b = (i <= 3); ]]>
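To make the markup types concrete, here is a hedged sketch of what an application sees after parsing, again assuming Python's xml.etree.ElementTree. The sample document combines the slide's own fragments; the summarize helper is hypothetical. Note that dropping comments is the default behavior of this particular library's tree builder.

```python
import xml.etree.ElementTree as ET

# Element, attribute, CDATA section, empty tag, and comment in one document.
DOC = ('<joke class="emergency">Take my XML. <![CDATA[b = (i <= 3);]]>'
       '<applause/><!-- rimshot --></joke>')


def summarize(doc):
    """Report what the parser hands to the application."""
    root = ET.fromstring(doc)
    return {
        "element": root.tag,
        "attributes": dict(root.attrib),
        # CDATA content arrives as ordinary character data; "<=" is not markup.
        "text": root.text,
        # The comment is not passed along, so only <applause/> is a child.
        "children": [child.tag for child in root],
    }


print(summarize(DOC))
```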
XML-LANG: DTDs (1/2)
Element Declarations - name and content model
<!ELEMENT email (from, to+, cc*, subject, body, sig?) >
<!ELEMENT body (#PCDATA | image)* >
<!ELEMENT image EMPTY >
<!ELEMENT node (desc, node*)>
<!ELEMENT desc (#PCDATA)>
Attribute Declarations - which elements may have what attributes, default and possible values
<!ATTLIST joke
name ID #REQUIRED
label CDATA #IMPLIED
status ( funny | notfunny ) 'funny' >
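One way to read a content model such as (from, to+, cc*, subject, body, sig?) is as a regular expression over child element names. The sketch below hand-translates that one model into Python's re syntax; it is an illustration of the idea under that assumption, not how any particular validating parser works, and matches_email_model is a hypothetical name.

```python
import re
from typing import Sequence

# (from, to+, cc*, subject, body, sig?) over space-terminated child names:
#   ","  becomes sequence, "+" one-or-more, "*" zero-or-more, "?" optional.
EMAIL_MODEL = re.compile(r"from (to )+(cc )*subject body (sig )?")


def matches_email_model(children: Sequence[str]) -> bool:
    """Check a list of child element names against the email content model."""
    encoded = "".join(name + " " for name in children)
    return EMAIL_MODEL.fullmatch(encoded) is not None


print(matches_email_model(["from", "to", "to", "subject", "body"]))
print(matches_email_model(["from", "subject", "body"]))  # no <to>: rejected
```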
XML-LANG: DTDs (2/2)
Attribute Declarations - name, type, default
Attribute types: CDATA, ID, IDREF/IDREFS, ENTITY/ENTITIES, NMTOKEN/NMTOKENS, a list of names
Default values: #REQUIRED, #IMPLIED, "value", #FIXED "value"
Entity Declarations - associate name with chunk
Internal entities: &lt; &gt; &amp; &apos; &quot;
or locally defined: &fork; <!ENTITY fork "4K Consulting">
External entities: <!ENTITY forkfooter SYSTEM "fork/footer.xml">
<!ENTITY pic SYSTEM "http://www.xent.com/4k.gif" NDATA GIF87A>
Parameter entities: <!ENTITY % html.ver "-//W3C//DTD HTML 4.0//EN">
Notation Declarations - external binary data
<!NOTATION GIF87A SYSTEM "GIF">
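A short sketch of internal entity expansion, assuming Python's xml.etree.ElementTree: the predefined entities and an entity declared in the document's internal DTD subset are both replaced before the application sees the text. The <sig> element and the expand_text helper are assumptions made for the example; the fork entity is the one declared above.

```python
import xml.etree.ElementTree as ET

# The internal subset declares the "fork" entity from the slide.
DOC = """<!DOCTYPE sig [
  <!ENTITY fork "4K Consulting">
]>
<sig>&fork; &lt;adam at xent dot com&gt;</sig>"""


def expand_text(doc):
    """Return the root element's text with entity references expanded."""
    return ET.fromstring(doc).text


print(expand_text(DOC))
```

External entities (SYSTEM identifiers) are a different matter; this parser does not fetch them by default.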
XML-LINK
Start with HTML, add HyTime and TEI concepts
Simple links point to a single target resource
<A XML-LINK="SIMPLE" HREF="http://www.w3.org/XML">
URL schemes: ftp, http, file, mailto, telnet, nntp, ...
Links can have roles (machine-processible) and human-readable labels associated with them
Can specify default behavior of a link
SHOW - embed in current context, replace it, or start new one
ACTUATE - user must take action (or not) before anything done
Locators - # separates resource name and part id
Connectors - | to return only relevant element
XML-LINK: Advanced
Extended links - can be multidirectional, need not live in the resources they point to, link groups
<related-term-group>Hamlet
<related-term HREF="Othello.xml"/>
<related-term HREF="KingLear.xml"/>
<related-term HREF="Macbeth.xml"/>
</related-term-group>
<!ELEMENT related-term-group (#PCDATA | related-term)* >
<!ATTLIST related-term-group
XML-LINK CDATA #FIXED "EXTENDED"
INLINE CDATA #FIXED "TRUE"
CONTENT-ROLE CDATA #FIXED "RT"
… >
Extended Pointers - locate resource by traversing the element tree of its containing document
XPTRs allow links without modifying the containing document
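The tree-traversal idea can be sketched without touching the target document. The slides do not spell out XPTR syntax, so the example below substitutes the limited XPath subset built into Python's xml.etree.ElementTree to pick the n-th related-term child of the Hamlet group shown earlier; nth_related_term is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# The extended-link group from the slide, unmodified.
GROUP = """<related-term-group>Hamlet
<related-term HREF="Othello.xml"/>
<related-term HREF="KingLear.xml"/>
<related-term HREF="Macbeth.xml"/>
</related-term-group>"""


def nth_related_term(group_xml, n):
    """Locate the n-th (1-based) related-term child by position alone."""
    root = ET.fromstring(group_xml)
    node = root.find("related-term[{}]".format(n))
    return None if node is None else node.get("HREF")


print(nth_related_term(GROUP, 2))
```

Because the address is positional, the target needs no ID attribute or anchor of its own.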
XML-STYLE
CSS1 - rules in terms of element names, IDs, …
XS - based on DSSSL (ISO/IEC 10179:1996)
Formatting specification derived from active style sheets
With XML doc structure, create flow objects (paragraphs, tables, ...) with characteristics (font-name, font-size, …)
Merging, flow object tree determines document layout
Scheme - core expression (math) language in XS
Construction rules declare element style
font-size: 18pt, first-line-start-indent: 20pt, quadding: left
Typically, formatting declared as functions
(element CODE (UNDERLINED-PHRASE))
XML: Applications
July 1997 IEEE Internet Computing
Fall 1997 Web Journal
Data Archaeology
Meishi, or Business Cards
Different shapes, sizes, scripts, demarcations
Two sided, magnetic, photos, public keys
Airline Passenger Name Records
"NQSS5A" means something to airline
Must be manipulated throughout its lifetime
Need stable data format, stable grain of exchange, and common definitions
Every Bit in XML
Across Time
Save Self-Description with Data State
"Future Proofing" of Documents and Data
Across Space
For Exchange as well as Storage of Data
Across Organizations
Building Community-Specific Ontologies
Key to Knowledge Representation
Across Time
Tensions Create Brittle Data Formats
Inertia, Efficiency, Tools, and Context
Future-Proofing Strategies
Machine-Readable - for parsers & generators
Human-Readable - robustness and simplicity
Self-Descriptive - for extraction and validation
XML as Basis to Execute Strategies
e.g., Capturing Database Schema as DTDs
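As a rough sketch of capturing a database schema as a DTD, the Python function below emits one element declaration for a table row and one #PCDATA element per column. The table and column names are hypothetical, and a real mapping would also have to address types, keys, and nullability, which this ignores.

```python
def table_to_dtd(table, columns):
    """Emit element declarations describing one row of a flat table."""
    lines = ["<!ELEMENT {} ({})>".format(table, ", ".join(columns))]
    lines += ["<!ELEMENT {} (#PCDATA)>".format(col) for col in columns]
    return "\n".join(lines)


# Hypothetical passenger-record table, for illustration only.
print(table_to_dtd("passenger", ["name", "flight", "seat"]))
```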
Across Space
Tradeoffs of Marshaling Data
Distributed Systems with Centralized Data
Need: Decentralized, Isolated Subsystems
Strategy: Defer Marshaling Decisions
The Web’s Lesson: Link Instead
Download Networked Resources as Needed
Leveraging the XML-LINK Model
SHOW and ACTUATE; XPTR identifiers
Across Cultures
Challenges to Collaboration
Organizations Defined by their Language
Ontological Problem: Matching Vocabulary
Strategy: Use Documents
"Put It In Writing" Defining Common Terms
Popular Ontologies Can Emerge Organically
Solution: XML-enhanced Documents
Let a Thousand DTDs Bloom on the Web
XML Example Applications
Mathematical Markup Language
Chemical Markup Language
Microsoft’s CDF, Netscape’s MCF
Web Interface Definition Language (WIDL) - Automation
Resource Description Framework (RDF) - Composition of schema
Some Software Tools
JSXML - Java Object Stream to XML packages
http://www.camb.opengroup.org/~laforge/jsxml/
Lark - a non-validating XML parser in Java
http://www.textuality.com/Lark/
MSXML - Microsoft’s XML parser in Java
http://www.microsoft.com/xml/
NXP - a validating parser in Java
http://www.edu.uni-klu.ac.at/~nmikula/NXP/
Many, many others
http://www.ifindkarma.com/attic/local/xml.html
Anatomy of a DTD
From Richard Light,
Presenting XML
Anatomy of a Memo DTD
Creation of Linkable Memos
Name of sender and recipients, date sent, main heading, optional subject keywords, text
Text can contain sections/subsections with optional headings, paragraphs, lists, images and diagrams, quotations from other memos
<!ELEMENT memo (header, links, text)>
is a good start for the DTD, but then what?
Adding Header Declaration
<!-- Parameter entity declarations for later use: -->
<!ENTITY % common-elements "list | image | quote | anchor">
<!ENTITY % a.global "ID ID #IMPLIED">
<!-- Header information: -->
<!ELEMENT header (sender+, recipient+, date, main-heading, subject*)>
<!ATTLIST header %a.global;>
<!ELEMENT sender (#PCDATA)>
<!ELEMENT recipient (#PCDATA)>
<!ATTLIST recipient %a.global;>
<!ELEMENT date (#PCDATA)>
<!ATTLIST date %a.global;
SORTFORM CDATA #IMPLIED>
<!ELEMENT main-heading (#PCDATA)>
<!ELEMENT subject (#PCDATA)>
Elements within the Text
<!-- Elements within the text of the memo: -->
<!ELEMENT text (section | p | %common-elements;)+>
<!ATTLIST text %a.global;>
<!ELEMENT section (head?, (section | p | %common-elements;)+)>
<!ATTLIST section %a.global;>
<!ELEMENT p (#PCDATA | %common-elements;)*>
<!ATTLIST p %a.global;>
<!ELEMENT list (item)+>
<!ATTLIST list %a.global;
TYPE (bullet|number|plain) "bullet">
<!ELEMENT item (#PCDATA)>
<!ATTLIST item %a.global;>
<!ELEMENT image EMPTY>
<!ATTLIST image %a.global;
SRC ENTITY #REQUIRED
DESC CDATA #IMPLIED
ALIGN (TOP|MIDDLE|BOTTOM) "MIDDLE">
<!ELEMENT anchor (#PCDATA)>
<!ATTLIST anchor %a.global;>
<!-- Notation declarations: -->
<!NOTATION GIF SYSTEM "http://www.viewers.org/gview.exe">
<!NOTATION JPEG SYSTEM "http://www.viewers.org/jview.exe">
Quotation Links: Simple to Extended
<!ELEMENT quote EMPTY>
<!ATTLIST quote
XML-LINK CDATA #FIXED "simple"
ROLE CDATA #IMPLIED
HREF CDATA #REQUIRED
TITLE CDATA #IMPLIED
SHOW (EMBED|REPLACE|NEW) "EMBED"
ACTUATE (AUTO|USER) "AUTO"
BEHAVIOR CDATA #IMPLIED>
<!ELEMENT quotation (citation, source+)>
<!ATTLIST quotation
XML-LINK CDATA #FIXED "extended"
ROLE CDATA #IMPLIED
TITLE CDATA #IMPLIED
SHOW (EMBED|REPLACE|NEW) "REPLACE"
ACTUATE (AUTO|USER) "USER"
BEHAVIOR CDATA #IMPLIED>
Out-of-Line Links
<!ELEMENT citation EMPTY>
<!ATTLIST citation
XML-LINK CDATA #FIXED "LOCATOR"
ROLE CDATA #FIXED "citation"
HREF CDATA #REQUIRED
TITLE CDATA #IMPLIED
SHOW (EMBED|REPLACE|NEW) "EMBED"
ACTUATE (AUTO|USER) "AUTO"
BEHAVIOR CDATA #IMPLIED>
<!ELEMENT source EMPTY>
<!ATTLIST source
XML-LINK CDATA #FIXED "LOCATOR"
ROLE CDATA #FIXED "source"
HREF CDATA #REQUIRED
TITLE CDATA #IMPLIED
SHOW (EMBED|REPLACE|NEW) "REPLACE"
ACTUATE (AUTO|USER) "USER"
BEHAVIOR CDATA #IMPLIED>
Linking Example: A Link Farm
DTD with quotation, source, citation, and the following links cover element:
<!ELEMENT links (quotation)*>
<?XML version="1.0"?>
<!DOCTYPE links SYSTEM "links.dtd">
<links>
<quotation>
<citation HREF="memoex2.xml#a1"/>
<source HREF="memoex1.xml#p1"/>
</quotation>
<quotation>
<citation HREF="memoex2.xml#a2"/>
<source HREF="memoex1.xml#p7"/>
</quotation>
</links>
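A minimal sketch of how an application might read the link farm: each HREF is split at "#" into a resource name and a part id, giving (document, id) pairs that tie a citation to its sources. It assumes Python's xml.etree.ElementTree and urllib.parse.urldefrag; the XML declaration and DOCTYPE are left out of the sample string, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urldefrag

# Body of the link farm from the slide (declaration and DOCTYPE omitted).
LINK_FARM = """<links>
<quotation>
<citation HREF="memoex2.xml#a1"/>
<source HREF="memoex1.xml#p1"/>
</quotation>
<quotation>
<citation HREF="memoex2.xml#a2"/>
<source HREF="memoex1.xml#p7"/>
</quotation>
</links>"""


def split_locator(href):
    """Split 'resource#part' into (resource name, part id)."""
    url, fragment = urldefrag(href)
    return (url, fragment)


def resolve_quotations(farm_xml):
    """Return one (citation, [sources]) pair per quotation element."""
    farm = ET.fromstring(farm_xml)
    pairs = []
    for quotation in farm.findall("quotation"):
        citation = split_locator(quotation.find("citation").get("HREF"))
        sources = [split_locator(s.get("HREF"))
                   for s in quotation.findall("source")]
        pairs.append((citation, sources))
    return pairs


for citation, sources in resolve_quotations(LINK_FARM):
    print(citation, "quotes", sources)
```

Neither memo has to be edited to record these relationships, which is the point of keeping the links out of line.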
Making the Memo DTD Links-Aware
In the memo DTD, we need an extended link group with a single, fixed document locator pointing to the link farm.
<!-- Fixed GROUP and DOCUMENT elements to reference link farm: -->
<!ELEMENT links (memolinks)>
<!ATTLIST links
%a.global;
XML-LINK CDATA #FIXED "group">
<!ELEMENT memolinks EMPTY>
<!ATTLIST memolinks
XML-LINK CDATA #FIXED "document"
HREF CDATA #FIXED "memo-links.xml">
Sample Memo XML-file
<?XML version="1.0"?>
<!DOCTYPE memo SYSTEM "memo.dtd">
<memo>
<header>
<sender>Adam Rifkin</sender>
<recipient>Rohit Khare</recipient>
<date SORTFORM="19980114">January 14, 1998</date>
<main-heading>XML: The Least You Need to Know</main-heading>
</header>
<links><memolinks/></links>
<text>
<p ID="p1"><anchor ID="a1"></anchor>I’m almost finished with the slides.</p>
<p ID="p2"><anchor ID="a2"></anchor>Call me at (626) 449-4123 in an hour.</p>
</text></memo>
Adding Style
<!doctype style-sheet PUBLIC "-//James Clark//DTD DSSSL Style Sheet//EN">
(define memo-style (style font-family-name: "Arial"))
(define para-style (style line-spacing: 13pt font-size: 12pt))
(define header-line-style (style line-spacing: 17pt font-size: 16pt))
(define heading-style (style font-size: 24pt line-spacing: 26pt quadding: 'center font-weight: 'bold))
(define (para-with-prefix prefix)
(make paragraph use: header-line-style
(make line-field field-width: 50pt (literal prefix)) (process-children-trim)))
(default (process-children-trim))
(element MEMO (make sequence use: memo-style (process-children-trim)))
(element P (make paragraph use: para-style space-before: 10pt))
(element main-heading (make paragraph use: heading-style space-before: 20pt))
(element sender (para-with-prefix "From:"))
(element recipient (para-with-prefix "To:"))
(element date (para-with-prefix "Date:"))
(element links (empty-sosofo))
(element anchor (make paragraph (literal "[fragment of original message was here]")))
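The shape of these construction rules (a rule per element name, plus a default) can be mimicked in a few lines of Python. This is a plain-text analogue under stated assumptions, not DSSSL: it only reproduces the "From:", "To:", and "Date:" prefixes for the sample memo's header, with attributes omitted, and the names PREFIX_RULES and render_header are hypothetical.

```python
import xml.etree.ElementTree as ET

# Header of the sample memo, attributes omitted.
HEADER = """<header>
<sender>Adam Rifkin</sender>
<recipient>Rohit Khare</recipient>
<date>January 14, 1998</date>
<main-heading>XML: The Least You Need to Know</main-heading>
</header>"""

# One rule per element name, mirroring the para-with-prefix rules.
PREFIX_RULES = {"sender": "From:", "recipient": "To:", "date": "Date:"}


def render_header(header_xml):
    """Apply the prefix rule for each child; the default rule emits bare text."""
    lines = []
    for child in ET.fromstring(header_xml):
        prefix = PREFIX_RULES.get(child.tag)
        text = (child.text or "").strip()
        lines.append("{} {}".format(prefix, text) if prefix else text)
    return lines


print("\n".join(render_header(HEADER)))
```

The memo itself carries no formatting; swapping PREFIX_RULES changes the rendering without touching the document.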
XML Editors and Browsers
Useful XML Editor Features
Data validation (with external database)
Insertion of current date on request
Automatic assignment of unique ID to each element
Automatic insertion of elements and attributes required by DTD
Automatic insertion of element data when "reply" button hit
Creating out-of-line links in a user-friendly manner
Useful XML Browser Features
Support for Xpointers (including traversal)
Built-in Support for XS style sheets
Rendering XML fragments embedded in other XML documents
The Philosophy of XML
Future of DTD’s: Maybe…
XML Data - XML all the way down
Metadata collapses into Meatdata
DTD’s become more composable
Legally cascadable DTD’s: RDF
"XML gives Java something to do." [Jon Bosak, 1996]
DOM, XAPI-J, and XML Typing, to name a few
XML: Recommended Reading
XML Books:
Richard Light, Presenting XML, Sams.Net, August 1997.
Web Journal - XML issue, O’Reilly, Fall 1997, http://www.w3j.com/xml/
XML Papers from Khare/Rifkin:
X Marks the Spot: An Introduction to XML http://www.ifindkarma.com/attic/papers/xml/x-marks-the-spot.html
The Ascent of XML (with Dan Connolly) http://www.ifindkarma.com/attic/papers/xml/ascent-of-xml.html
Capturing the State of Distributed Systems with XML http://www.ifindkarma.com/attic/papers/xml/xml-for-archiving.html
The Origin of (Document) Species
http://www.ifindkarma.com/attic/papers/www/origin-of-species.html
XML - World Wide Web Links Galore
http://www.ifindkarma.com/attic/local/xml.html