

XML - Extensible Markup Language

The World Wide Web (WWW) as we know it consists of computing nodes serving documents which contain information. These documents are stored in several data formats, most often in the ``Hyper Text Markup Language'' (HTML). Other data formats are proprietary and can only be read by specific applications. Many of these data formats describe presentation issues but most often do not deal with the meaning of the document. [DAC01] points out that this leads to information overload and poor content aggregation.

To overcome these problems, documents have to be structured and the meaning of the contained data has to be made explicit. One mechanism for doing this is markup. [XML1] states ``Markup is a method of conveying metadata (information about another dataset).'' All SGML-based languages use so-called ``tags'' to separate pieces of information into so-called ``elements''. These tags carry the metadata needed to describe the meaning of the elements.

One widely adopted markup language is the ``Hyper Text Markup Language'' (HTML), which is currently used for most of the documents available on the Internet. HTML contains a fixed set of tags that add formatting and presentation logic to documents, but it offers no way to define new tags, which would be required to add metadata other than formatting information.

At the end of the 1990s the World Wide Web Consortium (W3C) designed an extensible markup language called ``Extensible Markup Language'' (XML) that combines the flexibility of SGML with the widespread acceptance of HTML.

The basic structure of an XML document is best explained by a simple example as given in Listing 1. The first line is the XML declaration, which has the same syntax as a so-called ``processing instruction'' (PI) and provides information to the XML parser. In this case the parser is told that the document conforms to the XML 1.0 standard. The third line is a comment that is used for documentation purposes and is ignored by the parser.

The rest of the document consists of elements which are arranged in a ``1:n'' parent/child structure. The first element - the XML standard calls it the ``root element'' - opens with the tag currency_list and has an attribute which specifies its date. Its content consists of multiple currency elements, which also have attributes and contain other elements. XML documents can therefore be modeled as a tree, as shown in Figure 9.

\begin{lstlisting}[language=XML,caption={Simple XML Code},label=ex_xml_basic]
<?xml version='...
...ame>
<change>10.73175</change>
</currency>
</currency_list>
\end{lstlisting}

Figure 9: XML tree of the given example.
\begin{figure}\centering
\includegraphics[scale=0.7]{graphics/xml_tree.eps}\end{figure}
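
The parent/child structure described above can also be explored programmatically. The following is a minimal sketch in Python using the standard library module xml.etree.ElementTree. The sample document is hypothetical: it is only modeled on Listing 1, and its attribute names and all values are assumptions made for illustration, not the contents of the original listing.

```python
import xml.etree.ElementTree as ET

# Hypothetical document modeled on Listing 1. The attribute names
# ("date", "iso") and all values are assumptions for illustration.
SAMPLE = """<?xml version='1.0'?>
<!-- hypothetical currency list -->
<currency_list date="2006-01-01">
  <currency iso="AAA">
    <name>Currency A</name>
    <change>1.5</change>
  </currency>
  <currency iso="BBB">
    <name>Currency B</name>
    <change>2.5</change>
  </currency>
</currency_list>"""


def describe(element, depth=0):
    """Return one indented line per element, mirroring the tree structure."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(describe(child, depth + 1))
    return lines


# The parser returns the root element; children are reached by iteration.
root = ET.fromstring(SAMPLE)
tree_lines = describe(root)
print("\n".join(tree_lines))
print("date attribute of the root element:", root.get("date"))
```

Each level of indentation in the printed output corresponds to one level of the tree; the comment in the sample document does not appear because the default parser discards comments.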

XML has the following key features:

Extensible:
An important property of XML is that tags can be freely specified. This means that metadata of any kind can be added to XML documents.

Legible to humans / easy to create:
XML documents are normally generated and processed by software rather than written by hand (the component that reads them is called an ``XML parser''), but there are cases (e.g. debugging or testing) where the documents have to be edited by hand. XML has the advantage that it is legible to humans. Moreover, there are editors available that support syntax highlighting and structuring of the document, which makes the editing of XML documents very easy.

Verification of syntax and semantics:
XML documents can be checked by the XML parser for syntactic correctness and for semantic correctness (in the sense of conformance to a declared document structure). If the document is syntactically correct, it is called ``well formed'', which means that it obeys the syntax rules of the XML standard. To check the structure of a document, the user has to supply a description of the permitted structure and grammar to the XML parser, which then uses it for validation; a document that passes this check is called ``valid''. Two formal description languages are available for this task, namely the ``Document Type Definition'' (DTD), which is part of the XML standard, and ``XML Schema'', which is explained in section 2.2.
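
As an illustration of the well-formedness check, the following minimal Python sketch relies on the fact that the standard library parser raises an error for input that is not well formed. The helper name is_well_formed and the two sample strings are assumptions made for this example. Note that this parser only checks well-formedness; it does not validate against a DTD or an XML Schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Hypothetical samples: the second one closes its tags in the wrong order.
good = "<currency><change>10.73175</change></currency>"
bad = "<currency><change>10.73175</currency></change>"

print(is_well_formed(good))
print(is_well_formed(bad))
```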

Namespaces:
When structures with different vocabularies are combined into one document, naming conflicts can occur. XML solves this problem with the use of so-called namespaces. [BIR01] describes namespaces like this: ``An XML Namespace is simply a group of names, usually with a related purpose or context, where the group has a globally unique name (the ``namespace name''). This is often ensured by using a domain name (from Internet DNS) as the first part of the namespace name.'' The namespace concept is best described with a simple example, given in Listing 2. Here two namespaces are defined which refer to two different vocabularies. Both vocabularies contain the element name iso, but with a different meaning in each. With the namespace prefixes cur: and cnt: the two elements can be distinguished.

\begin{lstlisting}[language=XML,caption={XML Namespace Example},label=ex_xml_ns]
...
<cur:cu...
...:iso>USD</cur:iso>
<cnt:iso>US</cnt:iso>
</cur:currency>
...
\end{lstlisting}
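
The following minimal Python sketch shows how a namespace-aware parser tells the two iso elements apart. It is only a sketch under stated assumptions: the namespace names (the example.org URIs) and the placement of the namespace declarations on the enclosing element are hypothetical, since those parts of Listing 2 are not reproduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace names; only the prefixes cur: and cnt: come from
# the text, the URIs are assumptions for illustration.
NS = {
    "cur": "http://example.org/currency",
    "cnt": "http://example.org/country",
}

SAMPLE = """<cur:currency xmlns:cur="http://example.org/currency"
              xmlns:cnt="http://example.org/country">
  <cur:iso>USD</cur:iso>
  <cnt:iso>US</cnt:iso>
</cur:currency>"""

root = ET.fromstring(SAMPLE)

# The prefix is only shorthand; the parser identifies each element by the
# combination of namespace name and local name.
currency_code = root.find("cur:iso", NS).text
country_code = root.find("cnt:iso", NS).text

print(root.tag)
print(currency_code, country_code)
```

Internally the parser stores each tag together with its namespace name, so the two iso elements never collide even though their local names are identical.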

XML has gained broad acceptance and is used in various fields. Many data formats, languages and protocols are XML-based, and it is expected that XML will replace various other data formats. The protocols and languages described in the next sections are all XML-based.


Hermann Himmelbauer 2006-09-27