What is XML, and why should you care?
by ZetaGecko
When I first heard people calling XML the next big thing some years back, I gave it a quick look and didn't get it at all. As far as I could see, it was something that looked like HTML, except that the tag names could be anything the publisher wanted and browsers didn't know how the heck to display it. I saw no use for it and moved on with my life. It wasn't until I started working with RSS (which is an XML-based format) that I finally caught the vision.
I think that's how it is with XML.
If you build webpages, it's not too hard to catch the vision of what HTML is all about, because it's easy to see what the tags do. When you see a webpage with interesting formatting, you view the source and learn what HTML tags make it look that way. Very practical.
XML is different, because until you start looking at specific XML formats that actually do something in a program that you use, there's nothing practical about it at all. In HTML, the <b> tag makes text bold. In XML, the <b> tag could mean anything: the second item in a list, what song is on side b of a record, Ben's choice (versus the <c> tag which might be Charlie's choice), etc. The problem is that XML is not a language--it's more like an alphabet. The same alphabet can be used to spell words in English or Spanish. Learning the alphabet without knowing a language is about as useful and interesting as memorizing the shapes of cracks in the sidewalk.
XML is not so much a file format itself as it is a set of basic building blocks for defining file formats--a set of rules for how to create and process documents. Because every XML-based format follows the same rules, people who write programs that work with those documents can reuse existing code for a lot of the work rather than having to write everything from scratch. Here are a few of the rules:
* All of the data in a document is contained in "elements".
* Elements must either have a starting and ending tag or be empty.
* Starting tags begin with "<" followed by the element name.
* Ending tags begin with "</" followed by the element name.
* Empty tags begin with "<" followed by the element name and end with "/>".
* If a less than sign (<) occurs in the data (not as the first character in a tag, as described above), it must be "escaped".
* One way to escape a less than sign is to replace it with "&lt;".
etc.
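To see a few of these rules in action, here is a minimal sketch using Python's standard library. Python is simply my choice for illustration, and the helper name is_well_formed is made up for this example; any generic XML parser should behave much the same way.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(text):
    """Return True if a generic XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# An element with a starting and an ending tag.
print(is_well_formed("<note>hello</note>"))   # True
# An empty element.
print(is_well_formed("<note/>"))              # True
# A starting tag with no ending tag breaks the rules.
print(is_well_formed("<note>hello"))          # False
# A raw less than sign in the data breaks the rules too.
print(is_well_formed("<note>1 < 2</note>"))   # False
# Escaping the less than sign fixes it.
print(escape("1 < 2"))                        # 1 &lt; 2
print(is_well_formed("<note>" + escape("1 < 2") + "</note>"))  # True
```

Note that the parser never needs to know what a "note" is. It only checks that the document follows the rules.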
You may have noticed that none of those rules say anything about how a program should display the data. In fact, none of the rules of XML say anything about how to display the data, nor (with a few exceptions) anything at all about what the data means. Remember, XML is not the language, it's the alphabet (sort of). For an XML document to mean anything, you need to know the language.
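Since RSS is currently the most widely known "language" (or more accurately "format") built from the XML "alphabet", I'll use it to explain further. A very basic RSS document might look like this (simplified: a real RSS 2.0 feed also wraps everything inside the rss element in a <channel> element, which I've left out to keep the example short):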
<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0">
<title>My RSS feed</title>
<link>http://example.org/foo/</link>
<item>
<title>My cat can talk!</title>
<link>http://example.org/foo/talking-cat.html</link>
<description>Today, as I was waking up, I'm pretty sure I heard my cat ask for a bagel with cream cheese.</description>
</item>
<item>
<title>My bed rotates during the night</title>
<link>http://example.org/foo/rotating-bed.html</link>
<description>Last night I went to sleep with my head at the top of my bed, but when I woke up, it was at the side. The bed must have rotated during the night.</description>
</item>
</rss>
The only part of that document whose meaning XML itself defines is the first line (the XML declaration), which says that this document follows the rules of XML version 1.0 and that it uses the "UTF-8" character encoding. All of the other tags in the document are only meaningful to programs that know RSS. Other XML-based formats could use the same tag names to mean something completely different. A generic XML processing library could work with the structure of the document, but wouldn't have a clue as to what to do with the data contained inside that structure.
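Here is a rough sketch of what "working with the structure" looks like in practice. It uses Python's standard xml.etree.ElementTree module on a trimmed copy of the sample feed; the outline function is a made-up helper for this example. The code walks the document and prints the nesting of the elements without knowing anything about RSS.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the sample feed, passed as bytes so that the parser
# reads the encoding from the XML declaration.
FEED = b"""<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<title>My RSS feed</title>
<link>http://example.org/foo/</link>
<item>
<title>My cat can talk!</title>
<link>http://example.org/foo/talking-cat.html</link>
</item>
</rss>"""


def outline(element, depth=0):
    """Return one indented line per element, using tag names only."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(FEED)
print("\n".join(outline(root)))
```

The same function would work just as well on an XHTML page or any other XML-based format, which is exactly the point: the library understands the structure, not the meaning.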
The <rss> and </rss> tags simply mark the beginning and ending of the RSS feed element. Every XML document must have exactly one element like that, which contains all other elements in the document. It is called the "document element" (or "root element").
The first "title" element specifies the title of the RSS feed, and the first "link" element specifies a link for the feed--usually a link to a webpage that contains the same data in HTML format.
There are two "item" elements in the document, each of which, in this case, relates to one entry in a weblog. Each item element contains a number of other elements which specify something about the item element that contains them--its title, its link, and its description (which is usually an excerpt from or a description of what you'll find if you follow the link).
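A program that does know RSS can pull those pieces out by name. The following is only a sketch, assuming the simplified feed layout used in this article; read_items is a hypothetical helper, and a real feed reader would handle many more elements and error cases.

```python
import xml.etree.ElementTree as ET

FEED = b"""<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<title>My RSS feed</title>
<link>http://example.org/foo/</link>
<item>
<title>My cat can talk!</title>
<link>http://example.org/foo/talking-cat.html</link>
<description>I'm pretty sure I heard my cat ask for a bagel.</description>
</item>
<item>
<title>My bed rotates during the night</title>
<link>http://example.org/foo/rotating-bed.html</link>
<description>The bed must have rotated during the night.</description>
</item>
</rss>"""


def read_items(feed_bytes):
    """Collect title, link and description from every item element."""
    root = ET.fromstring(feed_bytes)
    items = []
    # iter() finds item elements at any depth, so this also works for
    # feeds that nest their items more deeply than the sample does.
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        })
    return items


# In this simplified layout the feed title is a direct child of rss.
print("Feed:", ET.fromstring(FEED).findtext("title"))
for entry in read_items(FEED):
    # How to lay this out is entirely the reading program's decision.
    print("-", entry["title"], "<" + entry["link"] + ">")
```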
So how does an RSS reader display the feed? Usually, the title is displayed as a clickable link to the URL from the link element, with the description below, but none of that is required. Font sizes, colors, how much of each piece of data to display, etc., are all up to the program displaying the feed. The feed itself has no way of specifying those things, because RSS doesn't define any way to specify formatting.
Okay, that's not entirely true. The description element can contain HTML tags which specify how to format the description (though in practice, many RSS readers remove some or all HTML tags from the description before displaying it). For example, the following description element says to make the word "love" bold:
<description>I &lt;b>love&lt;/b> sushi!</description>
You'll notice that I didn't write:
<description>I <b>love</b> sushi!</description>
That's because "b" isn't an RSS element name, so as noted above, the "<" characters in the HTML bold tags must be escaped as "&lt;" (and many publishers also escape the ">" characters as "&gt;", although that's not required).
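The difference is easy to see by feeding both versions to a generic parser. This is a small sketch using Python's standard xml.etree.ElementTree module; the variable names are mine.

```python
import xml.etree.ElementTree as ET

escaped = ET.fromstring("<description>I &lt;b>love&lt;/b> sushi!</description>")
unescaped = ET.fromstring("<description>I <b>love</b> sushi!</description>")

# Escaped: the parser hands back the HTML as plain text, with no child elements.
print(escaped.text)       # I <b>love</b> sushi!
print(len(escaped))       # 0

# Unescaped: the parser sees a child element named "b", and the
# sentence is split into pieces around it.
print(repr(unescaped.text))     # 'I '
print(unescaped[0].tag)         # b
print(unescaped[0].text)        # love
print(repr(unescaped[0].tail))  # ' sushi!'
```

In the escaped version, the RSS reader receives the description as one string and can decide for itself whether to treat it as HTML.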
So XML doesn't specify anything about display formatting, and RSS doesn't define any tags that control formatting either (it leaves that up to HTML). Does that mean that XML can't be used to specify display formatting? Not at all. The most obvious counter-example is XHTML. XHTML is almost identical to HTML, except that a few rules have been added or changed to ensure that XHTML documents follow the rules of XML. For example, the line break tag must be written as an empty tag (e.g., <br/>) rather than simply as <br>. Nothing in the rules of XML prevents the XHTML language from specifying how the data is to be formatted.
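The line break rule can be checked with the same kind of generic parser. This is a minimal sketch in Python; parses_as_xml is a made-up helper name, and the two paragraphs are my own test strings.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(markup):
    """Return True if a generic XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# XHTML style: the line break is written as an empty tag.
print(parses_as_xml("<p>first line<br/>second line</p>"))  # True
# HTML style: the br element is opened but never closed.
print(parses_as_xml("<p>first line<br>second line</p>"))   # False
```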
I've gone on long enough for now about what XML is. But I need to get back to one of the questions in this article's title: why should you care what XML is? The answer to that question is that until you start working with documents in a particular XML-based format (like I did when I started working with RSS feeds), there's little reason for you to bother studying the specifics of XML. If you understand the gist of this document, you probably know enough.