Comprehending Atom
by ZetaGecko
I spent a little time today rereading some of the pages on the Atom wiki and elsewhere in an attempt to understand some issues that aren't spelled out clearly in the specification. I think maybe I understand now. Maybe. I guess I'll start this post with my conclusion: Atom needs a more complete spec, all in one easy-to-find place, before an outsider will be able to implement it correctly.
The question I went looking for an answer to is, what does it mean when they say 'mode="escaped"'? What I found was a lot of discussion about the issue and what looked like conclusions, but there seemed to be different conclusions on different parts of the page, and there are still some questions that I'm not sure I found answers to.
I believed (and, having looked, have confirmed) that very generally, "escaped" content is content that has been "entity encoded" (or, I discovered, entered as CDATA--check your XML reference for what that means if you're interested). For example, rather than just putting an ampersand (&) in your Atom feed, you enter "&amp;". That's necessary to make well-formed XML, so it made sense that you had to do it for Atom. (It's also what you SHOULD do to display an ampersand in a web page, even though browsers don't force you to.)
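To make the ampersand case concrete, here's a small sketch using Python's standard-library XML parser. The element name "content" and the sample text are just placeholders I picked for illustration, not anything taken from the Atom spec.

```python
import xml.etree.ElementTree as ET

# Entity-encoded ampersand: the parser hands back the plain character.
escaped = ET.fromstring("<content>AT&amp;T</content>").text

# The same text wrapped in a CDATA section instead of being entity-encoded.
cdata = ET.fromstring("<content><![CDATA[AT&T]]></content>").text

# A bare ampersand is not well-formed XML, so parsing fails.
try:
    ET.fromstring("<content>AT&T</content>")
    raw_ok = True
except ET.ParseError:
    raw_ok = False

print(escaped, cdata, raw_ok)
```

Both the entity-encoded and the CDATA version come out of the parser as the same string, while the raw ampersand is rejected.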
Where I was a little fuzzy was on the issue of double-escaping. For example, let's say you have a hyphen in your content--not a minus sign--an honest to goodness high-ASCII hyphen (let's assume ISO-8859-1 encoding for the moment). On an HTML page, you can get away with entering the actual hyphen character, or you can use an entity like &ndash; or &mdash;. So you might think you could enter the actual hyphen character in your Atom feed or other XML document. But in fact, many if not all validators and parsers will report an error if you do (I'd have to read the specs a little more to be sure who's right). So apparently you have to entity encode it. But since &ndash; and &mdash; aren't valid XML entities, you have to go one step further and double-escape it as &amp;ndash; or &amp;mdash;.
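Here's a rough way to check the entity part of that with one particular parser (Python's built-in one, with no DTD supplied). The helper name parse_text and the "content" element are made up for the example, and other parsers or documents that declare a DTD may behave differently. The numeric character reference on the last line is an alternative I haven't discussed above; I include it only for comparison.

```python
import xml.etree.ElementTree as ET

def parse_text(xml_text):
    """Return the element's text if xml_text is well-formed, else None."""
    try:
        return ET.fromstring(xml_text).text
    except ET.ParseError:
        return None

# HTML entity used directly: not one of XML's predefined entities.
single = parse_text("<content>&ndash;</content>")

# Double-escaped: the parser yields the HTML entity as literal text.
double = parse_text("<content>&amp;ndash;</content>")

# Numeric character reference: the parser yields the character itself.
numeric = parse_text("<content>&#8211;</content>")

print(repr(single), repr(double), repr(numeric))
```

With this parser, the single-escaped form fails to parse, the double-escaped form comes back as the string "&ndash;" for an HTML consumer to interpret, and the numeric reference comes back as the character itself.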
This is where things get unclear. We know (by experience, even if we haven't found the place in the specs that tells us so) that a hyphen has to be double-escaped to get through an XML validator or parser. But we also know that a web browser will accept it in its raw form or escaped once. So when the Atom folks say that you only have to escape things once, or that they're resolving the double-escaping issue in RSS, what exactly do they mean? I'm guessing that they mean that if you've got your data in a form where it's already "properly" escaped HTML (no raw hyphen characters lying around, for instance), then you can simply convert all ampersands to &amp;, and then all less-than and greater-than signs to &lt; and &gt; (and, at least in attribute values, all quotes to &quot;), and you're good to go. My point is, the spec doesn't include a sentence like that. It just says that "escaped" is a valid mode.
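If my guess is right, the whole procedure is only a few lines. The sketch below follows that guess and nothing more: the function name escape_for_atom, the in_attribute flag, and the sample HTML are my own inventions, and none of this is confirmed by the spec. The input is assumed to be HTML that's already properly escaped.

```python
import xml.etree.ElementTree as ET

def escape_for_atom(html, in_attribute=False):
    """Escape already-valid HTML once so it can sit inside an XML element."""
    # Ampersands go first, so the ones introduced below are not re-escaped.
    out = html.replace("&", "&amp;")
    out = out.replace("<", "&lt;").replace(">", "&gt;")
    if in_attribute:
        # Quotes only matter when the text lands in an attribute value.
        out = out.replace('"', "&quot;")
    return out

html = '<p>AT&amp;T &ndash; "fine"</p>'
escaped = escape_for_atom(html)

# Round trip: an XML parser should hand back exactly the original HTML.
element = ET.fromstring('<content mode="escaped">' + escaped + "</content>")
round_trip = element.text

print(escaped)
print(round_trip == html)
```

The round trip is the part I care about: if one pass of escaping on the way in and one pass of XML parsing on the way out gives back the original HTML, then "escape once" at least has a consistent meaning.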
Of course, the current spec version is 0.3. I shouldn't expect it to be complete. But people are beginning to implement Atom. I myself will be adding Atom support to some of my code soon. I hope that these issues will be clearly explained in a new specification draft soon. Otherwise, whether the insiders know what "escaped" means or not, we're probably going to get Atom implementations that do things just the same as the RSS implementations the Atom group was trying to fix.
Oh, before anyone decides that I dislike Atom, "Go Atom!"