Comprehending character sets
by ZetaGecko | XML
After posting yesterday, I did a little more research into why some characters cause XML errors, and learned that if you go by the book, things work as expected, contrary to what real world experience would lead you to believe. It turns out that the hyphen character I was talking about yesterday causes errors because it's not part of the ISO-8859-1 character set, at least not officially! It's one of twenty-four characters added by everyone's favorite monopolistic corrupter of standards, Microsoft. I tried out a few characters (like the copyright symbol) that are part of ISO-8859-1, and they passed with flying colors!
So the deal is this: XML documents are encoded as UTF-8 (a Unicode encoding) by default. If your XML document is in any other character set, you must specify the character set in the <?xml ?> declaration at the top of the document. Then, as long as your document only contains characters that are officially part of that character set, you're well on your way. Next, ampersands (&), less than signs (<) and greater than signs (>) need to be entity-encoded as &amp;, &lt; and &gt; respectively. Finally, quote marks appearing in attribute values delimited by the same type of quote mark must be entity-encoded as &quot; (") or &apos; ('). For example, you have to write <foo bar="This character: &quot; is a quote mark" /> to include a double quote mark in an attribute value that is surrounded by double quotes. Do all this, and character-encoding-wise, you've got a well-formed XML document.
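Here's a rough sketch of those escaping rules in Python, using the standard library's xml.sax.saxutils.escape. The helper names escape_text and escape_attr are made up for illustration, and the sketch assumes attribute values are always wrapped in double quotes.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET


def escape_text(value: str) -> str:
    """Escape &, < and > for use in element content."""
    return escape(value)


def escape_attr(value: str) -> str:
    """Escape &, <, > and double quotes for a double-quoted attribute value."""
    return escape(value, {'"': "&quot;"})


attr = escape_attr('This character: " is a quote mark')
body = escape_text("a < b & c > d")
doc = '<foo bar="%s">%s</foo>' % (attr, body)
print(doc)

# Parsing the result back recovers the original, unescaped strings.
root = ET.fromstring(doc)
```

Parsing the output with any XML parser is a cheap way to check that the escaping did its job.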
What does this mean for programs generating XML from user input? If a user inputs a character that isn't part of the character set you're encoding your document as, you have to convert it to something else or remove it. The simple, but not-so-helpful approach would be to convert it to a question mark or some other character that says "I don't know what this is". Another approach would be to convert it to a character reference. For example, Microsoft's em dash could be converted to its numeric reference. If your document is going to be rendered by a web browser or some other program that understands named entities, you may also have the option of converting it to &mdash;, but here, things get tricky. Why? Because while &mdash; looks fine to a web browser, it's not a valid XML entity (unless a DTD declares it). So what do you do? Before writing it to the XML document, you have to double-escape it as &amp;mdash;.
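To make the options concrete, here's a minimal Python sketch that assumes the document is declared as ISO-8859-1. The sample input string is invented; the "replace" and "xmlcharrefreplace" error handlers are part of Python's standard str.encode.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

# Hypothetical user input: e-acute is in ISO-8859-1, the em dash (U+2014) is not.
user_input = "caf\u00e9 \u2014 R&D"

# Escape markup characters first. The references generated below contain "&"
# and must not be escaped a second time.
escaped = escape(user_input)

# Option 1: replace anything outside the character set with "?".
lossy = escaped.encode("iso-8859-1", errors="replace")

# Option 2: replace it with a numeric character reference. Numeric references
# name Unicode code points, so the em dash becomes &#8212;.
referenced = escaped.encode("iso-8859-1", errors="xmlcharrefreplace")

print(lossy)
print(referenced)

# An XML parser turns the reference back into the original character.
document = b'<?xml version="1.0" encoding="ISO-8859-1"?><p>' + referenced + b"</p>"
round_trip = ET.fromstring(document).text

# Option 3: keep a named HTML entity for a browser to resolve later, which
# means its ampersand has to be escaped on the way into the XML.
double_escaped = escape("&mdash;")
```

The order matters here: escaping after encoding would turn the generated reference into literal text.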
So, which escaping method is best? Good question. I used to prefer using entity names (&mdash;). This makes sense if the document doesn't specify its character encoding, because different operating systems and different fonts may use a different code to represent a particular character. Using the name allows the program that's displaying the document to choose the correct character. But since XML documents either specify their encoding or are UTF-8, encoding as a number is unambiguous. But wait. If the character is officially part of the character set, you don't have to encode it at all. And if it's not, then unless we're always going to assume the Microsoft extension to ISO-8859-1, it is ambiguous. Conclusion: close your eyes and pick whichever method feels best.
Finally, moving from XML to Atom, what does "escaped" mean? It means that you take your data that contains only characters officially in the character set you're using (any that aren't in the character set having been converted to something that is in the character set or dropped), and entity-encode, or "escape", the ampersands, less than signs, greater than signs, and perhaps quote marks, as described above.
At least I'm pretty sure that's what it means.
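If that reading is right, escaping an HTML body for a feed might look something like this sketch. The content element with type="html" follows Atom 1.0 (RFC 4287) and is an assumption here, as is the sample markup; this is a bare fragment, not a complete or namespaced feed.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

# Hypothetical post body. It is plain ASCII, so the character set step is
# already taken care of and only the markup characters need escaping.
html_body = "<p>Fish &amp; chips &mdash; <b>fresh</b></p>"

entry = '<content type="html">%s</content>' % escape(html_body)
print(entry)

# A feed reader's XML parser hands back the original HTML, named entities
# included, ready to be passed to an HTML renderer.
recovered = ET.fromstring(entry).text
```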