Dorward

Comments and CDATA

02 February 2009

In the beginning…

In the beginning there was the browser, and the browser supported HTML. In those days, HTML was a simple language that stood on its own and had no companions to join it on the client.

The duo — dynamic and stylish

Then came JavaScript and CSS (along with a few other languages that are of no consequence to this tale) and they could be linked to the document from external URIs or they could be embedded in the document itself.

So it came to be that HTML 3.2 saw the introduction of the script element and the style element, but this created a challenge that had not previously been encountered in the language, for content was being added to the document that was not intended to be displayed as text.

Upon encountering <font size="+1">Hello, world</font>, a browser which did not recognise the font element would simply render the text without changing the size, an action with no serious consequence. On the other hand, given <script type="text/javascript"> alert('Hello, world'); </script>, a browser which did not recognise the script element would render the code of the script as if it were part of the content, and this would be distinctly unwelcome.

CDATA — changing the rules of content

There was, however, a solution! The script and style elements were defined as containing CDATA, a fact which had the effect of changing the normal rules that applied to characters such as < and & in HTML so they did not indicate the start of a tag or entity reference.

(The specification for CDATA is available if you wish to read more about its effects.)

This had two effects — it made it easy for scripts to use expressions such as if (a<b) without starting a b start tag by accident, and allowed script and style data to be hidden from older user agents.

Comments around the issue

The hiding technique worked on the same principles for both script and style elements: something which matched the syntax of an HTML comment was wrapped around the inside of the element.

<script type="text/javascript">
<!--
alert('Hello, world');
// -->
</script>

The technique was only supported with compatible scripting and style engines, but was defined in the HTML 4.x series of specifications and no languages of consequence failed to support the technique.

But how was this handled in the clients? The principles are the same for scripting and styling, but we will use the above example for the sake of brevity. For the rest of the document, assume that anything discussing scripting applies to styling as well, but that the syntax might have slight differences.

In supporting user agents

When a user agent which understood the script element encountered this construct, it would parse the script element as a whole, then pass the content to its scripting engine where the script would be executed. At some point in this process the code that looked like a comment would be discarded.
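As an illustration of that discarding step: present-day JavaScript engines still treat <!-- as the start of a single-line comment in classic (non-module) scripts, a web-compatibility rule in the ECMAScript specification. The sketch below assumes such an engine. The name runHidden is hypothetical, and new Function merely stands in for the browser handing the element's content to its scripting engine.

```javascript
// Script text as it would appear inside a script element that uses the
// comment-hiding technique.
const hiddenScript = [
  "<!--",
  "var greeting = 'Hello, world';",
  "return greeting;",
  "// -->",
].join("\n");

// Hypothetical helper: compile the text as a function body and run it.
// Assumes an engine that implements HTML-like comments for classic scripts.
function runHidden(source) {
  return new Function(source)();
}

const hiddenResult = runHidden(hiddenScript);
console.log(hiddenResult); // "Hello, world" on an engine that supports this
```

The opening <!-- swallows only its own line, and the closing --> is hidden behind an ordinary // comment, so the engine sees nothing but the script in between.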

If the user agent didn't support scripting, then the process would be the same, except that the content would be discarded rather than being processed by a scripting engine.

In user agents that lacked support for HTML 3.2

If the user agent lacked support not only for scripting, but for the script element, then it would not recognise it as containing CDATA and would process the content as normal content.

Upon encountering the code that would have been a comment were it not for the CDATA, such a user agent would promptly treat it as a comment and discard the script without rendering it to the user.

It was as simple and elegant an approach to the issue as could be asked for if external files were not an option (at least so long as nobody attempted to use it on a script which decremented a value with --, a symbol which matched the comment close sequence).

Time passes and cults grow

Man did not forget how to speak, apes did not take over the world, and underground dwellers did not start worshipping nuclear missiles, but a cargo cult developed around the comment syntax. People used it, and copied and pasted code into their web pages without understanding it, even though it had been many years since anyone had seen a user agent which did not understand that the content of script and style elements was not to be displayed to the user as part of the page.

Research says that shows with "X" in the name get higher ratings.

Along came XHTML 1.0, which was almost exactly the same as HTML 4.01. It had a number of changes, some minor, some major, but the most important was that it was defined as an application of XML instead of SGML.

XML DTDs are simpler creatures than SGML DTDs and lack the ability to define an element as containing CDATA. The effects of this vary depending on what code is wrapped around the inside of the script, and how the code is being parsed (since, in an effort to ease the transition to XML, it was decided to permit XHTML 1.0 to be served as text/html and thus treated with the rules of HTML).

When XML attacks

Most of these issues show up only under the influence of an XML parser (or a validator). In HTML mode they usually just lurk beneath the surface, waiting to attack when some poor soul comes by years later, finds a document claiming to be XML, and tries to treat it as such. So let us focus on XML and dip into HTML mode when differences more significant than "the browser error-corrects" crop up.

Comments are comments

Since script and style elements in XHTML do not contain CDATA, anything that looks like XML inside the element is processed as XML.

This means that the comment syntax used to protect older browsers is treated as a real comment. With the entire script being wrapped in a comment, there is no script left to pass to the scripting engine and the element is effectively empty.

Naturally, when the document is processed in HTML mode, this does not happen, so there are a large number of pages out there containing scripts and styles that will not apply if the document is ever processed in the proper mode. The cargo cult is strong.
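To make the effect concrete, here is a rough simulation of what reaches the scripting engine once an XML parser has dropped the comment. It uses a regular expression as a stand-in and is not a real XML parser. The function name textSeenByScriptEngineInXml is hypothetical.

```javascript
// Rough stand-in for what an XML parser does with comments in element
// content: they are dropped before the text reaches the scripting engine.
// This is a regex simulation for illustration, not a real XML parser.
function textSeenByScriptEngineInXml(elementContent) {
  return elementContent.replace(/<!--[\s\S]*?-->/g, "");
}

// The content of the script element from the earlier example.
const wrappedContent = "\n<!--\nalert('Hello, world');\n// -->\n";
const remainingContent = textSeenByScriptEngineInXml(wrappedContent);

// Only whitespace is left, so there is nothing for the engine to run.
console.log(JSON.stringify(remainingContent.trim())); // ""
```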

Naked coding

With the comment syntax proving to be a problem, let's take it out. What happens next?

Let us return to the old if (a<b) construction. This is not well-formed XML and will cause an XML parser to give up and die (throwing, for instance, the yellow screen of death in Firefox).

The solution? The solution is the same as any other time we wish to represent a < character in a document.

Entity references

if (a&lt;b)

The XML parser will convert the entity reference into a less than sign before handing it off to the scripting engine. Simple, although somewhat tiresome when writing many scripts.
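If the escaping were automated, it might look like the following minimal sketch. The helper name escapeForXml is hypothetical, and it assumes the output is only ever read by an XML parser. Escaping > is not strictly required, but it also takes care of any ]]> sequence in the script.

```javascript
// Hypothetical helper: escape inline script text for an XML-parsed XHTML
// document. The ampersand must be replaced first, otherwise the "&" in
// each freshly inserted "&lt;" would itself be escaped.
function escapeForXml(source) {
  return source
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

const escapedForXml = escapeForXml("if (a<b && b>c) { go(); }");
console.log(escapedForXml);
// if (a&lt;b &amp;&amp; b&gt;c) { go(); }
```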

Old demons

At this point, HTML mode rears its ugly head. In HTML, the element consists of CDATA, so the ampersand does not indicate the start of an entity reference, and the string &lt; is passed to the scripting engine unchanged, where it promptly causes a scripting error.
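The scripting error is easy to reproduce by handing both versions of the text straight to a JavaScript engine, which is roughly what HTML mode does. This is a small sketch; the helper name compiles is hypothetical, and new Function is used only to parse the text without running it.

```javascript
// Hypothetical check: does the text parse as JavaScript?
// new Function compiles the source without executing it.
function compiles(source) {
  try {
    new Function(source);
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error;
  }
}

const plainSource = "if (a<b) { b = a; }";
const entitySource = "if (a&lt;b) { b = a; }";

console.log(compiles(plainSource));  // true
console.log(compiles(entitySource)); // false: read as "a & lt ; b)"
```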

Everything in its place

Since we can't use an entity reference while remaining compatible with both modes, we shall have to seek an alternative, and the XHTML 1.0 specification is happy enough to provide us with one.

Use external style sheets if your style sheet uses < or & or ]]> or --. Use external scripts if your script uses < or & or ]]> or --.
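That guideline is mechanical enough to check automatically. The sketch below is one possible check, assuming a plain substring search is good enough. The names RISKY_SEQUENCES and riskySequencesIn are hypothetical.

```javascript
// Character sequences that XHTML 1.0 Appendix C names as reasons to move
// a script or style sheet into an external file.
const RISKY_SEQUENCES = ["<", "&", "]]>", "--"];

// Hypothetical helper: list the risky sequences found in the source text.
// An empty result means the guideline does not call for an external file.
function riskySequencesIn(source) {
  return RISKY_SEQUENCES.filter((sequence) => source.includes(sequence));
}

// One match: the decrement operator.
console.log(riskySequencesIn("while (i--) { total += i; }"));

// No matches: this script could stay inline.
console.log(riskySequencesIn("alert('Hello, world');"));
```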

CDATA's big trumpet

Now, despite the requirement (according to XHTML 1.0 section 5.1) that XHTML documents served as text/html "follow the guidelines set forth in Appendix C", the authors saw fit to mark Appendix C as "informative", and some people really, really wanted to include their scripts and styles inline.

Consequently, it has become common practice to make use of an XML feature that allows CDATA sections to be indicated explicitly.

<script type="text/javascript">
<![CDATA[
  …
]]>
</script>

HTML mode — the revenge …

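Of course, things can't be that easy. In HTML mode, the CDATA markers aren't safe (despite being part of the SGML specification, their use is discouraged due to poor user agent support). The content of the element, markers included, is handed to the scripting engine as script, so the markers have to be hidden from it inside comments written in the scripting language.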

<script type="text/javascript">
// <![CDATA[
  …
// ]]>
</script>

Foul creatures from beyond the dawn of time

As an aside, I should mention those ancient user agents that the comment syntax was created to protect. Ian Hickson (Hixie) has put together some research on the subject, which you might find useful if you are concerned about such user agents.

The wrap up — a summary and recommendations

Embedding scripts and styles in HTML documents is a complicated issue, mostly due to backwards compatibility considerations. These are my recommendations, in order of preference, for handling them.

Ideally: Don't

The simplest solution is to avoid embedding scripts and styles, and to keep them in external files instead. This has the twin benefits of avoiding all the issues of escaping their content and of making for a more cache-friendly website.

If, however, you cannot avoid using embedded scripts:

If it cannot be avoided: Use HTML and don't use comment syntax

Dealing with the compatibility issues involved in XHTML adds a fair amount of complexity, increasing the probability of something going wrong. Most authors never realise the benefits of XHTML, so the extra effort of using the language is usually wasted anyway.

The comment syntax is helpful only to user agents that you are extremely unlikely to encounter and that, as such, are not worth worrying about.

If you end up using XHTML, and embedding scripts and styles in it, then use the following technique:

If XHTML is mandatory: Use XHTML with CDATA markers and comments appropriate for the script or style language used.

For JavaScript:

<script type="text/javascript">
// <![CDATA[
  …
// ]]>
</script>

For CSS:

<style type="text/css">
/* <![CDATA[ */
  …
/* ]]> */
</style>
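The two templates differ only in the comment delimiters that hide the CDATA markers, so the wrapping could be generated. The following is a minimal sketch, not a definitive implementation. The names COMMENT_STYLES and wrapInCdata are hypothetical, and it assumes the source must not contain ]]>, which would end the CDATA section early in XML mode.

```javascript
// Comment delimiters per embedded language. These hide the CDATA markers
// from the script or style engine when the page is parsed as HTML.
const COMMENT_STYLES = new Map([
  ["javascript", { open: "// ", close: "" }],
  ["css", { open: "/* ", close: " */" }],
]);

// Hypothetical helper: wrap inline code in commented-out CDATA markers.
function wrapInCdata(source, language) {
  const style = COMMENT_STYLES.get(language);
  if (!style) {
    throw new Error("Unsupported language: " + language);
  }
  if (source.includes("]]>")) {
    // "]]>" would end the CDATA section early in XML mode.
    throw new Error('Source contains "]]>"; use an external file instead');
  }
  return [
    style.open + "<![CDATA[" + style.close,
    source,
    style.open + "]]>" + style.close,
  ].join("\n");
}

console.log(wrapInCdata("if (a<b) { go(); }", "javascript"));
console.log(wrapInCdata("a > b { color: red; }", "css"));
```

The result still has to be placed inside a script or style element by hand. The helper only produces the element's content.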

Finally…

There has been a reasonable amount of code since I mentioned the ideal solution, so don't forget that the best way to deal with scripts and styles is usually to put them in an external file.