Support for XHTML in 2008

23 January 2008

XHTML has been one of my pet hates for a while, and I've ranted about it at length. This isn't going to be another one of my rants made up mostly of pointing to different specifications and complaining about the holes in them (although I'm tempted).

XHTML Desktop: A Web Geek's Fist of Fury by Anthony Baker

One of my biggest concerns with the use of XHTML on the client side today is client support. You can either pretend it is HTML, in which case you have to write something that conforms to XHTML rules while also conforming (as closely as XHTML allows) to HTML rules (and then there is Appendix C), or you can enjoy very limited support from clients (Internet Explorer being the prime example of a browser which doesn't parse documents served as application/xhtml+xml).

It is now 2008, and Internet Explorer 8 has started peeking its little blue head over the horizon. I like some of what I see and dislike other things. It may or may not support XHTML; it isn't clear which yet.

Let us, however, assume that it will. What other clients are going to be left out in the cold?

Well, assuming the XHTML is well-formed, Gecko-based browsers (including Firefox), Opera and Safari will all be fine. That covers the big four web browsers (we'll leave the question of less commonly used browsers for another time).

Web browsers are not the only clients on the Internet though. There are other user agents which lots of people care about: search engine robots.

Time to construct some test pages then. I built three test pages, all with a phrase that is not commonly used on the WWW.

An XHTML document served as application/xhtml+xml
A slightly modified version of the same document, but with errors in it that prevent it from being well-formed
An index page (written in HTML 4) linking to them
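
To make the difference between the first two documents concrete, here is a minimal sketch in Python of the check an XML parser effectively performs. The two strings are hypothetical stand-ins, not the real test pages, and the function name is my own invention; the DOCTYPE and XML declaration are left out to keep the example short.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the well-formed test page (not the real document).
WELL_FORMED = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Test page</title></head>
  <body><p>A phrase that is not commonly used.</p></body>
</html>"""

# The same markup with the closing </p> removed, so the tags no longer nest.
BROKEN = WELL_FORMED.replace("</p>", "")


def is_well_formed(markup: str) -> bool:
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    print(is_well_formed(WELL_FORMED))
    print(is_well_formed(BROKEN))
```

A client that treats application/xhtml+xml as XML has to reject the second string outright, whereas a tag soup parser would carry on and guess where the paragraph ends.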

I let them sit on the web for a while so they could be indexed, and then I went after the search engines. I expected the index page to appear: it is plain old HTML, and search engines have never had issues with that. The index page would serve as a control document to indicate when the documents had been indexed.

As for the actual experimental documents: I hoped the XHTML document would also appear, perhaps with a "Convert to HTML" link akin to those some search engines include for PDF documents. As for the document with errors, I initially hoped it would not show up, but thinking about it now, there is something to be said for falling back to a tag soup slurper when XML parsing of a webpage fails (I'm fence sitting on that question, so please don't take the above as my blessing for tag soup slurping XML).

Results (as of January 23, 2008)
Search engine	Indexed XHTML document	Indexed error document
Google	No	No
Yahoo!	No	No
Altavista	No	No

Not the results I hoped for.

When I set this up, I had only been thinking about Google (which goes to show how big a share it has of the market and mindset), but decided to test other search engines at the last minute. As a result, not all of them had gotten around to indexing even the index page.

Search engines that have not indexed any of the pages at the time of writing include Live Search and Ask.com.

I ended up grepping my server logs looking for bots that might have indexed the site. One search engine I found, Majestic-12, looks quite interesting, but answers my question without need for experimentation: "The only supported content types are 'text/html' and 'text/plain'."
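
For anyone wanting to repeat the log trawl, something along these lines would do it. This is a rough sketch, not the command I actually ran: it assumes the server writes the Apache "combined" log format, the keyword list is a guess at what bots call themselves, and the sample lines (addresses, sizes, user agent strings) are invented for illustration.

```python
import re
from collections import Counter

# Assumed Apache "combined" format:
# ... "request" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Heuristic only: plenty of crawlers use none of these words.
BOT_HINT = re.compile(r"bot|crawl|spider|slurp", re.IGNORECASE)


def bot_agents(lines):
    """Count requests per user agent string that looks like a robot."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and BOT_HINT.search(match.group("agent")):
            counts[match.group("agent")] += 1
    return counts


# Invented sample lines, for illustration only.
SAMPLE_LOG = [
    '192.0.2.1 - - [23/Jan/2008:10:00:00 +0000] "GET /xhtml/ HTTP/1.1" '
    '200 1234 "-" "ExampleBot/1.0 (+http://bot.example/)"',
    '192.0.2.2 - - [23/Jan/2008:10:05:00 +0000] "GET /xhtml/ HTTP/1.1" '
    '200 1234 "-" "ExampleBrowser/1.0"',
    '192.0.2.3 - - [23/Jan/2008:10:09:00 +0000] "GET /robots.txt HTTP/1.1" '
    '404 0 "-" "ExampleCrawler/0.1"',
]

if __name__ == "__main__":
    for agent, hits in bot_agents(SAMPLE_LOG).items():
        print(hits, agent)
```

Matching on the user agent string is crude, since anything can claim to be anything, but it is enough to turn up names worth looking into by hand.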

If you find a search engine that has indexed any of these pages (even if it is a negative result, as all the ones so far have been) then please let me know. I'm quite happy for people to submit the URL to the test suite (http://stone.thecoreworlds.net/xhtml/) to search engines to help them index it.

So the current state of play is that if Internet Explorer 8 were released today, and suddenly gained a huge market share, then anybody who cared about placement in search engines would still be unable to fully switch to XHTML.

Of course, we don't know what is going on behind the scenes at search engines. It is possible that a number of them are already indexing XHTML but not including the results in their indexes because of Internet Explorer (and so can flick a switch to turn on support when XHTML-capable clients are more popular). I might expand these tests in the future to include an XHTML document that links to a second XHTML document, making sure that is the only link to it. Watching the log files to see if any clients request the second document could provide insights.
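
Should that follow-up experiment happen, the log watching could look roughly like this. It is a sketch under assumptions: the Apache "combined" log format again, a hypothetical path for the second document, and made-up sample lines. Any robot that turns up in the result must have parsed the first XHTML document well enough to find the link.

```python
import re

# Assumed Apache "combined" format; only the request and user agent are kept.
LOG_LINE = re.compile(
    r'"(?P<request>[^"]*)" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Hypothetical path of the document linked only from the first XHTML page.
SECOND_DOCUMENT = "/xhtml/second.xhtml"


def agents_requesting(lines, path):
    """Return the set of user agents that requested exactly this path."""
    agents = set()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue
        parts = match.group("request").split()  # method, path, protocol
        if len(parts) >= 2 and parts[1] == path:
            agents.add(match.group("agent"))
    return agents


# Invented sample lines, for illustration only.
SAMPLE_LOG = [
    '192.0.2.1 - - [01/Feb/2008:09:00:00 +0000] "GET /xhtml/ HTTP/1.1" '
    '200 1234 "-" "ExampleBot/1.0"',
    '192.0.2.1 - - [01/Feb/2008:09:00:05 +0000] '
    '"GET /xhtml/second.xhtml HTTP/1.1" 200 987 "-" "ExampleBot/1.0"',
    '192.0.2.9 - - [01/Feb/2008:09:30:00 +0000] "GET /xhtml/ HTTP/1.1" '
    '200 1234 "-" "OtherBot/2.0"',
]

if __name__ == "__main__":
    print(agents_requesting(SAMPLE_LOG, SECOND_DOCUMENT))
```

A request for the second document only shows that a link was followed, not that the content ended up in an index, so it would complement the search tests rather than replace them.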

For now though, XHTML isn't supported by the browser most people use, and it doesn't seem to be supported by any search engine I can find, so it is still not worthwhile using it client side.

XHTML Desktop: A Web Geek's Fist of Fury by Anthony Baker is used under a Creative Commons license. Danger Risk of Failing by David Dorward is used under an "I took it myself" license but is also available under a Creative Commons license.

Update: Updates to this research can be tracked via the project page.