XHTML and Search Engines
Research into what search engines do with documents served as application/xhtml+xml
A Little XHTML Background
In 1999, the W3C released a specification for XHTML 1.0. It was supposed to be identical to HTML 4.01 but expressed in XML rather than SGML.
It hasn't been much of a success: almost all the XHTML you find on the web is masquerading as HTML because Internet Explorer doesn't support the language yet (at the time of writing, the latest version of Internet Explorer is version 7). When XHTML pretends to be HTML, the advantages of using it on the client side go away.
Client Support in 2008
Most of the major browsers have supported XHTML for a while and, with Internet Explorer 8 on the horizon, it is just about possible that XHTML support will finally be added to the biggest of the Big Four. Assuming it is, will XHTML be feasible on the WWW?
Most authors care, first and foremost, that their pages work in web browsers, but there are other user agents out there. Coming a close second to browsers are search engine indexing robots: a very large number of authors want their pages to show up in search engines.
This raises the question: Do search engines support XHTML?
Experiments in Content-Type
I decided to do some experimentation. I created a set of test pages with application/xhtml+xml content types (one well formed, and one with errors that should cause XML parsers to give up and stop processing), along with an index page (served as text/html). Then I waited for search engines to start finding them. Once the control page shows up in a search engine's results, it is reasonable to assume that the other pages would have shown up too if that engine were going to index them.
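A setup like the one described above can be sketched in a few lines of Python. This is only an illustration, not the configuration actually used for the experiment: the paths (`/good.xhtml`, `/bad.xhtml`), the page contents, and the names `PAGES`, `is_well_formed`, `TestPageHandler` and `serve` are all assumptions made up for the example, and the index page is assumed to play the role of the control.

```python
import xml.etree.ElementTree as ET
from http.server import BaseHTTPRequestHandler, HTTPServer

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical test documents; the real pages are not reproduced here.
WELL_FORMED = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    f'<html xmlns="{XHTML_NS}"><head><title>Well formed</title></head>'
    "<body><p>Test page</p></body></html>"
)
# Deliberately broken: the <p> element is never closed.
NOT_WELL_FORMED = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    f'<html xmlns="{XHTML_NS}"><head><title>Broken</title></head>'
    "<body><p>Test page</body></html>"
)
INDEX = (
    "<!DOCTYPE html><html><head><title>Index</title></head><body>"
    '<a href="/good.xhtml">good</a> <a href="/bad.xhtml">bad</a>'
    "</body></html>"
)

# Path -> (content type, body). The paths are assumptions for this sketch.
PAGES = {
    "/": ("text/html", INDEX),
    "/good.xhtml": ("application/xhtml+xml", WELL_FORMED),
    "/bad.xhtml": ("application/xhtml+xml", NOT_WELL_FORMED),
}


def is_well_formed(document: str) -> bool:
    """Return True if a conforming XML parser accepts the document."""
    try:
        ET.fromstring(document.encode("utf-8"))
        return True
    except ET.ParseError:
        return False


class TestPageHandler(BaseHTTPRequestHandler):
    """Serve each test page with the content type listed in PAGES."""

    def do_GET(self):
        page = PAGES.get(self.path)
        if page is None:
            self.send_error(404)
            return
        content_type, body = page
        payload = body.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", f"{content_type}; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8000) -> None:
    """Start the test server on localhost; call this manually to try it out."""
    HTTPServer(("127.0.0.1", port), TestPageHandler).serve_forever()
```

Calling `is_well_formed` on the two XHTML documents confirms that a strict XML parser accepts the first and gives up on the second, which is the distinction the experiment relies on.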
After publishing my initial results, Mike Davies suggested that the apparent file extension in the URI might be a factor. I've added two new pages, still served as application/xhtml+xml but with .html file extensions. I haven't yet gathered any results from them though.
Here are the most recent results, showing which pages each search engine has indexed:
| Search engine | Control | XHTML | Not well-formed |
|---|---|---|---|
| | Yes | Yes | Yes |
| Yahoo! | Yes | Yes | Yes |
| Altavista | Yes | Yes | Yes |
| LiveSearch | Yes | No | No |
| Ask.com | No | No | No |
Additionally, Majestic-12 explicitly states that it does not support anything but text/html and text/plain.
Showing XHTML Pages in Search Results Would Be Silly
It is a surprise that search engines show XHTML pages in their results, since Internet Explorer users will be presented with a download dialogue box instead of a rendered page if they follow the links, and users of other browsers will get error messages for the pages that are not well formed. If XHTML support becomes common among the browsers that people use, then being in a position to flick a switch and add XHTML results would be an advantage to any search engine.
This leads into my next experiment: I'm currently collecting information about the Accept headers sent by various search engine indexing bots to see if any of them are asking for XHTML. This is less relevant than it was before the results of the earlier experiment changed to reveal that XHTML documents were being indexed, but it might yield something of interest.
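As a rough illustration of what "asking for XHTML" means in practice, here is a minimal Python sketch that reads an Accept header and reports the preference (q-value) given to application/xhtml+xml. The function name `xhtml_preference` is hypothetical, and the header strings in the usage example are made-up samples, not values recorded from any real bot. The sketch only counts explicit mentions of the media type; a wildcard such as `*/*` is deliberately not treated as a request for XHTML.

```python
def xhtml_preference(accept_header: str) -> float:
    """Return the q-value for application/xhtml+xml, or 0.0 if not listed.

    Only an explicit application/xhtml+xml entry counts; wildcards such as
    */* or application/* are ignored on purpose.
    """
    for item in accept_header.split(","):
        parts = [part.strip() for part in item.split(";")]
        if parts[0].lower() != "application/xhtml+xml":
            continue
        q = 1.0  # a media range without a q parameter defaults to 1
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unparsable q-value as "not wanted"
        return q
    return 0.0


# Illustrative header strings only (not captured from real crawlers).
samples = [
    "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "text/html, application/xhtml+xml;q=0.5",
    "*/*",
]
for header in samples:
    print(f"{xhtml_preference(header):.1f}  {header}")
```

A bot whose header yields a value above zero is explicitly advertising XHTML support; one that sends only `*/*` or omits the header leaves the question open.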
History of These Experiments
| Date | Update |
|---|---|
| 23 January 2008 | My initial post |
| 29 January 2008 | This page created |
| | File extensions and accept headers |
| 3 February 2008 | Initial investigation incorrect |
You can follow updates through the news feed of my blog.