Dorward

XHTML and Search Engines

08 February 2008

A Little XHTML Background

In 1999, the W3C released a specification for XHTML 1.0. It was supposed to be identical to HTML 4.01 but expressed in XML rather than SGML.

It hasn't been much of a success. Almost all the XHTML you find on the web is masquerading as HTML (that is, served as text/html) because Internet Explorer doesn't support the language yet (at the time of writing, the latest version of Internet Explorer is version 7). When it pretends to be HTML, the advantages of using XHTML on the client side go away.

Client Support in 2008

Most of the major browsers have supported XHTML for a while and, with Internet Explorer 8 on the horizon, it is just about possible that XHTML support will finally be added to the biggest of the Big Four. Assuming it is, will XHTML be feasible on the WWW?

Most authors care, first and foremost, that their pages work in web browsers, but there are other user agents out there. Search engine indexing robots come a close second to browsers: a very large number of authors want their pages to show up in search engines.

This raises the question: Do search engines support XHTML?

Experiments in Content-Type

I decided to do some experimentation. I created a set of test pages served as application/xhtml+xml (one well formed, and one with errors that should cause XML parsers to give up and stop processing), along with an index page served as text/html to act as a control. Then I waited for search engines to start finding them. If the control page shows up in results, then it is reasonable to assume that the other pages would also have shown up by then if they were ever going to.
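To make the difference between the two XHTML test pages concrete, here is a minimal sketch of a well-formedness check in Python. The two documents are hypothetical stand-ins, not the actual test pages, and the missing closing tag is just one assumed example of an error that makes a conforming XML parser stop.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the well-formed test page.
WELL_FORMED = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Test</title></head>
<body><p>Crazy dancing telephone people</p></body>
</html>"""

# Hypothetical broken variant: the <p> element is never closed.
NOT_WELL_FORMED = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Test</title></head>
<body><p>Crazy dancing telephone people</body>
</html>"""


def is_well_formed(markup):
    """Return True if an XML parser can read the whole document."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        # An XML parser gives up at the first well-formedness error.
        return False
    return True


print(is_well_formed(WELL_FORMED))
print(is_well_formed(NOT_WELL_FORMED))
```

An HTML parser would recover from the unclosed element and carry on, which is why the broken page is a useful probe: a robot that indexes it is probably not treating it as XML.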

After publishing my initial results, Mike Davies suggested that the apparent file extension in the URI might be a factor. I've added two new pages, still served as application/xhtml+xml but with .html file extensions. I haven't yet gathered any results from them though.
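The point that the content type and the file extension are independent can be sketched with a tiny WSGI application. This is an illustration only: the paths, page bodies and the `content_type_for` helper are all hypothetical, and it says nothing about how the real test site is configured.

```python
from wsgiref.util import setup_testing_defaults

# Hypothetical paths and bodies; the content type is chosen per page,
# not derived from the extension in the URI.
PAGES = {
    "/index.html": ("text/html", "<p>control page</p>"),
    "/good.xhtml": ("application/xhtml+xml", "<p>well formed</p>"),
    "/good.html": ("application/xhtml+xml", "<p>well formed</p>"),
}


def app(environ, start_response):
    """Serve each known path with its explicitly configured content type."""
    page = PAGES.get(environ.get("PATH_INFO", "/"))
    if page is None:
        start_response("404 Not Found", [("Content-Type", "text/plain; charset=utf-8")])
        return [b"Not found"]
    content_type, body = page
    data = body.encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", content_type + "; charset=utf-8"),
        ("Content-Length", str(len(data))),
    ])
    return [data]


def content_type_for(path):
    """Call the app in-process and return the Content-Type it sends."""
    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers):
        captured["headers"] = dict(headers)

    b"".join(app(environ, start_response))
    return captured["headers"]["Content-Type"]


print(content_type_for("/good.html"))
print(content_type_for("/index.html"))
```

A robot that decides how to handle a page from the `.html` in the URI rather than from the Content-Type header would treat `/good.html` and `/good.xhtml` differently, which is what the extra pages are meant to reveal.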

Here are the most recent results:

<table>
<caption>Results (as of February 3, 2008)</caption>
<thead>
<tr>
<th scope="col" rowspan="2">Search engine</th>
<th scope="colgroup" colspan="3">Indexed pages</th>
</tr>
<tr>
<th scope="col">Control</th>
<th scope="col">XHTML</th>
<th scope="col">Not Well-formed</th>
</tr></thead>
<tbody>
<tr>
<td scope="row"><a href="http://www.google.co.uk/search?q=site%3Astone.thecoreworlds.net+Crazy+dancing+telephone+people">Google</a></td>
<td>Yes</td>
<td>Yes</td>
<td>Yes</td>
</tr>
<tr>
<td scope="row"><a href="http://uk.search.yahoo.com/search;_ylt=A0geumL5rZdHLLoA4mJLBQx.?p=Crazy+dancing+telephone+people+site%3Astone.thecoreworlds.net&amp;y=Search&amp;fr=yfp-t-501&amp;ei=UTF-8&amp;rd=r1">Yahoo!</a></td>
<td>Yes</td>
<td>Yes</td>
<td>Yes</td>
</tr>
<tr>
<td scope="row"><a href="http://uk.altavista.com/web/results?sc=off&amp;q=Crazy+dancing+telephone+people+domain%3Astone.thecoreworlds.net">Altavista</a></td>
<td>Yes</td>
<td>Yes</td>
<td>Yes</td>
</tr>
<tr>
<td scope="row"><a href="http://search.live.com/results.aspx?q=Crazy+dancing+telephone+people&amp;go=Search&amp;form=QBRE">LiveSearch</a></td>
<td>Yes</td>
<td>No</td>
<td>No</td>
</tr>
<tr>
<td scope="row"><a href="http://uk.ask.com/web?qsrc=167&amp;o=312&amp;l=dir&amp;siteid=&amp;q=%22Crazy+dancing+telephone+people%22&amp;search=search&amp;dm=all">Ask.com</a></td>
<td>No</td>
<td>No</td>
<td>No</td>
</tr>
</tbody>
</table>

Additionally, Majestic-12 explicitly states that it does not support anything but text/html and text/plain.

Showing XHTML Pages in Search Results Would Be Silly

It isn't really a surprise that search engines don't show XHTML pages in their results: Internet Explorer users would be presented with a download dialogue box instead of a rendered page if they followed the links, and users of other browsers would get error messages from the pages that are not well formed. But perhaps they are indexing the pages anyway. If XHTML support becomes common among the browsers that people use, then being in a position to flick a switch and add XHTML results would be an advantage to any search engine.

This leads into my next experiment: I'm currently collecting information about the Accept headers from various search engine indexing bots to see if any of them are asking for XHTML. This is less relevant than it was before the results of the earlier experiment changed to reveal that XHTML documents were being indexed, but might yield something of interest.
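As a rough sketch of what "asking for XHTML" could mean in practice, the following Python parses an Accept header and checks whether application/xhtml+xml is listed with a non-zero quality value. The function names and sample headers are made up for illustration and are not recorded from any real bot. The check deliberately ignores wildcards such as `*/*`, on the assumption that a wildcard says little about genuine XHTML support.

```python
def parse_accept(header):
    """Map each media type in an Accept header to its q value (default 1.0)."""
    prefs = {}
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        if not media_type:
            continue
        q = 1.0
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    # Treat an unparseable q value as "not acceptable".
                    q = 0.0
        prefs[media_type] = q
    return prefs


def asks_for_xhtml(header):
    """True only if application/xhtml+xml is named explicitly with q > 0."""
    return parse_accept(header).get("application/xhtml+xml", 0.0) > 0


# Illustrative headers only.
print(asks_for_xhtml("text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"))
print(asks_for_xhtml("text/html, text/plain"))
```

Running something like this over the Accept values found in a server log would give a quick tally of which robots name XHTML explicitly.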

History of These Experiments

23 January 2008 My initial post
29 January 2008 This page created
File extensions and accept headers
3 February 2008 Initial investigation incorrect

You can follow updates through the news feed of my blog.