London Perl Workshop 2007: Markup Validation and Perl

03 December 2007
    <p>
        Having lived in London for about six months, I've finally made it to a <a href="http://london.pm.org/">London.pm</a> event. Saturday 1st December was the annual <a href="http://conferences.yapceurope.org/lpw2007/main">London Perl Workshop</a>, a day of talks related to Perl organised by the London Perl Mongers with free admission. I highly recommend it to anyone who can make it.
    </p>
    <p>
        This year I made the mistake of popping into the <a href="irc://irc.perl.org/#london.pm">#london.pm IRC channel</a> just in time to see Greg mention that he was looking for three more lightning talks. Somehow my fingers decided to volunteer me without engaging my brain. So I found myself recycling my <a href="http://dorward.me.uk/talks/2005/validator/">2005 talk on the relationship between Perl and the W3C Markup Validation Service</a> - with four days to prepare.
    </p>
    <p>
        I didn't want to reuse the slides I used last time, since they suffer somewhat from containing most of the content of the talk in pure text, and I found myself too distracted to put together new slides, so I ended up running it without visual aids (to the delight of acme).
    </p>
    <p>
        I don't speak in public very often, and somehow my talk managed to mutate from "Not enough to fill the time" when I was going over it to "Too long" when I actually did it. This resulted in the last minute turning into blind panic (I've never run out of time before) so here I present the text that I intended to use. If you saw me speak, you might find it a little more intelligible, and if you didn't, then you might just find it interesting.
    </p>
    <p>
        We don't like bugs, so when we write Perl we try to avoid them with <code>use strict</code> and <code>use warnings</code> so that perl complains if we do something problematic. Dealing with <a href="http://en.wikipedia.org/wiki/Markup_language"><dfn>markup</dfn></a>, and HTML in particular, is a different story. We have lots of different parsers, and much of the time they attempt to perform enormous amounts of error recovery, so that they can show the user something that at least resembles what they are looking for. The XML specification requires parsers to <a href="http://www.w3.org/TR/xml/#sec-terminology">stop normal processing</a> when they run into a well-formedness error, but that <a href="http://www.xml.com/pub/a/2003/01/22/dive-into-xml.html">doesn't stop some parsers from carrying on anyway</a>, nor does it deal with problems where there is an error in the choice of which elements and attributes to use and where to put them.
    </p>
    <p>
        This brings us to the wonderful world of <a href="http://www.w3.org/TR/html4/intro/sgmltut.html#h-3.3">DTDs</a> and validation. A DTD is a machine-readable file that describes what elements and attributes can be used in a document type, and where they can be used, along with one or two other things. It isn't perfect (some things can't be expressed in it), but it is simple and makes for an excellent base for a first level of quality assurance. The DTD can be read by a validator, the most famous of which is probably the <a href="http://validator.w3.org/">W3C Markup Validation Service</a>, and compared to a markup document.
    </p>
    <p>
        Large chunks of the W3C Markup Validator are written in Perl, and it is released under a free license, so you can <a href="http://validator.w3.org/source/" title="Download the W3C Markup Validation Service source code">download it</a>, run it locally and hack at the source code.
    </p>
    <p>
        As well as the well-known <a href="http://validator.w3.org/">web form interface</a>, the Validator provides <a href="http://validator.w3.org/docs/api.html">a few other methods of getting your code tested</a>, including a <a href="http://validator.w3.org/docs/api.html#soap12format">SOAP interface</a>. We don't all want to be writing SOAP though, not when we have a friendly CPAN module to install. <a href="http://search.cpan.org/~struan/WebService-Validator-HTML-W3C-0.22/lib/WebService/Validator/HTML/W3C.pm">WebService::Validator::HTML::W3C</a> takes all the pain out of accessing the web service. Unsurprisingly, it lets you specify which server to talk to, so you can use your fast local server rather than the heavily subscribed public instance hosted at the W3C.
    </p>
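    <p>
        A minimal sketch of what that looks like, assuming a local Validator install answering at <code>http://localhost/w3c-validator/check</code> (a made-up address; substitute your own) and a placeholder page URI. The <code>detailed</code> option asks for per-error line numbers and messages, which I believe needs XML::XPath installed:
    </p>
    <pre><code>use strict;
use warnings;
use WebService::Validator::HTML::W3C;

# Assumption: a local Validator install lives at this address.
my $validator = WebService::Validator::HTML::W3C->new(
    detailed      => 1,
    validator_uri => 'http://localhost/w3c-validator/check',
);

# Hypothetical page to check.
my $page = 'http://www.example.com/';

if ( $validator->validate($page) ) {
    if ( $validator->is_valid ) {
        print "$page is valid\n";
    }
    else {
        print "$page is not valid\n";
        for my $error ( @{ $validator->errors } ) {
            printf "line %d: %s\n", $error->line, $error->msg;
        }
    }
}
else {
    # The request itself failed (server down, timeout and so on).
    warn 'Could not validate: ', $validator->validator_error, "\n";
}
</code></pre>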
    <p>
        So now we know how to programmatically validate markup. So what? Well, we can use a little loop to batch validate documents, glue a validation phase into our content management systems, and generally test any HTML or other markup that comes into our systems.
    </p>
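    <p>
        The "little loop" might look something like this sketch, which checks each file named on the command line and exits non-zero if any of them fail. The Validator address is an assumption (a hypothetical local install), and the reporting is deliberately bare:
    </p>
    <pre><code>use strict;
use warnings;
use WebService::Validator::HTML::W3C;

# Assumption: a local Validator install lives at this address.
my $validator = WebService::Validator::HTML::W3C->new(
    validator_uri => 'http://localhost/w3c-validator/check',
);

my $failures = 0;

for my $file (@ARGV) {
    if ( !$validator->validate_file($file) ) {
        # We never got a verdict, so count it as a failure.
        warn "$file: could not validate: ", $validator->validator_error, "\n";
        $failures++;
    }
    elsif ( $validator->is_valid ) {
        print "$file: valid\n";
    }
    else {
        printf "%s: %d error(s)\n", $file, $validator->num_errors;
        $failures++;
    }
}

# A non-zero exit status lets a build script or commit hook notice.
exit( $failures ? 1 : 0 );
</code></pre>
    <p>
        Run it as, say, <code>perl batch-validate.pl *.html</code> (the script name is made up).
    </p>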
    <p>
        Now, let's say that we have a website. It is built from a few templates, written by people who are really good at HTML, CSS and JavaScript, and some content, written by people who have had some basic training in HTML.
    </p>
    <p>
        Do we want these hypothetical content producers loose on the page without any restrictions? Possibly not, but since we can validate against any DTD we like, we can create a cut-down HTML-lite.
    </p>
    <p>
        So let's say goodbye to the script element, the style attribute, and anything else we don't want appearing in the pages, and then we can let them loose on it.
    </p>
    <p>
        Sounds nice? Well, it does have some limitations.
    </p>
    <p>
        The first is that what can be expressed in a DTD is relatively limited. We can't, for instance, ban URIs which use the <code>javascript:</code> pseudo-scheme, so while validation makes a great start to QA, it doesn't end it. However, once the input is valid, you've got a host of other tools that can be used to do further checking. I'm quite fond of <a href="http://search.cpan.org/~petek/HTML-Tree-3.23/lib/HTML/TreeBuilder.pm">HTML::TreeBuilder</a>, which isn't the fastest module on CPAN, but gets a decent node tree in memory that can be searched for things you want to avoid.
    </p>
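    <p>
        As a rough illustration, here is one way to hunt for <code>javascript:</code> URIs with HTML::TreeBuilder. The <code>javascript_links</code> function is my own invention for this sketch, and the check is a simple one rather than an exhaustive defence against obfuscated URIs:
    </p>
    <pre><code>use strict;
use warnings;
use HTML::TreeBuilder;

# Return a short description of every link that uses the
# javascript: pseudo-scheme (hypothetical helper for this sketch).
sub javascript_links {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my @found;

    # extract_links covers href, src and the other link attributes.
    for my $link ( @{ $tree->extract_links } ) {
        my ( $uri, $element ) = @$link;

        # Some browsers tolerate stray whitespace and control
        # characters in the scheme, so strip them before matching.
        ( my $squashed = $uri ) =~ s/[\x00-\x20]+//g;

        push @found, $element->tag . ": $uri"
            if $squashed =~ /^javascript:/i;
    }

    $tree->delete;    # free the tree explicitly
    return @found;
}

# Read a document on standard input and list anything suspicious.
my $html = join '', &lt;STDIN&gt;;
print "$_\n" for javascript_links($html);
</code></pre>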
    <p>
        The second limitation is that some of the error messages output by the validator can be a little hard to interpret if you don't have a really good understanding of the language. The Validator has some very detailed explanations for the usual suspects for most errors, but sometimes authors need to consult with an expert.
    </p>
    <p>
        This is all very well if you are a content producer for an organisation with some good web developers on hand to help you out when you run into trouble, but what if we were attaching a validation routine to the comment system on a public website? Joe Random Visitor probably doesn't have a web development team to call on for aid.
    </p>
    <p>
        Enter <a href="http://search.cpan.org/~petdance/HTML-Tidy-1.08/lib/HTML/Tidy.pm">HTML::Tidy</a>, a wrapper around another project that was kicked off at the W3C. Tidy is an HTML lint and formatter that can also attempt to repair bad HTML.
    </p>
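    <p>
        A small sketch of both uses, linting and repairing, run against a file named on the command line. It uses Tidy's default configuration, and I'm assuming an HTML::Tidy recent enough to provide <code>clean</code>:
    </p>
    <pre><code>use strict;
use warnings;
use HTML::Tidy;

my $file = shift @ARGV or die "Usage: $0 file.html\n";

# Slurp the whole document into memory.
open my $fh, '&lt;', $file or die "Cannot open $file: $!\n";
my $html = do { local $/; &lt;$fh&gt; };
close $fh;

my $tidy = HTML::Tidy->new;

# Lint: collect Tidy's complaints about the markup as written.
$tidy->parse( $file, $html );
warn $_->as_string, "\n" for $tidy->messages;

# Repair: print Tidy's best guess at a cleaned-up version.
print $tidy->clean($html);
</code></pre>
    <p>
        The complaints go to standard error and the repaired markup to standard output, so the two can be separated on the command line.
    </p>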
    <p>&hellip; and then wrap up.</p>