Let the flame wars begin

Mark Pilgrim calls Kevin Burton "the most dangerous kind of idiot" and of course Kevin cannot help but respond.

One of the issues they disagree upon is how to handle RSS feeds that are not valid XML. Mark wants to parse at all costs by writing an RSS-parser that uses regular expressions instead of an existing XML parser, while Kevin's approach is to simply remove invalid characters and keep using a standard XML parser.

I have given this issue some thought myself and have gone back and forth between writing my own parser and trying to fix the RSS document so that a standard XML parser can handle it. I'd handle either approach slightly differently from Mark or Kevin, though.

If attempting to fix the RSS document, I would not just remove illegal characters or unescaped ampersands, but rather replace them with XML-escaped HTML-entities (&amp;<num>;), or possibly simply embed the surrounding RSS-section in a CDATA section. The question, though, is which approach to use when: a CDATA section would fix unescaped ampersands, but still does not allow certain characters. At the same time, replacing illegal characters with entities would not work if an RSS document contains embedded HTML that is not valid XML; only a CDATA section would work for this.
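To make the escape-instead-of-remove idea concrete, here is a minimal sketch in Python. It assumes xml.etree.ElementTree as the standard parser; the names repair and parse_with_repair are hypothetical, and writing an illegal character as an escaped numeric reference (&amp;#7; for character 7) is only my reading of what the escaped form would look like. It does not attempt the CDATA variant.

```python
import re
import xml.etree.ElementTree as ET

# Control characters that XML 1.0 does not allow in a document.
ILLEGAL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

# An "&" that does not start one of the five predefined XML entities
# or a numeric character reference.
BARE_AMPERSAND = re.compile(
    r"&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)"
)


def repair(text):
    """Escape, rather than delete, the parts a strict parser rejects."""
    # Illegal characters become escaped numeric references, e.g. "&amp;#7;".
    text = ILLEGAL_CHARS.sub(lambda m: "&amp;#%d;" % ord(m.group()), text)
    # Bare ampersands (and HTML-only entities such as "&nbsp;") get escaped.
    return BARE_AMPERSAND.sub("&amp;", text)


def parse_with_repair(text):
    """Try the feed as-is first; repair and retry only if that fails."""
    try:
        return ET.fromstring(text)
    except ET.ParseError:
        return ET.fromstring(repair(text))
```

A feed whose title is "Tom & Jerry" would parse after the repair with the ampersand intact. Embedded HTML with unclosed tags would still fail here, which is exactly the case where only CDATA helps.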

If writing my own parser, I would want to be able to reuse my existing RSS-parsing logic, instead of re-implementing it in an ultra-liberal RSS-parser class. I therefore think it would be better to create an IXMLParser interface, and implement it with two classes: one that simply delegates to an existing XML parser class, and an ultra-liberal one that would use regular expressions. Try the standard one first, and if that fails, try again using the ultra-liberal XML parser. This also addresses one of Kevin's arguments against the roll-your-own-RSS-parser approach, which is that you may need to do ultra-liberal parsing for other aggregation formats as well. By creating an ultra-liberal XML parser (instead of an RSS parser), it becomes a lot more reusable.
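A rough sketch of that design in Python follows. Only the IXMLParser name comes from the idea above; the element_texts method, the three class names, and item_titles are hypothetical, and a real interface would need far more than one method. The liberal implementation is deliberately naive: it returns raw text between tags and does not decode entities.

```python
import re
import xml.etree.ElementTree as ET
from abc import ABC, abstractmethod


class IXMLParser(ABC):
    """Minimal interface the RSS logic is written against (hypothetical)."""

    @abstractmethod
    def element_texts(self, text, tag):
        """Return the text content of every <tag> element in the document."""


class StandardXMLParser(IXMLParser):
    """Delegates to an existing strict XML parser."""

    def element_texts(self, text, tag):
        return [e.text or "" for e in ET.fromstring(text).iter(tag)]


class LiberalXMLParser(IXMLParser):
    """Regex-based; never rejects a document, but knows little about XML."""

    def element_texts(self, text, tag):
        name = re.escape(tag)
        pattern = r"<%s(?:\s[^>]*)?>(.*?)</%s\s*>" % (name, name)
        return re.findall(pattern, text, re.S)


class FallbackXMLParser(IXMLParser):
    """Tries the standard parser first, the liberal one only on failure."""

    def __init__(self):
        self.strict = StandardXMLParser()
        self.liberal = LiberalXMLParser()

    def element_texts(self, text, tag):
        try:
            return self.strict.element_texts(text, tag)
        except ET.ParseError:
            return self.liberal.element_texts(text, tag)


def item_titles(text, parser):
    """RSS-specific logic; it only knows about the IXMLParser interface."""
    return parser.element_texts(text, "title")
```

Because item_titles never sees which parser it was given, the same liberal parser could back logic for other aggregation formats without any changes to it.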