public virtual MemoryStream: .NET XML Serialization Bug

Thursday, December 18, 2003 12:07 AM

.NET XML Serialization Bug

There is a bug in .NET's XML Serialization routines: they can serialize an object to XML that will subsequently throw an exception when you attempt to deserialize it. This happens when a string being serialized contains an unprintable character like char(1). It gets serialized to &#x1;, but fails on deserialization :-(

The following code will reproduce this bug:

using System;
using System.Xml.Serialization;
using System.IO;

public class XmlSerializationTest
{
  public static void Main()
  {
    try
    {
      // A one-character string holding the control character U+0001
      String s = ""+(char)1;
      StringWriter sw = new StringWriter();
      XmlSerializer xs = new XmlSerializer(typeof(String));
      // Serialization succeeds and writes the character as &#x1;
      xs.Serialize(sw, s);
      System.Console.WriteLine("String encoded to XML = \n{0} \n", sw.ToString());
      StringReader sr = new StringReader(sw.ToString());
      // Deserializing the XML we just produced is what throws
      String s2 = (String)xs.Deserialize(sr);
      System.Console.WriteLine("String decoded from XML = \n {0}", s2);
    }
    catch (Exception e)
    {
      System.Console.WriteLine(e);
    }
  }
}

Running this will produce the following output:

String encoded to XML =
<?xml version="1.0" encoding="utf-16"?>
<string>&#x1;</string>

System.InvalidOperationException: There is an error in XML document (2, 2). --->
 System.Xml.XmlException: '', hexadecimal value 0x01, is an invalid character.
Line 2, position 12.
   at System.Xml.XmlScanner.ScanHexEntity()
   at System.Xml.XmlTextReader.ParseBeginTagExpandCharEntities()
   at System.Xml.XmlTextReader.Read()
   at System.Xml.XmlReader.ReadElementString()
   at System.Xml.Serialization.XmlSerializationReader.ReadNullableString()
   at Microsoft.Xml.Serialization.GeneratedAssembly.XmlSerializationReader1.Read2_string()
   --- End of inner exception stack trace ---
   at System.Xml.Serialization.XmlSerializer.Deserialize(XmlReader xmlReader, String encodingStyle)
   at System.Xml.Serialization.XmlSerializer.Deserialize(TextReader textReader)
   at XmlSerializationTest.Main()

whoops...

Unfortunately, this bug affects SharpReader since it uses XML Serialization to save and load its data. As soon as a feed item contains a char(1) (like this entry does), a subsequent save and load of the containing feed will blow up and SharpReader will lose all entries for that feed :-(

Now the good news is that before throwing away the unreadable XML, SharpReader does make a backup copy of it. I'll try to add some code to handle this problem by either removing or escaping the troublesome entities in order to make the files readable again. I should be able to scan for .error files and, if fixable, merge their contents with the current feed data...
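As a rough idea of what the "removing" option could look like, here is a minimal sketch that strips numeric character references pointing at characters XML 1.0 does not allow. The class and method names are made up for illustration and are not SharpReader code, and the regular expression is deliberately naive: it would also touch look-alike text inside CDATA sections or comments.

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of SharpReader or the .NET Framework.
public static class XmlEntityRepair
{
    // Matches numeric character references such as &#1; or &#x1F;
    private static readonly Regex NumericReference =
        new Regex(@"&#(?:x([0-9A-Fa-f]+)|([0-9]+));");

    // True if the code point is allowed by the XML 1.0 Char production.
    private static bool IsLegalXmlCodePoint(long cp)
    {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Removes references to illegal characters and leaves all other text untouched.
    public static string StripIllegalReferences(string xml)
    {
        return NumericReference.Replace(xml, m =>
        {
            bool isHex = m.Groups[1].Success;
            string digits = isHex ? m.Groups[1].Value : m.Groups[2].Value;
            NumberStyles style = isHex ? NumberStyles.AllowHexSpecifier : NumberStyles.None;
            long codePoint;
            // A reference too large to parse cannot be a legal character either.
            bool parsed = long.TryParse(digits, style, CultureInfo.InvariantCulture, out codePoint);
            return parsed && IsLegalXmlCodePoint(codePoint) ? m.Value : "";
        });
    }
}
```

The contents of a backed-up .error file could be passed through StripIllegalReferences before being handed to the deserializer again. Dropping the characters loses data, of course; escaping them into some private notation would be the alternative.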


Comments

Have you tested with forcing serialization to try other encodings? i.e. UTF8 instead of 2-byte Unicode?

Posted by Ilya Haykinson at December 18, 2003 3:46 AM

I have not looked at the XML 1.0 spec in a while, but I believe that characters below hex 20, with the exception of TAB, CR, and LF, are illegal in XML. If that is correct, it doesn't surprise me that it has a problem deserializing Ctrl-A. Chalk it up to a failure by Microsoft's programmers to check on serialization and throw an error. The deserialization error message appears correct to me, given what I believe the XML 1.0 spec says.

Posted by Andrew Houghton at December 18, 2003 7:27 AM

After playing with one of my programs for a while, I was unable to reproduce this error. However, unlike you, I serialize with UTF-16 encoding, not UTF-8.

Posted by Chris at December 18, 2003 7:41 AM

Ilya: SharpReader uses utf-8 and has the same problem.

Chris: My little test-app above did not specify an encoding and generated utf-16, as you can see in the generated xml in my post:

<?xml version="1.0" encoding="utf-16"?>
<string>&#x1;</string>

Are you saying you tried the code above on your system and it does not blow up for you?

Posted by Luke Hutteman at December 18, 2003 8:32 AM

Andrew: You're right. The spec states:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

So technically, the bug is in .NET's serialization code, not its deserialization code.

It's still a bug though - if they serialize something, they should be able to deserialize that as well.

Personally, I'd prefer it if they were slightly spec-noncompliant on this and fixed it in the deserialization code instead of throwing an exception on the serialization end. Otherwise, some strings would be unserializable, and in complex objects it would be hard to figure out which string contained the illegal characters.
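To illustrate the "which string is it?" problem, here is a small sketch that checks a single string against the Char production quoted above and reports the position of the first offending character. The helper name is invented for this example; nothing like it is claimed to exist in the serializer itself.

```csharp
// Hypothetical helper for illustration only.
public static class XmlCharCheck
{
    // Returns the index of the first character that the XML 1.0 Char
    // production does not allow, or -1 if the whole string is legal.
    public static int IndexOfIllegalXmlChar(string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            char c = s[i];

            // A well-formed surrogate pair encodes #x10000-#x10FFFF, which is legal.
            if (char.IsHighSurrogate(c) && i + 1 < s.Length && char.IsLowSurrogate(s[i + 1]))
            {
                i++; // skip the low surrogate as well
                continue;
            }

            bool legal = c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || (c >= 0xE000 && c <= 0xFFFD);

            // Unpaired surrogates fall outside the ranges above and are reported here too.
            if (!legal)
            {
                return i;
            }
        }
        return -1;
    }
}
```

Calling this on every string property before serializing would at least tell you where the problem is, but doing that by hand for a complex object graph is exactly the kind of chore I'd rather not have.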

Posted by Luke Hutteman at December 18, 2003 8:49 AM

Change your code to

String s2 = (String)xs.Deserialize(new System.Xml.XmlTextReader(sr));

and it should work fine.
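For reference, a complete round trip using that change might look like the sketch below. It is the test program from the post rearranged into a helper method (the class and method names are illustrative), and it assumes the XmlTextReader defaults of the .NET Framework version discussed here.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Serialization;

public static class XmlTextReaderRoundTrip
{
    // Serializes a string to XML and reads it back through an explicitly
    // constructed XmlTextReader instead of Deserialize(TextReader).
    public static string RoundTrip(string value)
    {
        XmlSerializer xs = new XmlSerializer(typeof(string));

        StringWriter sw = new StringWriter();
        xs.Serialize(sw, value);

        using (StringReader sr = new StringReader(sw.ToString()))
        {
            // The only change from the original test program is this wrapper.
            XmlTextReader reader = new XmlTextReader(sr);
            return (string)xs.Deserialize(reader);
        }
    }

    public static void Main()
    {
        string original = "" + (char)1;
        string copy = RoundTrip(original);
        Console.WriteLine("Round trip preserved the string: {0}", copy == original);
    }
}
```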

Posted by Dare Obasanjo at December 18, 2003 11:32 AM

Why can't you use binary serialization? Is it also affected by this?

Posted by Rob Chartier at December 18, 2003 12:41 PM

Dare: thanks, that works.

I see that the difference between Deserialize(TextReader) and Deserialize(XmlReader) is that the TextReader variant creates an XmlTextReader with its Normalization property set to true, whereas this property defaults to false for the XmlTextReader.

The docs do indeed state

If Normalization is set to false, this also disables character range checking for numeric entities. As a result, character entities, such as &#0;, are allowed.

Considering entities like &#x1; can be generated by .NET's XML Serialization, is there any particular reason deserialization from a TextReader chose to set Normalization to true?
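The effect of the property can be seen without the serializer at all. The sketch below reads the same one-element document twice with a bare XmlTextReader, once with Normalization switched on and once with it off. It assumes the behavior described in the documentation quoted above; the class and method names are just for illustration.

```csharp
using System;
using System.IO;
using System.Xml;

public static class NormalizationDemo
{
    // Reads the text of the root element with the given Normalization setting.
    public static string ReadElementText(string xml, bool normalization)
    {
        XmlTextReader reader = new XmlTextReader(new StringReader(xml));
        reader.Normalization = normalization;
        reader.MoveToContent();
        return reader.ReadElementString();
    }

    public static void Main()
    {
        string xml = "<string>&#x1;</string>";

        // With Normalization off, the character range check is skipped.
        string text = ReadElementText(xml, false);
        Console.WriteLine("Normalization = false: read {0} character(s)", text.Length);

        // With Normalization on, the reader is expected to reject &#x1;
        try
        {
            ReadElementText(xml, true);
            Console.WriteLine("Normalization = true: no exception");
        }
        catch (XmlException e)
        {
            Console.WriteLine("Normalization = true: {0}", e.Message);
        }
    }
}
```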

Posted by Luke Hutteman at December 18, 2003 1:18 PM

I believe there is a patch for this through MS. You have to contact them though; you can't just download it. Here is the link:

Link

Feel free to email me and I will send it to you.

Cleve Littlefield

Posted by Cleve Littlefield at December 18, 2003 1:28 PM

I encountered this error when I serialized the contents of a richtextbox into XML - if you read out .Rtf it contains a zero as last character, which will be converted to xml just fine but will throw on deserialisation.

Sam

Posted by Sam at December 18, 2003 4:05 PM

Will SharpReader support Atom in the near future?

Congrats for the nice app.

Posted by Sérgio Nunes at December 28, 2003 11:52 AM

When is an XmlTextReader not an XmlTextReader?

Trackback from The Chicken Coop at June 16, 2005 1:13 AM
