Double encoded UTF-8 strings in C#

This article shows how to repair a string that has been encoded as UTF-8 twice, so that the original text is recovered.

For example, say you have the string MÃ¼ller where you expected the string Müller.

How did it happen?

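The letter ü is encoded in UTF-8 as 2 bytes: 195 and 188.

If those bytes are then treated as two separate characters and encoded as UTF-8 a second time, the 195 becomes the bytes 195 and 131, which is the UTF-8 encoding of Ã.

Likewise, the 188 becomes the bytes 194 and 188, which is the UTF-8 encoding of ¼.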
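The byte values above can be reproduced with a minimal sketch. The class and method names here (DoubleEncodingDemo, DoubleEncode, DoubleEncodedBytes) are made up for illustration, and the sketch assumes the accidental intermediate step reads each byte as one ISO-8859-1 character, which matches the numbers in this article.

```csharp
using System;
using System.Text;

// Hypothetical demo class; the names are illustrative only.
public static class DoubleEncodingDemo
{
    // Simulates the mistake: UTF-8 bytes are misread as ISO-8859-1 characters.
    public static string DoubleEncode(string original)
    {
        Encoding iso = Encoding.GetEncoding("ISO-8859-1");
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(original); // "ü" gives 195, 188
        return iso.GetString(utf8Bytes);                     // one character per byte
    }

    // The bytes that end up stored once the misread string is encoded again.
    public static byte[] DoubleEncodedBytes(string original)
    {
        return Encoding.UTF8.GetBytes(DoubleEncode(original)); // "ü" gives 195, 131, 194, 188
    }
}
```

Calling DoubleEncodingDemo.DoubleEncode("Müller") should return the damaged string MÃ¼ller.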

How can it be converted back to what it should look like?

The following function converts a double encoded string back to its original value:

 

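private string decodeUTF8String(String utf8Str)
{
    // ISO-8859-1 maps each character U+0000 to U+00FF to the single byte with the same value.
    System.Text.Encoding iso = System.Text.Encoding.GetEncoding("ISO-8859-1");
    System.Text.Encoding utf8 = System.Text.Encoding.UTF8;

    // Encode the double encoded string as UTF-8, then convert those bytes to ISO-8859-1.
    // This undoes the second encoding and leaves the original UTF-8 bytes.
    byte[] utfBytes = utf8.GetBytes(utf8Str);
    byte[] isoBytes = System.Text.Encoding.Convert(utf8, iso, utfBytes);

    // Decode the recovered bytes as UTF-8 to get the original text.
    return utf8.GetString(isoBytes);
}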
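One caveat: if the function above is given a string that was not double encoded (for example a correct Müller), the final UTF-8 decode meets invalid bytes and substitutes replacement characters. The following is a sketch of a more defensive variant, not part of the original function. The names Utf8Repair and TryDecodeDoubleEncoded are hypothetical. It uses encodings configured to throw on invalid data, so the input is returned unchanged when it does not look double encoded.

```csharp
using System;
using System.Text;

// Hypothetical helper; the class and method names are illustrative only.
public static class Utf8Repair
{
    public static bool TryDecodeDoubleEncoded(string input, out string result)
    {
        // Strict encodings throw instead of silently substituting '?' or U+FFFD.
        Encoding strictIso = Encoding.GetEncoding(
            "ISO-8859-1",
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        Encoding strictUtf8 = new UTF8Encoding(false, true);

        try
        {
            // Fails if the input contains any character above U+00FF.
            byte[] isoBytes = strictIso.GetBytes(input);
            // Fails if the recovered bytes are not valid UTF-8.
            result = strictUtf8.GetString(isoBytes);
            return true;
        }
        catch (EncoderFallbackException)
        {
            result = input;
            return false;
        }
        catch (DecoderFallbackException)
        {
            result = input;
            return false;
        }
    }
}
```

A plain ASCII string passes through unchanged and still returns true, because ASCII bytes are valid in both encodings. The check also cannot tell apart text that is valid under both interpretations, so treat it as a heuristic rather than a guarantee.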

 

How does this relate to barcodes?

Some PDF-417 barcodes contain data that has already been encoded using UTF-8. When we read such a barcode we encode the data again using UTF-8, which gives a double encoded string. In the win32 DLL interface the work-around is simply to set the Encoding property to 0, but in the .NET interface the conversion shown above is necessary.