How to convert UTF-8 data into a String

If you have an array of UTF-8 bytes and want to convert them into a String then the following may help…

As you may know, UTF-8 is a way of encoding every character in the Unicode character set using a variable number of bytes per character. For example, the letter A needs just 1 byte, but the character 끰 (code point U+B070) requires 3 bytes (EB 81 B0).

In VB.Net this is pretty straightforward…

Start out with the 3 bytes in an array…

Dim bytes() As Byte = New Byte() {&HEB, &H81, &HB0}
Dim str As String = System.Text.Encoding.UTF8.GetString(bytes)

And now str contains 끰
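
The byte counts come from UTF-8's bit layout, which the following C++ sketch spells out. EncodeUtf8 is a hypothetical helper written for this illustration, not a library function. For 끰 (U+B070) it produces the three bytes EB 81 B0; the value B0 70 is the code point itself rather than its UTF-8 form.

```cpp
#include <cstdint>
#include <string>

// Encode one Unicode code point as UTF-8 bytes.
// Hypothetical helper for illustration only. Returns an empty string for
// values that are not valid characters (surrogates or above U+10FFFF).
std::string EncodeUtf8(std::uint32_t cp)
{
    std::string out;
    if (cp < 0x80) {                     // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {             // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {           // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return out;                  // surrogates are not characters
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {         // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Calling EncodeUtf8(0x41) gives the single byte 41 (the letter A), while EncodeUtf8(0xB070) gives EB 81 B0.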

But what if you started out with an IntPtr to an unmanaged C-style string?

In that case you would need to marshal the data into a byte array and then do the above, as in the following function…

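    Public Function ConvertUTF8IntPtrtoString(ByVal ptr As System.IntPtr) As String
        ' Count the bytes up to (not including) the terminating null.
        ' PtrToStringAnsi(ptr).Length counts ANSI characters, which need not match the byte count.
        Dim l As Integer = 0
        While System.Runtime.InteropServices.Marshal.ReadByte(ptr, l) <> 0
            l += 1
        End While
        ' VB array bounds are inclusive, so (l - 1) gives exactly l elements
        Dim utf8data(l - 1) As Byte
        System.Runtime.InteropServices.Marshal.Copy(ptr, utf8data, 0, l)
        Return System.Text.Encoding.UTF8.GetString(utf8data)
    End Function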

 

And the following C++ function will do the same in MFC:

 

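int CSampleBarcodeReaderDlg::ConvertUTF8Value(LPCSTR in, CString &out)
{
   // First call: get the required buffer size in wide characters,
   // including the terminating null (because the input length is -1)
   int l = MultiByteToWideChar(CP_UTF8, 0, in, -1, NULL, 0);
   if (l == 0)
      return 0 ;   // conversion failed; see GetLastError()
   wchar_t *str = new wchar_t[l];
   // Second call: convert into the buffer
   int r = MultiByteToWideChar(CP_UTF8, 0, in, -1, str, l);
   if (r != 0)
      out = str;
   delete[] str ;   // array form, to match new[]
   return r ;
}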
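
To see roughly what such a conversion does with the bytes, here is a portable C++ sketch that decodes UTF-8 into Unicode code points without any Windows API. DecodeUtf8 is a hypothetical helper for illustration, and it is deliberately simplified: malformed input becomes U+FFFD, but overlong forms and encoded surrogates are not rejected, so it is not a substitute for a real decoder.

```cpp
#include <cstddef>
#include <string>

// Decode UTF-8 bytes into Unicode code points (hypothetical helper).
// Malformed sequences are replaced with U+FFFD. Simplification: overlong
// forms and encoded surrogates are accepted rather than rejected.
std::u32string DecodeUtf8(const std::string &in)
{
    const char32_t kReplacement = 0xFFFD;
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        int extra = 0;                       // continuation bytes expected
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else {                               // stray continuation or invalid lead byte
            out += kReplacement;
            ++i;
            continue;
        }
        ++i;
        bool ok = true;
        for (int k = 0; k < extra; ++k) {
            // Each continuation byte must look like 10xxxxxx
            if (i >= in.size() ||
                (static_cast<unsigned char>(in[i]) & 0xC0) != 0x80) {
                ok = false;
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
            ++i;
        }
        out += ok ? cp : kReplacement;
    }
    return out;
}
```

Feeding it the bytes EB 81 B0 yields the single code point U+B070, whereas the two bytes B0 70 are not valid UTF-8 and come back as U+FFFD followed by the letter p.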

 

A frustrating twist on the above is when the UTF-8 bytes have already ended up in a String, one byte per character, and you would like to convert it to a normal String. There are probably smarter ways of doing this but here’s a take on it…

In this example utf8 starts out as a string in which each character holds one byte of UTF-8 data. Each character is converted to a byte, and the resulting byte array is then decoded back to a String using UTF-8 encoding. In this case str ends up holding the character that those bytes encode. Note that System.Convert.ToByte throws an OverflowException for any character above 255, so this only works when every character really is a single byte value.

 

        Dim utf8 As String = "ç?³"
        Dim ch() As Char = utf8.ToCharArray()
        ' VB array bounds are inclusive; (ch.Length - 1) avoids a trailing zero byte
        Dim bytes(ch.Length - 1) As Byte
        For i = 0 To (ch.Length - 1)
            bytes(i) = System.Convert.ToByte(ch(i))
        Next
        Dim str As String = System.Text.Encoding.UTF8.GetString(bytes)
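
The character-to-byte step can also be sketched in C++, here using std::u16string to stand in for a string of 16-bit characters. NarrowToBytes is a hypothetical helper for illustration; instead of throwing, it returns false when a character does not fit in a byte.

```cpp
#include <string>

// Narrow a string whose code units each hold one raw byte value (0..255)
// back into the byte sequence it represents (hypothetical helper).
// Returns false if any unit is above 0xFF, i.e. the input is not a
// byte-per-character string.
bool NarrowToBytes(const std::u16string &in, std::string &out)
{
    out.clear();
    for (char16_t unit : in) {
        if (unit > 0xFF)
            return false;
        out += static_cast<char>(static_cast<unsigned char>(unit));
    }
    return true;
}
```

The resulting std::string holds the raw UTF-8 bytes, which can then be decoded with whatever UTF-8 conversion is at hand.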