lua-struct

Supporting wide character (UTF-16) strings

Open EnTerr opened this issue 9 years ago • 2 comments

I needed to unpack zero-terminated strings in which each character is 2 bytes in little-endian (lo-hi) order, even though almost all of the strings are in the ASCII range. So this is what I added to unpack:

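    elseif opt == 'S' then   -- zero-terminated wide-character string, lo-hi (little-endian)

      local str = ''
      while true do
        -- first byte is the low half of the character, second byte the high half
        local lo, hi = stream:byte(iterator, iterator + 1)
        if not hi then
          break              -- ran off the end of the stream without a terminator
        end
        iterator = iterator + 2
        local wch = lo + 256 * hi
        if wch == 0 then
          break              -- zero terminator
        end
        -- only the ASCII range is kept; anything else becomes '~'
        str = str .. (wch < 128 and string.char(wch) or '~')
      end
      table.insert(vars, str)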

    elseif

This is the most controversial/unfinished of my mods, since it assumes little-endian encoding (many apps use lo-hi, even though the default per RFC 2781 is big-endian; see https://en.wikipedia.org/wiki/UTF-16#Byte_order_encoding_schemes). In addition, I don't check for a byte order mark (https://en.wikipedia.org/wiki/Byte_order_mark).
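For what it's worth, the missing byte order mark check could be sketched roughly like this. `detect_utf16_order` is a hypothetical helper, not something lua-struct provides, and falling back to big-endian when there is no BOM is just the RFC 2781 default; a caller who knows the data is lo-hi can pass `'le'` instead.

```lua
-- Hypothetical helper, not part of lua-struct: look at the two bytes of
-- `stream` at `pos` and decide the UTF-16 byte order.
-- Returns the order ('be' or 'le') and the position of the first byte
-- after the BOM (or `pos` unchanged when no BOM is present).
local function detect_utf16_order(stream, pos, default)
  pos = pos or 1
  local b1, b2 = stream:byte(pos, pos + 1)
  if b1 == 0xFE and b2 == 0xFF then
    return 'be', pos + 2           -- FE FF: big-endian BOM
  elseif b1 == 0xFF and b2 == 0xFE then
    return 'le', pos + 2           -- FF FE: little-endian BOM
  end
  -- No BOM: assume big-endian (RFC 2781) unless the caller says otherwise.
  return default or 'be', pos
end
```

The returned position could then be used as the starting `iterator` for the actual character loop.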

Nor do I correctly handle code points above 127; the code replaces them with '~'. Which is a puzzle: how to handle that correctly in Lua? I am guessing the right thing would be to convert to UTF-8 for the internal string (which matches ASCII for code points below 128). In any case: not production ready, but it addresses an existing need.
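A minimal sketch of the UTF-8 idea, assuming plain Lua with no `utf8` or bit library (so it should also run on 5.1). The names `utf8_encode` and `read_utf16z` are made up for illustration and are not part of lua-struct. Surrogate pairs are combined, and unpaired surrogates are replaced with U+FFFD, which is one possible policy among several.

```lua
-- Encode one Unicode code point (0 .. 0x10FFFF) as a UTF-8 byte string.
local function utf8_encode(cp)
  if cp < 0x80 then
    return string.char(cp)
  elseif cp < 0x800 then
    return string.char(0xC0 + math.floor(cp / 0x40),
                       0x80 + cp % 0x40)
  elseif cp < 0x10000 then
    return string.char(0xE0 + math.floor(cp / 0x1000),
                       0x80 + math.floor(cp / 0x40) % 0x40,
                       0x80 + cp % 0x40)
  end
  return string.char(0xF0 + math.floor(cp / 0x40000),
                     0x80 + math.floor(cp / 0x1000) % 0x40,
                     0x80 + math.floor(cp / 0x40) % 0x40,
                     0x80 + cp % 0x40)
end

-- Read a zero-terminated UTF-16 string from `stream` starting at `pos`.
-- `big_endian` selects hi-lo (true) or lo-hi (false/nil) byte order.
-- Returns the UTF-8 string and the position just past what was consumed.
local function read_utf16z(stream, pos, big_endian)
  -- Fetch the 16-bit code unit at byte position p, or nil if truncated.
  local function unit(p)
    local b1, b2 = stream:byte(p, p + 1)
    if not b2 then return nil end
    if big_endian then return b1 * 256 + b2 end
    return b1 + b2 * 256
  end

  local out = {}
  while true do
    local w = unit(pos)
    if not w then break end            -- truncated input: stop quietly
    pos = pos + 2
    if w == 0 then break end           -- zero terminator
    if w >= 0xD800 and w < 0xDC00 then
      -- high surrogate: combine with the low surrogate that should follow
      local w2 = unit(pos)
      if w2 and w2 >= 0xDC00 and w2 < 0xE000 then
        pos = pos + 2
        w = 0x10000 + (w - 0xD800) * 0x400 + (w2 - 0xDC00)
      else
        w = 0xFFFD                     -- unpaired high surrogate
      end
    elseif w >= 0xDC00 and w < 0xE000 then
      w = 0xFFFD                       -- stray low surrogate
    end
    out[#out + 1] = utf8_encode(w)
  end
  return table.concat(out), pos
end
```

Inside the `'S'` branch this would amount to something like `local str; str, iterator = read_utf16z(stream, iterator, false)` followed by the usual `table.insert(vars, str)`. Collecting pieces in a table and calling `table.concat` once also avoids the repeated string concatenation in the loop.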

EnTerr, Sep 07 '15 21:09

@EnTerr Hello, did you add anything else too? I added your code to the unpack function but still can't decode UTF-16 strings...

NicoAdrian, Jun 20 '19 09:06

Not a solution, but Lua strings are just sequences of 8-bit bytes, so UTF code points are simply split into 8-bit values. What those values are depends on the UTF encoding you're using (UTF-8/UTF-16BE/UTF-16LE), so you have to know which one you're working with to anticipate the byte order in the string.
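To illustrate, here is the same character, "é" (U+00E9), written out by hand as the bytes each encoding would produce. The `hex` helper is only there for printing and is not part of any library.

```lua
-- "é" (U+00E9) as raw bytes in three encodings
local utf8    = string.char(0xC3, 0xA9)
local utf16be = string.char(0x00, 0xE9)   -- hi-lo
local utf16le = string.char(0xE9, 0x00)   -- lo-hi

-- Format every byte of `s` as two hex digits separated by spaces.
local function hex(s)
  local t = {}
  for i = 1, #s do
    t[i] = string.format('%02X', s:byte(i))
  end
  return table.concat(t, ' ')
end

print(#utf8,    hex(utf8))      -- 2   C3 A9
print(#utf16be, hex(utf16be))   -- 2   00 E9
print(#utf16le, hex(utf16le))   -- 2   E9 00
```

In every case `#` counts bytes, not characters, and nothing in the string itself says which encoding it is.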

randomeizer, Dec 04 '19 05:12