kaitai_struct Strz type support for UTF-16 and UTF-32

This issue is very similar to #13 and there is a lot of relevant discussion there.

Observed

If strz is used with UTF-16 or UTF-32 then a single null byte is enough to result in termination rather than 2 (U16) or 4 (U32) bytes.

Expected

If the encoding is given as UTF-16 or UTF-32 then the strz parse only terminates when the corresponding number of consecutive null bytes appear.

To describe the use case, I am writing a descriptor for the output of an internal Windows tool which reserves a fixed length for the string then null terminates it. While this practice is a bit inefficient and I can't point to any widely used formats that do the same I suspect that it is not uncommon for Windows programs since it is analogous to doing so with UTF-8 in a cross-platform or Unix scenario.

On Windows, an example C struct might look like this:

struct TextBlob
{
	UCHAR DataPrefix[n];
	WCHAR WideString[64];
	UCHAR DataSuffix[n];
};

For an example data blob and ksy file that minimally reproduce the issue (at least with the web IDE), please see the following zip: utf16_test.zip

Jun 20 '17 12:06 dreckard

Ok, to collect all the arguments that were mentioned in previous discussions on this topic in one place:

Implementing "read until a combination of multiple bytes is encountered" is non-trivial. For example, none of our supported languages have such a method in stdlib (but note almost everyone includes something like "read until a byte is encountered").
Actually, "read until a sequence of 00 00 is encountered" is plain wrong for this case. Namely, "01 00 00 02 00 00" should be read as two UTF-16LE chars "\u0001\u0200", not as just "\u0001". I.e. "00 00" must appear as a whole character, 2-byte aligned.
It's totally possible (although not really usual) to get this kind of problem outside of the C + Windows world. Although the majority of applications don't seem to use the family of functions for zero-terminated wide-char strings, stuff like wcslen, wcscat, etc. (akin to the much more popular strlen, strcat, etc.) has technically existed in C since the C99 standard.
No popular (and publicly available) format has been found using this so far.
Looks like this issue is only relevant to the UTF-16 representation of strings, and probably 99.9% of the time it is actually used on a fixed-size string to trim remaining garbage, i.e. it should be solved around the bytes_terminate function, but probably applied in a somewhat different manner.
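To make the alignment point concrete, here is a minimal Python sketch (not part of any Kaitai Struct runtime; the helper name find_aligned_terminator is made up for illustration). It contrasts a plain byte search for 00 00 with a scan that only looks at whole 2-byte code units, using the "01 00 00 02 00 00" sample from the list above.

```python
def find_aligned_terminator(data: bytes, unit_size: int = 2) -> int:
    """Return the byte offset of the first all-zero code unit, or -1 if none.

    Hypothetical helper: only offsets that are multiples of unit_size are checked.
    """
    zero_unit = bytes(unit_size)
    for offset in range(0, len(data) - unit_size + 1, unit_size):
        if data[offset:offset + unit_size] == zero_unit:
            return offset
    return -1


sample = bytes.fromhex("01 00 00 02 00 00")

# A plain substring search matches the 00 00 that straddles two code units.
naive_offset = sample.find(b"\x00\x00")

# The aligned scan only accepts 00 00 as a whole UTF-16 code unit.
aligned_offset = find_aligned_terminator(sample, 2)
aligned_text = sample[:aligned_offset].decode("utf-16-le")
```

The naive search stops at byte offset 1, in the middle of a character, while the aligned scan stops at offset 4 and yields the two expected characters.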

The current archetypical application generates something like this:

foo = bytes_to_str(
    bytes_terminate(
        bytes_strip_right(
            io.read_bytes(20),
            43 // padding byte
        ),
        64, // terminator byte
        false
    ),
    "UTF-8"
);

Obviously, applying a function like "bytes_terminate" that operates on strings (not just byte arrays) requires us to convert the byte array to a string first with "bytes_to_str", i.e. executing something like this:

foo = str_terminate(
    bytes_to_str(
        io.read_bytes(20),
        "UTF-16LE"
    ),
    0x0000 // terminator char
);

The big catch is that the trailing garbage might actually contain something that is invalid in the chosen encoding. Unfortunately, contrary to popular opinion, this is true even for UTF16. For example, if we're working with C++ or PHP (and an iconv-based implementation, which converts everything that should be treated as a string internally to UTF-8), this string will trigger an error on conversion:

61 00|00 00|00 D8 61 00

Although we would expect the result "a" (as 0x61 is the ASCII code for "a", and then the string terminates with 00 00), it will trigger something like InvalidByteSequenceError. This is because 00 D8 61 00 is not a valid UTF16 character: 00 D8 is the lead of a surrogate pair, and it must be followed by the tail of a surrogate pair, so 61 00 is illegal here.
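The same effect can be reproduced in Python, whose strict UTF-16 decoder also rejects an unpaired lead surrogate. This is only an illustration of the ordering problem (decode first vs. trim first); it does not use iconv, and the variable names are invented for the sketch.

```python
raw = bytes.fromhex("61 00 00 00 00 D8 61 00")

# Decoding the whole fixed-size field first fails on the garbage after the
# terminator: 00 D8 is a lead surrogate that is not followed by a trail one.
try:
    raw.decode("utf-16-le")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Trimming at the first aligned 00 00 unit *before* decoding avoids the error.
end = next(
    (i for i in range(0, len(raw) - 1, 2) if raw[i:i + 2] == b"\x00\x00"),
    len(raw),
)
trimmed_text = raw[:end].decode("utf-16-le")
```

This is why the termination step has to operate on bytes (in code-unit-sized steps) rather than on an already decoded string.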

Jun 20 '17 14:06 GreyCat

No popular (and publicly available) format has been found using this so far.

Windows version info resources, found in executables and .res files, use them.

Jun 20 '17 21:06 sirg3

VS_VERSIONINFO per se has no variable-length strings. However, it includes StringFileInfo → StringTable → String, which actually includes them.

What's even more peculiar, actually, it seems that there are literally double-null strings to store two-level lists: https://blogs.msdn.microsoft.com/oldnewthing/20091008-00/?p=16443 — I believe they effectively become quad-null terminated strings when laid out in UTF-16. Can anyone confirm/deny that?

Jun 21 '17 06:06 GreyCat

Good clarification on the byte alignment requirement. I believe that is correct regarding the quad-null termination for WCHAR string lists. The only other documentation I have seen is in some of the APIs that return data in this style such as GetLogicalDriveStrings, but I don't see any other way it could be interpreted.
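As a rough illustration of that interpretation, the sketch below parses a UTF-16LE list of strings where each string ends with an aligned 00 00 and the list ends with an empty string, so the blob finishes with four zero bytes. The function name split_wide_string_list and the sample drive names are assumptions made for the example, not output captured from GetLogicalDriveStrings.

```python
def split_wide_string_list(data: bytes) -> list:
    """Split a double-null-terminated UTF-16LE string list (hypothetical helper)."""
    strings = []
    start = 0
    for offset in range(0, len(data) - 1, 2):
        if data[offset:offset + 2] == b"\x00\x00":
            if offset == start:
                # An empty string marks the end of the whole list.
                return strings
            strings.append(data[start:offset].decode("utf-16-le"))
            start = offset + 2
    raise ValueError("list terminator not found")


# Two strings, each null-terminated, followed by the empty terminating string.
blob = ("C:\\" + "\0" + "D:\\" + "\0" + "\0").encode("utf-16-le")
drives = split_wide_string_list(blob)
tail = blob[-4:]
```

In this layout the last string's terminator and the list terminator sit next to each other, which is what produces the quad-null ending.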

The Windows KUSER_SHARED_DATA struct's NtSystemRoot is another example of a null terminated UTF-16 (single) string. The struct is fairly commonly parsed in memory dump analysis at least.

typedef struct _KUSER_SHARED_DATA
{
     ULONG TickCountLowDeprecated;
     ULONG TickCountMultiplier;
     KSYSTEM_TIME InterruptTime;
     KSYSTEM_TIME SystemTime;
     KSYSTEM_TIME TimeZoneBias;
     WORD ImageNumberLow;
     WORD ImageNumberHigh;
     WCHAR NtSystemRoot[260];
     ULONG MaxStackTraceDepth;
     ...
}

http://www.geoffchappell.com/studies/windows/km/ntoskrnl/structs/kuser_shared_data.htm

Jun 22 '17 10:06 dreckard

I've implemented a simple Windows resource parser as an exercise: https://github.com/kaitai-io/windows_resource_file.ksy/blob/master/windows_resource_file.ksy. For now it uses a hack to parse strings (especially because there's an extra twist there: the same byte space is used to designate numeric and string IDs).

While doing so, I've noticed that there are at least 2 very distinct cases we're talking about here:

1. Pure "read until double 00 terminator, convert to string" — encountered at least in NAME and TYPE fields of
2. "Read fixed number of bytes, then trim it using double 00 terminator logic, then convert to string"

That means my original proposal is wrong. We indeed need both (1) and (2) implemented, not only (2).

Jun 22 '17 12:06 GreyCat

Construct has both (fixed-length string with filler, and c-string). Admittedly it took me 2 years to get it implemented. I have never seen a protocol that actually used the first either, and I can't even imagine why anyone would design such a protocol in the first place, but it was requested on the Construct forum more than once, so I guess someone out there actually needs it, so here we are.

@GreyCat If you give me a go, I will add 2 methods to Python runtime to effectuate this, but you would need to update the compiler (translator). I could update C# then too. Restriction is, these 2 methods need to know what encoding it is, or rather what is the unit size (2 for UTF16, 4 for UTF32, 1 for UTF8).

Feb 07 '18 23:02 arekbulski

Let's start with inventing some ksy syntax that covers all the cases discussed in this ticket.

Feb 08 '18 01:02 GreyCat

Fixed string: "type: str, size: N, terminator: 0, encoding: utf16, [unitsize: T=2]" (EDIT, added terminator)

Reads N bytes, then successively strips last T bytes if they are null, down to empty string. If encoding is recognizable like UTF*, unitsize can be inferred. N must be multiple of T.
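A minimal Python sketch of the fixed-string rule described above, assuming the unit size T is passed in explicitly. The helper name strip_trailing_null_units is invented here and is not an existing Construct or Kaitai Struct function.

```python
def strip_trailing_null_units(data: bytes, unit_size: int) -> bytes:
    """Strip trailing all-zero units of unit_size bytes (hypothetical helper)."""
    if len(data) % unit_size != 0:
        raise ValueError("length must be a multiple of unit_size")
    zero_unit = bytes(unit_size)
    end = len(data)
    # Drop the last unit for as long as it consists only of null bytes.
    while end > 0 and data[end - unit_size:end] == zero_unit:
        end -= unit_size
    return data[:end]


# A 10-byte fixed field holding "hi" in UTF-16LE, padded with null bytes.
field = "hi".encode("utf-16-le") + bytes(6)
text = strip_trailing_null_units(field, 2).decode("utf-16-le")

# For comparison: byte-wise stripping also eats the high byte of "i" (69 00).
bytewise = field.rstrip(b"\x00")
```

Stripping whole units keeps the result a multiple of the unit size, whereas the byte-wise version leaves an odd-length buffer that no longer decodes as UTF-16.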

CString: "type: str, terminator: 0, encoding: utf16, [unitsize: T=2]"
CString: "type: strz, encoding: utf16, [unitsize: T=2]"

Reads T bytes at a time, until that chunk is all null bytes. If the first chunk is nulls, it's an empty string. If encoding is recognizable like UTF*, unitsize can be inferred. A terminator other than 0 should be a compile error, because at least UTF encodings support only one way of terminating it.
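The CString rule can be sketched in Python against any file-like object. Again, this is only an assumption about how such a runtime method might look; read_unit_terminated is a made-up name, and io.BytesIO stands in for the real stream.

```python
import io


def read_unit_terminated(stream, unit_size: int) -> bytes:
    """Read unit_size bytes at a time until an all-zero unit (hypothetical helper).

    The terminator is consumed but not included in the returned bytes.
    """
    zero_unit = bytes(unit_size)
    chunks = []
    while True:
        chunk = stream.read(unit_size)
        if len(chunk) < unit_size:
            raise EOFError("stream ended before a terminator was found")
        if chunk == zero_unit:
            return b"".join(chunks)
        chunks.append(chunk)


# "abc" in UTF-16LE, a 2-byte terminator, then unrelated trailing data.
stream = io.BytesIO("abc".encode("utf-16-le") + b"\x00\x00" + b"\xff\xff")
text = read_unit_terminated(stream, 2).decode("utf-16-le")
position_after = stream.tell()
```

After the call the stream is positioned right behind the terminator, so the following field can be parsed normally.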

By recognizable encodings I mean those:

construct.possiblestringencodings = {'U16': 2, 'utf_8': 1, 'utf32': 4, 'utf_32_le': 4, 'utf8': 1, 'utf_32_be': 4, 'utf_32': 4, 'utf_16_be': 2, 'U32': 4, 'utf16': 2, 'ascii': 1, 'utf_16': 2, 'utf_16_le': 2, 'U8': 1}

Feb 11 '18 03:02 arekbulski

There is also a problem when both size and terminator are used.

meta:
  id: test1
seq:
  - id: value
    type: str
    terminator: 0
    size: 10
    encoding: utf-8

Compiles into the following. Problem is, bytes_terminate only supports single ~characters~ bytes.

(KaitaiStream.bytes_terminate(self._io.read_bytes(10), 0, False)).decode(u"utf-8")

Extract from the runtime:

    def bytes_terminate(data, term, include_term):
        new_len = 0
        max_len = len(data)
        while new_len < max_len and data[new_len] != term:  #<--- indexing not slicing
            new_len += 1   #<--- non variable
        if include_term and new_len < max_len:
            new_len += 1
        return data[:new_len]

Mar 04 '18 18:03 arekbulski

bytes_terminate, as its name suggests, does not support any characters at all, only bytes. What problem do you see here?

Mar 04 '18 19:03 GreyCat

To support UTF16/32, bytes_terminate would need to slice data[newlen:newlen+unit] and then increment like newlen += unit. In other words, the current implementation would not support UTF16/32. That's what this issue is about. And by characters I meant single bytes, sorry.

Mar 04 '18 19:03 arekbulski

It doesn't, that's true. My main concern here is that technically we should not deal with bytes at all: if we're dealing with encodings like UTF16, for example, in C++, that would call for char16_t type instead of uint8_t and relevant reinterpret cast.

If we stick with a bytes-centric implementation, however, from the runtime's point of view, we can probably cover all possible cases by specifying something like bytes_terminate_multi, something like this:

byte[] bytesTerminateMulti(byte[] bytes, byte[] term, int unitSize, boolean includeTerm)

and the same thing about bytes_strip_right_multi:

public static byte[] bytesStripRight(byte[] bytes, int unitSize, byte[] padBytes) {

Mar 04 '18 23:03 GreyCat

Good idea, it would be better to have a separate (multi) method instead of generalizing the existing (single) method, because multi will be slower than single.

Would the unitsize parameter actually be needed? I think it's just len(term) and len(padbytes).

Mar 05 '18 00:03 arekbulski

Specifying unitSize is needed because we need to support both aligned 00 00 and unaligned variants, i.e. something like: 30 00 40 00 50 50 50 00 00 12 34 56 should become 30 00 40 00 50 50 50 (or 0\0@\0PPP in ASCII) after termination, if unitSize is 1 and term is [0, 0].
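To see why unitSize and len(term) are independent, here is a hedged Python sketch of what a bytes_terminate_multi could look like. It is modeled on the Java signature proposed earlier in this thread, not on any method that exists in the runtime today.

```python
def bytes_terminate_multi(data: bytes, term: bytes, unit_size: int,
                          include_term: bool) -> bytes:
    """Cut data at the first occurrence of term found at a multiple of unit_size.

    Sketch only: unit_size controls the step (alignment), len(term) the match width.
    """
    for offset in range(0, len(data) - len(term) + 1, unit_size):
        if data[offset:offset + len(term)] == term:
            end = offset + len(term) if include_term else offset
            return data[:end]
    # No terminator found: keep everything, like the single-byte version does.
    return data


sample = bytes.fromhex("30 00 40 00 50 50 50 00 00 12 34 56")

# unit_size 1: the 00 00 at byte offset 7 counts, although it is unaligned.
unaligned = bytes_terminate_multi(sample, b"\x00\x00", 1, False)

# unit_size 2: no aligned 2-byte unit is 00 00, so nothing is cut off.
aligned = bytes_terminate_multi(sample, b"\x00\x00", 2, False)
```

With the same two-byte terminator, the two unit sizes give different results on this sample, which is the case that cannot be expressed if the step is always derived from len(term).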

Also, I think that this place, more than anything, warrants plenty of language-specific APIs. For example, in C/C++, you don't want to pass an array around; you'd want one uint16_t integer, and then you just reinterpret_cast parts of the array and compare them to that integer. This obviously doesn't make much sense for higher-level languages, in which "parsing 2 bytes as integer" is an expensive operation.

Mar 06 '18 11:03 GreyCat

I thought it should be only unit-aligned version. The usecase for aligned would be obviously strings, but what would be the usecase for unaligned?

Mar 06 '18 12:03 arekbulski

BTW, do we really have to carve strzs ourselves for C++? I mean that string functions, like basic_string ctor from a pointer, expect certain termination, so can't we just pass a pointer, and then shift to the size() of the string parsed by c++?

Mar 07 '18 06:03 KOLANICH

It probably only works with single-null terminator, right? This issue is about supporting UTF16 and 32.

Mar 07 '18 06:03 arekbulski

wchar_t is usually (though it is not standardised) a utf-16 encoded code point. Wstring is a string of wchar_ts. Arrays of wchar_t are assumed to be terminated with 2 null bytes.

Mar 07 '18 10:03 KOLANICH

So someone uses it with UTF32 encoding, then what?

Mar 07 '18 10:03 arekbulski

wchar_t is usually (though it is not standardised) a utf-16 encoded code point.

I'd say that from a practical point of view, the only system that uses UTF16 wide chars is Windows. Virtually everyone else uses UTF32 there.

Mar 07 '18 15:03 GreyCat

Python unicode strings use 1/2/4 bytes per character, depending on actual text.

Mar 07 '18 15:03 arekbulski

The question @KOLANICH raised was about wchar_t in C++. Python, Java, C#, Go and everything else is non-relevant to that.

Let's get back to original topic.

The usecase for aligned would be obviously strings, but what would be the usecase for unaligned?

I believe someone further up in this issue provided some examples of why that would be useful. I recall some strings in UTF16 wanted a 4-byte [0, 0, 0, 0] terminator and stuff like that.

Mar 07 '18 15:03 GreyCat

Any update on this issue? Seems like there is a path forward. Need any help?

This is currently causing me some grief.

Dec 04 '21 22:12 kdumontnu

It is definitely a more common occurrence than is assumed. I'm trying to make a struct for the subtitle files of a game, which are UTF-16 strings that end on a terminator.

May 28 '23 18:05 ricky-daniel13

This is hardly a good solution, but it is possible to parse zero terminated utf-16 (and utf-32) strings by abusing the repeat-until feature.

The basic idea is to just parse it as an array of u2/u4 with a repeat-until condition of _ == 0. After that you can use the length of that array to parse the actual string in an instance. This trick does however require you to be able to write an expression to calculate the start address, which can become tricky if the string is located behind other variable-size structures or within a type definition.

seq:
  - id: before_string
    type: u4
  - id: string_chars
    type: u2
    repeat: until
    repeat-until: _ == 0
  - id: after_string
    type: u4
instances:
  my_string:
    pos: 0x04
    type: str
    encoding: utf-16
    size: (_root.string_chars.size - 1) * 2  # exclude the 0x0000 terminator

On the issue of relevance: I work with a lot of obscure game engines and zero terminated utf-16 strings are not that uncommon there, for example for the file name tables in their archive formats.

Jun 09 '23 23:06 AtomCrafty

@AtomCrafty:

This is hardly a good solution, but it is possible to parse zero terminated utf-16 (and utf-32) strings by abusing the repeat-until feature.

No, I think this is actually quite a good workaround (and I don't think this is "abusing", it reflects well how a null-terminated UTF-16 string is parsed and it's as close to a "native" solution as you can get in pure Kaitai Struct). It can be also extracted to a reusable type like this:

meta:
  id: strz_utf_16_test
  endian: le
seq:
  - id: before_string
    type: u4
  - id: my_string
    type: strz_utf_16
  - id: after_string
    type: u4
types:
  strz_utf_16:
    seq:
      - id: value
        size: 2 * (code_units.size - 1)
        type: str
        encoding: utf-16le
      - id: term
        type: u2
        valid: 0
    instances:
      code_units:
        pos: _io.pos
        type: u2
        repeat: until
        repeat-until: _ == 0

Parsing the string in seq and performing the lookahead in instances is a bit more serialization-friendly. You can set the value string normally (though it must not contain any U+0000 code points particularly if it comes from a user, otherwise it won't be read back properly), ensure that the code_units list has a correct length (the contents won't matter) and then disable writing of the code_units instance (see https://doc.kaitai.io/serialization.html#_parse_instances).

Python code to handle serialization of the strz_utf_16_test.ksy spec above

from kaitaistruct import KaitaiStream
from strz_utf_16_test import StrzUtf16Test

r = StrzUtf16Test()
r.before_string = 803_200_000

my_s = StrzUtf16Test.StrzUtf16(None, r, r._root)
my_s.value = "Hello \U0001F44B"
my_s.term = 0
my_s.code_units = [0x0000] * (len(my_s.value.encode('utf-16le')) // 2 + 1)
my_s.code_units__to_write = False
my_s._check()

r.my_string = my_s
r.after_string = 48_160_000
r._check()

import io
_io = KaitaiStream(io.BytesIO(bytearray(4 + 2 * len(r.my_string.code_units) + 4)))
r._write(_io)

output = _io.to_byte_array()
print(output.hex(' '))  # 00 dc df 2f 48 00 65 00 6c 00 6c 00 6f 00 20 00 3d d8 4b dc 00 00 00 dd de 02

This trick does however require you to be able to write an expression to calculate the start address, which can become tricky if the string is located behind other variable-size structures or within a type definition.

This limitation can be eliminated using _io.pos (as you can see in the above .ksy snippet; see also https://doc.kaitai.io/user_guide.html#_streams). Since the code_units instance is invoked just before value can be parsed, the _io.pos expression gives the current position relative to the current stream that can be directly used in pos. Also, instances are cached, so even if code_units is requested later (when _io.pos would probably give a different value), it's not parsed again - the getter simply returns the cached value from the first invocation.

Jun 10 '23 14:06 generalmimon

@generalmimon

Well you know your way around Kaitai way better than me, so I'll take your word for it being idiomatic ^^ I did not know about _io.pos or that instances can be evaluated before the main sequence is finished. Your variation makes the whole thing look much cleaner than what I cobbled together and it has the major advantage of being plug-and-play. I'm pretty happy with that as a solution.

Jun 10 '23 16:06 AtomCrafty

Another implementation I've used in the past to get UTF-16 support. Using the as_string instance does the conversion from UTF-16LE to the native encoding.

unicode_16:
  seq:
    - id: first
      size: 0
      if: start_ >= 0
    - id: c
      type: u2
      repeat: until
      repeat-until: _ == 0
    - id: last
      size: 0
      if: end_ >= 0
  instances:
    start_:
      value: _io.pos
    end_:
      value: _io.pos
    as_string:
      pos: start_
      type: str
      size: end_ - start_ - 2
      encoding: UTF-16LE

Feb 09 '24 17:02 t0xicCode