
utf16le + readStringNT compatibility

Open azz opened this issue 5 years ago • 4 comments

Consider:

const buffer = SmartBuffer.fromSize(0, 'utf16le');
buffer.writeStringNT('hello');
const output = buffer.readStringNT();

We'd expect output to be "hello", but it's currently '', due to:

https://github.com/JoshGlazebrook/smart-buffer/blob/d35c0ce6e253e7c963553a4092cb73b711caafaa/src/smartbuffer.ts#L685-L690

The buffer (after the write) looks like this:

68 00 65 00 6c 00 6c 00 6f 00 00
   ^^ perceived NT            ^^ actual NT

I'm not sure whether any encodings other than utf16le suffer from this, but to fix it, the i++ in that loop should be changed to i += 2 for utf16le.
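To make the difference concrete, here is a minimal sketch using only Node's Buffer. The `findTerminator` helper is a hypothetical stand-in for the library's scanning loop, not its actual code; it only mirrors the idea of checking one byte per iteration with a configurable step.

```javascript
// Bytes from the dump above: "hello" in utf16le followed by a single 0x00 terminator byte.
const bytes = Buffer.concat([Buffer.from('hello', 'utf16le'), Buffer.from([0x00])]);

// Hypothetical stand-in for the scanning loop: return the offset of the first
// 0x00 byte, advancing `step` bytes per iteration (or buf.length if none is found).
function findTerminator(buf, step) {
  for (let i = 0; i < buf.length; i += step) {
    if (buf[i] === 0x00) return i;
  }
  return buf.length;
}

// step = 1 (the current i++): stops at offset 1, the high byte of 'h'.
const byteWise = bytes.toString('utf16le', 0, findTerminator(bytes, 1));

// step = 2 (the proposed i += 2): only looks at the first byte of each code unit.
const unitWise = bytes.toString('utf16le', 0, findTerminator(bytes, 2));

console.log(JSON.stringify(byteWise), JSON.stringify(unitWise));
```

Note that a step of 2 still inspects only one byte per code unit, so a character whose low byte is 0x00 (for example U+0100) would still be mistaken for the terminator.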

azz, May 19 '19 12:05

Hmm, this one is interesting. I'll have to check the other possible encodings and see if any others do this.

JoshGlazebrook, May 22 '19 17:05

So I looked into this a bit more. UTF-16 is variable length: a single character is represented by either 2 bytes or 4 bytes. So even the fix above will only work for certain characters.

I think the solution here is to just throw an error if attempting to write or read a null terminated string using utf16 or ucs2.

https://nodejs.org/api/buffer.html#buffer_buffers_and_character_encodings

https://en.wikipedia.org/wiki/Null-terminated_string#Character_encodings

Technically it looks like this isn't possible with even utf8, but it works for most characters.

JoshGlazebrook, Jul 22 '19 04:07

I might be misremembering my UTF studies, but I'm pretty sure that continuation bytes (when code points extend beyond a single byte for utf8, or two bytes for utf16) cannot be 0. They have to be a negative number when expressed as a signed value (e.g. signed byte for utf8, signed short for utf16); that is, the first bit must be a 1.
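A quick way to check this claim is to encode a code point outside the Basic Multilingual Plane and inspect the result. This is only an illustrative sketch with Node's Buffer; the variable names are made up for the example, and U+1F600 is just one convenient test character.

```javascript
// U+1F600 needs four bytes in UTF-8 and a surrogate pair in UTF-16.
const emoji = '\u{1F600}';

// UTF-8: check that every byte of the multi-byte sequence has its top bit set.
const utf8 = Buffer.from(emoji, 'utf8');
const utf8AllHighBit = utf8.every((b) => (b & 0x80) !== 0);

// UTF-16LE: read the two 16-bit code units of the surrogate pair.
const utf16 = Buffer.from(emoji, 'utf16le');
const units = [utf16.readUInt16LE(0), utf16.readUInt16LE(2)];
const unitsAllHighBit = units.every((u) => (u & 0x8000) !== 0);

// Individual bytes of a surrogate can still be 0x00, even though no 16-bit unit is.
const utf16HasZeroByte = utf16.includes(0x00);

console.log(utf8AllHighBit, unitsAllHighBit, utf16HasZeroByte);
```

For UTF-16 the property holds per 16-bit code unit rather than per byte, so a scan has to compare whole code units to rely on it.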

azz, Jul 22 '19 08:07

There is one question to ask here: is the null terminator to be interpreted as a character that is part of the string's encoding?

If yes, then the null terminator would be as it is in the string: 2 bytes, meaning you'd be checking for two consecutive null bytes at an even offset from the starting read offset.

If not, there is no way to safely detect a null terminator in UTF-16, as either byte of a code point may be null, so there are no guarantees when checking individual bytes.

Thus, I would say it's the logical decision to interpret the null terminator as a character in the string's encoding.
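As a rough sketch of that interpretation (not the library's implementation), the scan below looks for a 16-bit zero code unit at an even offset from the read position. `findNullUnitUtf16le` is a hypothetical helper, and the example assumes the writer also emits a two-byte terminator, which is not what the buffer shown in the original report contains.

```javascript
// Hypothetical helper: offset of the first 0x0000 code unit, scanning two bytes
// at a time from `start`; returns -1 if no aligned terminator is found.
function findNullUnitUtf16le(buf, start = 0) {
  for (let i = start; i + 1 < buf.length; i += 2) {
    if (buf[i] === 0x00 && buf[i + 1] === 0x00) return i;
  }
  return -1;
}

// 'a', U+0100, then a two-byte terminator: 61 00 00 01 00 00
const sample = Buffer.from('a\u0100\u0000', 'utf16le');

// An unaligned search matches the bytes at offsets 1 and 2, which belong to
// two different characters.
const unaligned = sample.indexOf(Buffer.from([0x00, 0x00]));

// The aligned scan skips that false match and finds the real terminator.
const aligned = findNullUnitUtf16le(sample);
const text = sample.toString('utf16le', 0, aligned);

console.log(unaligned, aligned, JSON.stringify(text));
```

The alignment requirement matters: two consecutive zero bytes can straddle adjacent characters, so only a match at an even offset from the start of the string can be treated as the terminator.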

exodustx0, Jan 24 '21 15:01