smart-buffer
utf16le + readStringNT compatibility
Consider:
const buffer = SmartBuffer.fromSize(0, 'utf16le');
buffer.writeStringNT('hello');
const output = buffer.readStringNT();
We'd expect output to be "hello", but it's currently '', due to:
https://github.com/JoshGlazebrook/smart-buffer/blob/d35c0ce6e253e7c963553a4092cb73b711caafaa/src/smartbuffer.ts#L685-L690
The buffer (after the write) looks like this:
68 00 65 00 6c 00 6c 00 6f 00 00
   ^^ perceived NT            ^^ actual NT
I'm not sure if any encodings other than utf16le suffer from this, but to fix it the i++ should be changed to i += 2 for utf16le.
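To make the failure concrete, here is a minimal sketch that works on the raw bytes from the dump above rather than on smart-buffer itself. The helpers `findNullByteWise`, `findNullStride2`, and `decodeUtf16le` are hypothetical names made up for illustration; they only approximate the scan loop in the linked source, with the second one applying the suggested `i += 2` step.

```typescript
// "hello" in utf16le followed by the single 0x00 byte from the dump above.
const bytes = new Uint8Array([
  0x68, 0x00, 0x65, 0x00, 0x6c, 0x00, 0x6c, 0x00, 0x6f, 0x00, 0x00,
]);

// Hypothetical stand-in for the current scan: stop at the first zero byte.
function findNullByteWise(data: Uint8Array, start: number): number {
  for (let i = start; i < data.length; i++) {
    if (data[i] === 0x00) return i;
  }
  return data.length;
}

// The same scan with the proposed i += 2 step: only even offsets are checked.
function findNullStride2(data: Uint8Array, start: number): number {
  for (let i = start; i < data.length; i += 2) {
    if (data[i] === 0x00) return i;
  }
  return data.length;
}

// Decode little-endian 16-bit code units in [start, end) into a string.
function decodeUtf16le(data: Uint8Array, start: number, end: number): string {
  let out = '';
  for (let i = start; i + 1 < end; i += 2) {
    out += String.fromCharCode(data[i] | (data[i + 1] << 8));
  }
  return out;
}

const perceived = findNullByteWise(bytes, 0); // 1: high byte of 'h'
const actual = findNullStride2(bytes, 0); // 10: the real terminator
const broken = decodeUtf16le(bytes, 0, perceived); // ''
const fixed = decodeUtf16le(bytes, 0, actual); // 'hello'
```

The stride-2 scan succeeds here only because the low byte of every character in "hello" is non-zero. A character such as U+0100, encoded as `00 01`, would still stop the scan early.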
Hmm this one is interesting. I'll have to check the other possible encodings and see if any others do this.
So I looked into this a bit more. UTF-16 is variable length: a single character is represented by either 2 bytes or 4 bytes. So even the fix above will only work for certain characters.
I think the solution here is to just throw an error if attempting to write or read a null terminated string using utf16 or ucs2.
https://nodejs.org/api/buffer.html#buffer_buffers_and_character_encodings
https://en.wikipedia.org/wiki/Null-terminated_string#Character_encodings
Technically it looks like this isn't possible with even utf8, but it works for most characters.
I might be misremembering my UTF studies, but I'm pretty sure that continuation bytes (when code points extend beyond a single byte for utf8, or two bytes for utf16) cannot be 0; they have to be a negative number when expressed as a signed value (e.g. signed byte for utf8, signed short for utf16). That is, the first bit must be a 1.
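The UTF-16 half of that claim can be checked with a quick sketch. It assumes nothing beyond standard string methods; `toInt16` is a hypothetical helper that reinterprets a code unit as a signed short. It also shows that a non-zero code unit can still contain a zero byte.

```typescript
// Reinterpret an unsigned 16-bit code unit as a signed short.
function toInt16(unit: number): number {
  return (unit << 16) >> 16;
}

// U+1F600 written as an explicit surrogate pair.
const emoji = '\uD83D\uDE00';
const high = emoji.charCodeAt(0); // 0xD83D
const low = emoji.charCodeAt(1); // 0xDE00

// Both surrogates have the top bit set, so they are negative as signed shorts.
const highIsNegative = toInt16(high) < 0; // true
const lowIsNegative = toInt16(low) < 0; // true

// Little-endian bytes of the low surrogate: the first byte is 0x00
// even though the 16-bit code unit itself is non-zero.
const lowBytes = [low & 0xff, low >> 8]; // [0x00, 0xde]
```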
There is one question to ask here: is the null terminator to be interpreted as a character that is part of the string's encoding?
If yes, then the null terminator would be as it is in the string: 2 bytes, meaning you'd be checking for two consecutive null bytes at an even offset from the starting read offset.
If not, there is no way to safely detect a null terminator in UTF-16, as either byte of a code unit may be null, so there are no guarantees when checking individual bytes.
Thus, I would say it's the logical decision to interpret the null terminator as a character in the string's encoding.
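A minimal sketch of that interpretation, assuming the terminator is a full 16-bit zero code unit aligned to the read offset. `findUtf16Terminator` is a hypothetical helper, not part of smart-buffer's API.

```typescript
// Hypothetical helper: index of the first 16-bit zero code unit at an even
// offset from `start`, or -1 if there is none.
function findUtf16Terminator(data: Uint8Array, start: number): number {
  for (let i = start; i + 1 < data.length; i += 2) {
    if (data[i] === 0x00 && data[i + 1] === 0x00) return i;
  }
  return -1;
}

// 'h' (68 00), U+0100 (00 01), then a two-byte terminator (00 00).
// The unaligned zero pair at offsets 1-2 is skipped.
const sample = new Uint8Array([0x68, 0x00, 0x00, 0x01, 0x00, 0x00]);
const terminatorAt = findUtf16Terminator(sample, 0); // 4

// The buffer from the original report ends in a single 0x00 byte,
// so no aligned two-byte terminator exists.
const issueBytes = new Uint8Array([
  0x68, 0x00, 0x65, 0x00, 0x6c, 0x00, 0x6c, 0x00, 0x6f, 0x00, 0x00,
]);
const notFound = findUtf16Terminator(issueBytes, 0); // -1
```

For this to work end to end, the write side would also have to emit a two-byte terminator when the encoding is utf16le, which the buffer dump in the original report suggests it currently does not.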