protobuf-ts RangeError: premature EOF for Unicode character U+FEFF on start

Hello,

We've noticed that whenever the Unicode ZWNBSP character(U+FEFF) is received on start of the string of a message, it throws a silent error and omits the character in the decoded part. It seems that, this particular execution of TextDecoder.decode throws the mentioned error:

https://github.com/timostamm/protobuf-ts/blob/aaa63c7168cfae84a64e72c4f379017d3d1919b8/packages/runtime/src/binary-reader.ts#L245

I've created a repository to reproduce:

https://github.com/kivancguckiran/premature-eof-protobuf-ts

Outputted the charcodes from the result of the create operation and after fromBinary operation. If the U+FEFF character is in the start, it is ommited from decoded part.

Since it is Zero-Width-No-Break-Space, github preview hides the mentioned unicode character.

This line is: https://github.com/kivancguckiran/premature-eof-protobuf-ts/blob/main/index.ts#L4

Actually like this: resim

Thanks in advance.

Jun 30 '22 20:06 kivancguckiran

The text encoding/decoding is entirely up to your JS runtime.

You can run this in your browser or node and see the same result.

const input = 'test'; // "\ufefftest\ufeff"
const encoder = new TextEncoder();
const decoder = new TextDecoder('utf-8', { fatal: true });
const output = decoder.decode(encoder.encode(input)); // "test\ufeff"

I don't know for certain if this is expected for UTF-8/protobuf or a bug.

Jun 30 '22 21:06 jcready

Oh, duh. It's the BOM.

The name BYTE ORDER MARK is an alias for the original character name ZERO WIDTH NO-BREAK SPACE (ZWNBSP). With the introduction of U+2060 WORD JOINER, there's no longer a need to ever use U+FEFF for its ZWNSP effect, so from that point on, and with the availability of a formal alias, the name ZERO WIDTH NO-BREAK SPACE is no longer helpful, and we will use the alias here.

Using ignoreBOM: true in the TextDecoder constructor lets the string survives a round-trip:

const input = 'test'; // "\ufefftest\ufeff"
const encoder = new TextEncoder();
const decoder = new TextDecoder('utf-8', { fatal: true, ignoreBOM: true });
const output = decoder.decode(encoder.encode(input)); // "\ufefftest\ufeff"

Jul 01 '22 00:07 jcready

Thanks for the heads up! Do you think setting this option to true will merge into main? Since if I have analyzed the code correctly, TextEncoder instantiated on the fly within the library.

https://github.com/timostamm/protobuf-ts/blob/aaa63c7168cfae84a64e72c4f379017d3d1919b8/packages/runtime/src/binary-reader.ts#L50

Or is there another solution you are pointing out which I'm not seeing? Like if I can set the TextDecoder used in library somehow.

Jul 01 '22 00:07 kivancguckiran

Thought I can do this:

Test.fromBinary(data, {
  readerFactory: (bytes) => new BinaryReader(bytes, new TextDecoder('utf-8', {
    fatal: true,
    ignoreBOM: true,
  })),
});

But doing this everywhere fromBinary is called seems overkill.

Jul 01 '22 08:07 kivancguckiran

I can't seem to find any information about whether or not the BOM should be ignored or not in protobuf strings. They're definitely ignored when protoc is reading a proto file to compile, but as far as the actual runtimes go I see no tests regarding the BOM in string field values.

I think the correct thing would be to update protobuf-ts to use the ignoreBOM setting, but I can't be certain without seeing an existing test or docs. For an immediate workaround what you have is basically what I would've recommended, but you should only need to create the TextDecoder instance once.

// shared-binary-read-options.ts
import { BinaryReader, BinaryReadOptions } from '@protobuf-ts/runtime';

const textDecoder = new TextDecoder('utf-8', { fatal: true, ignoreBOM: true });
export const binaryReadOptions: Partial<BinaryReadOptions> = {
    readerFactory: (bytes) => new BinaryReader(bytes, textDecoder)
};

// some-other-file.ts
import { binaryReadOptions } from './shared-binary-read-options';

// Later
Test.fromBinary(data, binaryReadOptions);

I wouldn't recommend the following approach, but you can monkey-patch the BinaryReader prototype so that you can avoid needing to import and pass the options everywhere. Just note that you will need to execute (import) this code once before calling fromBinary() to be effective.

// monkey-patch-protobuf-ts-binary-reader.ts
import { BinaryReader } from '@protobuf-ts/runtime';

const textDecoder = new TextDecoder('utf-8', { fatal: true, ignoreBOM: true });

function monkeyPatchedString(): string {
    return textDecoder.decode(this.bytes());
}

// @ts-ignore
BinaryReader.prototype.string = monkeyPatchedString;

Jul 01 '22 12:07 jcready

Thanks to both of you for looking into this!

ignoreBOM is an option in Node.js. The docs say:

When true, the TextDecoder will include the byte order mark in the decoded result. When false, the byte order mark will be removed from the output. This option is only used when encoding is 'utf-8', 'utf-16be', or 'utf-16le'. Default: false.

Keeping the byte order mark in the decoded string seems reasonable to me, if only for reproducible encoding roundtrips. If this passes the conformance suite, it should be fine.

However, I'm curious about the error being thrown. @kivancguckiran, can you provide some details? If I run your example, I see that the BOM is stripped, but I don't see an error:

➜  ~ node -v
v16.13.2
➜  ~ npx esbuild --bundle index.ts | node
Result of Test.create()
65279, 116, 101, 115, 116, 65279, 

Result of Test.fromBinary()
116, 101, 115, 116, 65279,

Jul 07 '22 08:07 timostamm

Hello @timostamm!

You are absolutely right. This does not raise an error. I was trying to debug it inside node_modules at that time with protobuf-ts js modules, and my error was I was executing internally this.bytes() twice and that's when I was getting the error in binary-reader.js.

So no errors at all. But yes, the first BOM is omitted as you have noticed.

We are currently wrapping readerFactory like this to workaround the issue:

const decoder = new TextDecoder('utf-8', {
  fatal: true,
  ignoreBOM: true,
});

const deserialized = module?.fromBinary(e.data, {
  readerFactory: (bytes) => new BinaryReader(bytes, decoder),
});

Like suggested by @jcready. It would be nice to see this in main. If it does not cause any issues.

Thanks.

Jul 07 '22 09:07 kivancguckiran

We pass the conformance tests with ignoreBOM: true. Merged the addition in https://github.com/timostamm/protobuf-ts/pull/362.

Aug 17 '22 13:08 timostamm

Released in v2.8.0.

Aug 17 '22 13:08 timostamm

protobuf-ts protobuf-ts copied to clipboard

RangeError: premature EOF for Unicode character U+FEFF on start

protobuf-ts
protobuf-ts copied to clipboard