protobuf-es icon indicating copy to clipboard operation
protobuf-es copied to clipboard

Use String.prototype.isWellFormed() once it's widely available

Open timostamm opened this issue 5 months ago • 0 comments

Protobuf requires strings to be valid UTF-8. When serializing, we check strings via encodeUriComponent, which is far from ideal for performance.

String.prototype.isWellFormed is a suitable alternative. On Node.js, it shows significantly better performance, especially for longer strings:

$ node --version
v24.5.0
$ node ./t.ts encodeURIComponent 10
node ./t.ts isWellFormed 10
node ./t.ts encodeURIComponent 100
node ./t.ts isWellFormed 100
node ./t.ts encodeURIComponent 1000
node ./t.ts isWellFormed 1000

encodeURIComponent with string length 10: 77.23291699999999 ms
isWellFormed with string length 10: 16.621958 ms
encodeURIComponent with string length 100: 34.113417 ms
isWellFormed with string length 100: 3.5224170000000044 ms
encodeURIComponent with string length 1000: 29.431917 ms
isWellFormed with string length 1000: 0.5034169999999989 ms
Benchmark script
// t.ts
const type = process.argv[2];
let checkUtf8: (str: string) => boolean;
switch (type) {
  case "encodeURIComponent":
    checkUtf8 = function checkUtf8(str: string) {
      try {
        encodeURIComponent(str);
        return true;
      } catch (_) {
        return false;
      }
    };
    break;
  case "isWellFormed":
    checkUtf8 = function checkUtf8(str: string) {
      // @ts-expect-error
      return str.isWellFormed();
    };
    break;
  default:
    throw new Error("Unknown type: " + type);
}

const strLen = process.argv[3];
let strings: string[];
switch (strLen) {
  case "10":
    strings = new Array(1_000_000).fill("012345678¼");
    break;
  case "100":
    strings = new Array(100_000).fill("012345678¼".repeat(10));
    break;
  case "1000":
    strings = new Array(10_000).fill("012345678¼".repeat(100));
    break;
  default:
    throw new Error("Unknown strLen: " + strLen);
}

const start = performance.now();
for (const str of strings) {
  if (!checkUtf8(str) ) {
    throw new Error(`Unexpected invalid utf-8 ${str}`);
  }
}
const elapsed = performance.now() - start;
console.log(`${type} with string length ${strLen}: ${elapsed} ms`);

isWellFormed is not widely available yet, but it will be in April 2026. See the definition of "widely available" on MDN.

Related to: https://github.com/bufbuild/protobuf-es/issues/333

timostamm avatar Sep 29 '25 16:09 timostamm