node icon indicating copy to clipboard operation
node copied to clipboard

buffer: validate UTF8 on fast path

Open ronag opened this issue 1 year ago • 19 comments

Fast API handles invalid UTF differently than the slow API.

Fixes: https://github.com/nodejs/node/issues/54521 PR-URL: https://github.com/nodejs/node/pull/54525

ronag avatar Aug 23 '24 14:08 ronag

Codecov Report

Attention: Patch coverage is 88.88889% with 8 lines in your changes missing coverage. Please review.

Project coverage is 87.60%. Comparing base (981c701) to head (0ae2ae1). Report is 252 commits behind head on main.

Files with missing lines Patch % Lines
src/node_buffer.cc 88.88% 3 Missing and 5 partials :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #54526      +/-   ##
==========================================
- Coverage   87.61%   87.60%   -0.01%     
==========================================
  Files         650      650              
  Lines      182835   182889      +54     
  Branches    35382    35389       +7     
==========================================
+ Hits       160185   160227      +42     
- Misses      15925    15927       +2     
- Partials     6725     6735      +10     
Files with missing lines Coverage Δ
src/node_buffer.cc 71.41% <88.88%> (+1.46%) :arrow_up:

... and 32 files with indirect coverage changes

codecov[bot] avatar Aug 23 '24 16:08 codecov[bot]

CI: https://ci.nodejs.org/job/node-test-pull-request/61382/

nodejs-github-bot avatar Aug 23 '24 16:08 nodejs-github-bot

CI: https://ci.nodejs.org/job/node-test-pull-request/61402/

nodejs-github-bot avatar Aug 24 '24 07:08 nodejs-github-bot

With the current commits, the following test will fail:

let i = 0;
const testStr = "\xc2\x80"; // when stored in 1-byte chars, this is a valid UTF-8 encoding
const expected = Buffer.from(testStr).toString("hex");
for(; i < 1_000_000; i++) {
  const buf = Buffer.from(testStr);
  const ashex = buf.toString("hex");
  if (ashex !== expected) {
    console.log(`Decoding changed in iteration ${i} when changing to FastWriteStringUTF8, got ${ashex}, expected ${expected}`);
    break;
  }
}

if(i<1_000_000) {
  console.error("FAILED after %d iterations",i);
} else
  console.log("PASSED after %d iterations",i);

The source string is stored as UTF16 (compressed to 1 byte), but still needs to be converted to UTF-8. The fast path can only be implemented as memcpy when only characters from the range 0-127 are present.

blexrob avatar Aug 24 '24 07:08 blexrob

The source string is stored as UTF16 (compressed to 1 byte), but still needs to be converted to UTF-8. The fast path can only be implemented as memcpy when only characters from the range 0-127 are present.

I thought FastOneByteString means that it's only one byte chars?

ronag avatar Aug 24 '24 14:08 ronag

< when only characters from the range 0-127 are present

@lemira does simdutf have this? validate_ascii?

ronag avatar Aug 24 '24 14:08 ronag

@blexrob I can't get it to fail with this PR given the following example:

  let i = 0;
  const testStr = '\xc2\x80'; // when stored in 1-byte chars, this is a valid UTF-8 encoding
  const expected = Buffer.from(testStr).toString('hex');
  for(; i < 10_000_000; i++) {
    const buf = Buffer.from(testStr);
    const ashex = buf.toString('hex');
    if (ashex !== expected) {
      assert(false);
    }
  }

ronag avatar Aug 24 '24 14:08 ronag

The source string is stored as UTF16 (compressed to 1 byte), but still needs to be converted to UTF-8. The fast path can only be implemented as memcpy when only characters from the range 0-127 are present.

I thought FastOneByteString means that it's only one byte chars?

FastOneByteString means it's a string that's encoded as ISO-8859-1. It doesn't mean that it's ASCII-only.

So, yeah, the current code is definitely buggy, and you'd want to verify that it's ASCII-only if you want to stick to the current approach. validate_ascii should be the right function for that.

addaleax avatar Aug 24 '24 15:08 addaleax

@lemire what do you think about htis:

  auto dst_ptr = dst_data + offset;
  auto src_ptr = reinterpret_cast<const uint8_t*>(src.data);

  uint64_t fallback = 0;

  size_t i = 0;
  for (; i + 8 <= size; i += 8) {
    uint64_t v;
    memcpy(&v, src_ptr + i, 8);
    fallback |= v;
    memcpy(dst_ptr + i, src_ptr + i, 8);
  }

  uint64_t v = 0;
  memcpy(&v, src_ptr + i, size - i);
  fallback |= v;
  memcpy(dst_ptr + i, src_ptr + i, size - i));

  if (fallback & UINT64_C(0x8080808080808080) != 0) {
    options.fallback = true;
    return 0;
  }

  return size;

as an alternative to:

  if (!simdutf::validate_ascii(src.data, size)) {
    options.fallback = true;
    return 0;
  }

  memcpy(dst_data + offset, src.data, size);

  return size;

ronag avatar Aug 24 '24 15:08 ronag

CI: https://ci.nodejs.org/job/node-test-pull-request/61423/

nodejs-github-bot avatar Aug 24 '24 20:08 nodejs-github-bot

CI: https://ci.nodejs.org/job/node-test-pull-request/61433/

nodejs-github-bot avatar Aug 25 '24 05:08 nodejs-github-bot

@targos PTAL. I have tried to incorporate your feedback. However, the changes are less trivial now and I would ask for some more in-depth reviews.

ronag avatar Aug 25 '24 11:08 ronag

I'm getting somewhat suspicious benchmark results. Not sure whether or not it's related to this PR:

make && node benchmark/compare.js --new ./out/Release/node --runs 15 --old /Users/robertnagy/.nvm/versions/node/v22.7.0/bin/node --filter buffer-write-string.js -- buffers | Rscript benchmark/compare.R

buffers/buffer-write-string.js n=1000000 len=2048 args='' encoding='ascii'                       0.45 %       ±2.88% ±3.89%  ±5.18%
buffers/buffer-write-string.js n=1000000 len=2048 args='' encoding='utf8'               ***     19.83 %       ±3.44% ±4.64%  ±6.18%
buffers/buffer-write-string.js n=1000000 len=2048 args='offset' encoding='ascii'        ***    -50.36 %       ±6.17% ±8.55% ±11.86%
buffers/buffer-write-string.js n=1000000 len=2048 args='offset' encoding='utf8'         ***    -76.18 %       ±2.89% ±3.99%  ±5.51%

Note how the regression only happens in the offset case and also that it applies to ascii which is still only a memcpy as it was in 22.7.

However, compared to 22.6 it looks normal:

buffers/buffer-write-string.js n=1000000 len=2048 args='' encoding='ascii'                         *     11.85 %       ±8.89% ±12.01% ±16.01%
buffers/buffer-write-string.js n=1000000 len=2048 args='' encoding='utf8'                        ***     18.35 %       ±7.82% ±10.58% ±14.15%
buffers/buffer-write-string.js n=1000000 len=2048 args='offset' encoding='ascii'                          0.44 %       ±7.44% ±10.06% ±13.44%
buffers/buffer-write-string.js n=1000000 len=2048 args='offset' encoding='utf8'                  ***     22.48 %       ±9.88% ±13.33% ±17.74%

ronag avatar Aug 26 '24 08:08 ronag

CI: https://ci.nodejs.org/job/node-test-pull-request/61474/

nodejs-github-bot avatar Aug 26 '24 08:08 nodejs-github-bot

CI: https://ci.nodejs.org/job/node-test-pull-request/61485/

nodejs-github-bot avatar Aug 26 '24 12:08 nodejs-github-bot

I also added a printf at the beginning of the WriteOneByteString method, and weirdly, it is never called with this test until the function is optimized (meaning it doesn't reach line 1539 in SlowWriteString).

targos avatar Aug 26 '24 17:08 targos

I also added a printf at the beginning of the WriteOneByteString method, and weirdly, it is never called with this test until the function is optimized (meaning it doesn't reach line 1539 in SlowWriteString).

Maybe it doesn't become a one byte string before optimization at which point it goes to the fast method?

ronag avatar Aug 26 '24 17:08 ronag

@blexrob @jasnell @addaleax @anonrig @lemire C++ experts PTAL

ronag avatar Aug 26 '24 17:08 ronag

CI: https://ci.nodejs.org/job/node-test-pull-request/61574/

nodejs-github-bot avatar Aug 28 '24 05:08 nodejs-github-bot

@ronag The implementation of this PR may still have some problems. I used git apply to apply these three buffer-related patches locally

https://github.com/nodejs/node/pull/54526 https://github.com/nodejs/node/pull/54588 https://github.com/nodejs/node/pull/54586

curl -fsSL https://github.com/nodejs/node/pull/54526.patch > 54526.patch

cd /path/to/nodejs/

git patch apply --3way 54526.patch

Then I used the compiled node.exe to build a local webpack project, and the following error appeared

ERROR in ./src/index.tsx + 710 modules
Unexpected non-whitespace character after JSON at position 5 (line 1 column 6)
SyntaxError: Unexpected non-whitespace character after JSON at position 5 (line 1 column 6)
    at JSON.parse (<anonymous>)
    at ConcatenationScope.matchModuleReference (T:\2\ddb\web\ci\node_modules\.pnpm\[email protected]\node_modules\webpack\lib\ConcatenationScope.js:145:13)
    at ConcatenatedModule.codeGeneration (T:\2\ddb\web\ci\node_modules\.pnpm\[email protected]\node_modules\webpack\lib\optimize\ConcatenatedModule.js:1253:41)
    ...

ERROR in ./src/model.ts + 1 modules
Unexpected token '
                  ', "
" is not valid JSON   0?F
SyntaxError: Unexpected token '
                               ', "
" is not valid JSON                0?F
    at JSON.parse (<anonymous>)
    at ConcatenationScope.matchModuleReference (T:\2\ddb\web\ci\node_modules\.pnpm\[email protected]\node_modules\webpack\lib\ConcatenationScope.js:145:13)
    at ConcatenatedModule.codeGeneration (T:\2\ddb\web\ci\node_modules\.pnpm\[email protected]\node_modules\webpack\lib\optimize\ConcatenatedModule.js:1253:41)
    ...

After that, I removed these three fast_write_string_xxx in node_buffer.cc, as shown below, recompiled, and the above problem disappeared.

image

I tried several times, and as long as I used the fast method, the build would fail

I suggest that you try to build some actual webpack projects containing multi-language text or internationalized large projects locally to fully test buffer write

ShenHongFei avatar Aug 30 '24 07:08 ShenHongFei

@ShenHongFei Any chance I can get a full repro? It's interesting since both the slow and fast path should behave similarly now. If you don't want to share publicly you can also mail me directly.

ronag avatar Aug 30 '24 07:08 ronag

@ronag I will try to find a reproducible project when I have some free time.

ShenHongFei avatar Aug 30 '24 07:08 ShenHongFei

@ShenHongFei can you try again with your project. I think I might have found the issue.

ronag avatar Aug 30 '24 10:08 ronag

For future reference:

Simple fuzzing:

const assert = require('node:assert')
const crypto = require('node:crypto')

for (let i = 0; i < 1e6; i++) {
  const len = Math.floor(Math.random() * 128)
  const buf = Buffer.allocUnsafe(len)
  crypto.getRandomValues(buf)

  for (let n = 0; n < 16; n++) {
    const start = Math.floor(Math.random() * len)
    const end = Math.min(len, start + Math.floor(Math.random() * (len - start)))
    const src = buf.subarray(start, end)
    const str = src.toString('utf8')

    let bufNew = Buffer.alloc(buf.length);
    let bufOld = Buffer.alloc(buf.length);

    bufNew = bufNew.subarray(0, bufNew.utf8Write(str, start))
    bufOld = bufOld.subarray(0, bufOld.utf8WriteLegacy(str, start)) // old implementation

    assert.deepStrictEqual(bufNew, bufOld)
  }
}

ronag avatar Aug 30 '24 10:08 ronag

https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/1626/

ronag avatar Aug 30 '24 10:08 ronag

@ronag I tried it, and the problem has been fixed. 👍

ShenHongFei avatar Aug 30 '24 10:08 ShenHongFei

CI: https://ci.nodejs.org/job/node-test-pull-request/61707/

nodejs-github-bot avatar Aug 30 '24 13:08 nodejs-github-bot

Commit Queue failed
- Loading data for nodejs/node/pull/54526
✔  Done loading data for nodejs/node/pull/54526
----------------------------------- PR info ------------------------------------
Title      buffer: validate UTF8 on fast path  (#54526)
Author     Robert Nagy <[email protected]> (@ronag)
Branch     ronag:fast-utf8-2 -> nodejs:main
Labels     buffer, c++, author ready, needs-ci
Commits    1
 - buffer: validate UTF8 on fast path
Committers 1
 - Robert Nagy <[email protected]>
PR-URL: https://github.com/nodejs/node/pull/54526
Fixes: https://github.com/nodejs/node/issues/54521
Reviewed-By: Daniel Lemire <[email protected]>
Reviewed-By: Benjamin Gruenbaum <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: Yagiz Nizipli <[email protected]>
Reviewed-By: Paolo Insogna <[email protected]>
------------------------------ Generated metadata ------------------------------
PR-URL: https://github.com/nodejs/node/pull/54526
Fixes: https://github.com/nodejs/node/issues/54521
Reviewed-By: Daniel Lemire <[email protected]>
Reviewed-By: Benjamin Gruenbaum <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: Yagiz Nizipli <[email protected]>
Reviewed-By: Paolo Insogna <[email protected]>
--------------------------------------------------------------------------------
   ℹ  This PR was created on Fri, 23 Aug 2024 14:34:31 GMT
   ✔  Approvals: 6
   ✔  - Daniel Lemire (@lemire): https://github.com/nodejs/node/pull/54526#pullrequestreview-2257470621
   ✔  - Benjamin Gruenbaum (@benjamingr) (TSC): https://github.com/nodejs/node/pull/54526#pullrequestreview-2273156047
   ✔  - James M Snell (@jasnell) (TSC): https://github.com/nodejs/node/pull/54526#pullrequestreview-2258976699
   ✔  - Anna Henningsen (@addaleax): https://github.com/nodejs/node/pull/54526#pullrequestreview-2258958504
   ✔  - Yagiz Nizipli (@anonrig) (TSC): https://github.com/nodejs/node/pull/54526#pullrequestreview-2272343097
   ✔  - Paolo Insogna (@ShogunPanda) (TSC): https://github.com/nodejs/node/pull/54526#pullrequestreview-2260113366
   ✘  Last GitHub CI failed
   ℹ  Last Full PR CI on 2024-08-30T13:34:03Z: https://ci.nodejs.org/job/node-test-pull-request/61707/
   ℹ  Last Benchmark CI on 2024-08-30T10:28:48Z: https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/1626/
- Querying data for job/node-test-pull-request/61707/
   ✔  Last Jenkins CI successful
--------------------------------------------------------------------------------
   ✔  Aborted `git node land` session in /home/runner/work/node/node/.ncu
https://github.com/nodejs/node/actions/runs/10637813307

nodejs-github-bot avatar Aug 30 '24 19:08 nodejs-github-bot

CI: https://ci.nodejs.org/job/node-test-pull-request/61722/

nodejs-github-bot avatar Aug 30 '24 21:08 nodejs-github-bot

CI: https://ci.nodejs.org/job/node-test-pull-request/61734/

nodejs-github-bot avatar Aug 31 '24 08:08 nodejs-github-bot