buffer: validate UTF8 on fast path
Fast API handles invalid UTF differently than the slow API.
Fixes: https://github.com/nodejs/node/issues/54521 PR-URL: https://github.com/nodejs/node/pull/54525
Codecov Report
Attention: Patch coverage is 88.88889% with 8 lines in your changes missing coverage. Please review.
Project coverage is 87.60%. Comparing base (
981c701) to head (0ae2ae1). Report is 252 commits behind head on main.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/node_buffer.cc | 88.88% | 3 Missing and 5 partials :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## main #54526 +/- ##
==========================================
- Coverage 87.61% 87.60% -0.01%
==========================================
Files 650 650
Lines 182835 182889 +54
Branches 35382 35389 +7
==========================================
+ Hits 160185 160227 +42
- Misses 15925 15927 +2
- Partials 6725 6735 +10
| Files with missing lines | Coverage Δ | |
|---|---|---|
| src/node_buffer.cc | 71.41% <88.88%> (+1.46%) |
:arrow_up: |
CI: https://ci.nodejs.org/job/node-test-pull-request/61382/
CI: https://ci.nodejs.org/job/node-test-pull-request/61402/
With the current commits, the following test will fail:
let i = 0;
const testStr = "\xc2\x80"; // when stored in 1-byte chars, this is a valid UTF-8 encoding
const expected = Buffer.from(testStr).toString("hex");
for(; i < 1_000_000; i++) {
const buf = Buffer.from(testStr);
const ashex = buf.toString("hex");
if (ashex !== expected) {
console.log(`Decoding changed in iteration ${i} when changing to FastWriteStringUTF8, got ${ashex}, expected ${expected}`);
break;
}
}
if(i<1_000_000) {
console.error("FAILED after %d iterations",i);
} else
console.log("PASSED after %d iterations",i);
The source string is stored as UTF16 (compressed to 1 byte), but still needs to be converted to UTF-8. The fast path can only be implemented as memcpy when only characters from the range 0-127 are present.
The source string is stored as UTF16 (compressed to 1 byte), but still needs to be converted to UTF-8. The fast path can only be implemented as memcpy when only characters from the range 0-127 are present.
I thought FastOneByteString means that it's only one byte chars?
< when only characters from the range 0-127 are present
@lemira does simdutf have this? validate_ascii?
@blexrob I can't get it to fail with this PR given the following example:
let i = 0;
const testStr = '\xc2\x80'; // when stored in 1-byte chars, this is a valid UTF-8 encoding
const expected = Buffer.from(testStr).toString('hex');
for(; i < 10_000_000; i++) {
const buf = Buffer.from(testStr);
const ashex = buf.toString('hex');
if (ashex !== expected) {
assert(false);
}
}
The source string is stored as UTF16 (compressed to 1 byte), but still needs to be converted to UTF-8. The fast path can only be implemented as memcpy when only characters from the range 0-127 are present.
I thought
FastOneByteStringmeans that it's only one byte chars?
FastOneByteString means it's a string that's encoded as ISO-8859-1. It doesn't mean that it's ASCII-only.
So, yeah, the current code is definitely buggy, and you'd want to verify that it's ASCII-only if you want to stick to the current approach. validate_ascii should be the right function for that.
@lemire what do you think about htis:
auto dst_ptr = dst_data + offset;
auto src_ptr = reinterpret_cast<const uint8_t*>(src.data);
uint64_t fallback = 0;
size_t i = 0;
for (; i + 8 <= size; i += 8) {
uint64_t v;
memcpy(&v, src_ptr + i, 8);
fallback |= v;
memcpy(dst_ptr + i, src_ptr + i, 8);
}
uint64_t v = 0;
memcpy(&v, src_ptr + i, size - i);
fallback |= v;
memcpy(dst_ptr + i, src_ptr + i, size - i));
if (fallback & UINT64_C(0x8080808080808080) != 0) {
options.fallback = true;
return 0;
}
return size;
as an alternative to:
if (!simdutf::validate_ascii(src.data, size)) {
options.fallback = true;
return 0;
}
memcpy(dst_data + offset, src.data, size);
return size;
CI: https://ci.nodejs.org/job/node-test-pull-request/61423/
CI: https://ci.nodejs.org/job/node-test-pull-request/61433/
@targos PTAL. I have tried to incorporate your feedback. However, the changes are less trivial now and I would ask for some more in-depth reviews.
I'm getting somewhat suspicious benchmark results. Not sure whether or not it's related to this PR:
make && node benchmark/compare.js --new ./out/Release/node --runs 15 --old /Users/robertnagy/.nvm/versions/node/v22.7.0/bin/node --filter buffer-write-string.js -- buffers | Rscript benchmark/compare.R
buffers/buffer-write-string.js n=1000000 len=2048 args='' encoding='ascii' 0.45 % ±2.88% ±3.89% ±5.18%
buffers/buffer-write-string.js n=1000000 len=2048 args='' encoding='utf8' *** 19.83 % ±3.44% ±4.64% ±6.18%
buffers/buffer-write-string.js n=1000000 len=2048 args='offset' encoding='ascii' *** -50.36 % ±6.17% ±8.55% ±11.86%
buffers/buffer-write-string.js n=1000000 len=2048 args='offset' encoding='utf8' *** -76.18 % ±2.89% ±3.99% ±5.51%
Note how the regression only happens in the offset case and also that it applies to ascii which is still only a memcpy as it was in 22.7.
However, compared to 22.6 it looks normal:
buffers/buffer-write-string.js n=1000000 len=2048 args='' encoding='ascii' * 11.85 % ±8.89% ±12.01% ±16.01%
buffers/buffer-write-string.js n=1000000 len=2048 args='' encoding='utf8' *** 18.35 % ±7.82% ±10.58% ±14.15%
buffers/buffer-write-string.js n=1000000 len=2048 args='offset' encoding='ascii' 0.44 % ±7.44% ±10.06% ±13.44%
buffers/buffer-write-string.js n=1000000 len=2048 args='offset' encoding='utf8' *** 22.48 % ±9.88% ±13.33% ±17.74%
CI: https://ci.nodejs.org/job/node-test-pull-request/61474/
CI: https://ci.nodejs.org/job/node-test-pull-request/61485/
I also added a printf at the beginning of the WriteOneByteString method, and weirdly, it is never called with this test until the function is optimized (meaning it doesn't reach line 1539 in SlowWriteString).
I also added a
printfat the beginning of theWriteOneByteStringmethod, and weirdly, it is never called with this test until the function is optimized (meaning it doesn't reach line 1539 inSlowWriteString).
Maybe it doesn't become a one byte string before optimization at which point it goes to the fast method?
@blexrob @jasnell @addaleax @anonrig @lemire C++ experts PTAL
CI: https://ci.nodejs.org/job/node-test-pull-request/61574/
@ronag The implementation of this PR may still have some problems. I used git apply to apply these three buffer-related patches locally
https://github.com/nodejs/node/pull/54526 https://github.com/nodejs/node/pull/54588 https://github.com/nodejs/node/pull/54586
curl -fsSL https://github.com/nodejs/node/pull/54526.patch > 54526.patch
cd /path/to/nodejs/
git patch apply --3way 54526.patch
Then I used the compiled node.exe to build a local webpack project, and the following error appeared
ERROR in ./src/index.tsx + 710 modules
Unexpected non-whitespace character after JSON at position 5 (line 1 column 6)
SyntaxError: Unexpected non-whitespace character after JSON at position 5 (line 1 column 6)
at JSON.parse (<anonymous>)
at ConcatenationScope.matchModuleReference (T:\2\ddb\web\ci\node_modules\.pnpm\[email protected]\node_modules\webpack\lib\ConcatenationScope.js:145:13)
at ConcatenatedModule.codeGeneration (T:\2\ddb\web\ci\node_modules\.pnpm\[email protected]\node_modules\webpack\lib\optimize\ConcatenatedModule.js:1253:41)
...
ERROR in ./src/model.ts + 1 modules
Unexpected token '
', "
" is not valid JSON 0?F
SyntaxError: Unexpected token '
', "
" is not valid JSON 0?F
at JSON.parse (<anonymous>)
at ConcatenationScope.matchModuleReference (T:\2\ddb\web\ci\node_modules\.pnpm\[email protected]\node_modules\webpack\lib\ConcatenationScope.js:145:13)
at ConcatenatedModule.codeGeneration (T:\2\ddb\web\ci\node_modules\.pnpm\[email protected]\node_modules\webpack\lib\optimize\ConcatenatedModule.js:1253:41)
...
After that, I removed these three fast_write_string_xxx in node_buffer.cc, as shown below, recompiled, and the above problem disappeared.
I tried several times, and as long as I used the fast method, the build would fail
I suggest that you try to build some actual webpack projects containing multi-language text or internationalized large projects locally to fully test buffer write
@ShenHongFei Any chance I can get a full repro? It's interesting since both the slow and fast path should behave similarly now. If you don't want to share publicly you can also mail me directly.
@ronag I will try to find a reproducible project when I have some free time.
@ShenHongFei can you try again with your project. I think I might have found the issue.
For future reference:
Simple fuzzing:
const assert = require('node:assert')
const crypto = require('node:crypto')
for (let i = 0; i < 1e6; i++) {
const len = Math.floor(Math.random() * 128)
const buf = Buffer.allocUnsafe(len)
crypto.getRandomValues(buf)
for (let n = 0; n < 16; n++) {
const start = Math.floor(Math.random() * len)
const end = Math.min(len, start + Math.floor(Math.random() * (len - start)))
const src = buf.subarray(start, end)
const str = src.toString('utf8')
let bufNew = Buffer.alloc(buf.length);
let bufOld = Buffer.alloc(buf.length);
bufNew = bufNew.subarray(0, bufNew.utf8Write(str, start))
bufOld = bufOld.subarray(0, bufOld.utf8WriteLegacy(str, start)) // old implementation
assert.deepStrictEqual(bufNew, bufOld)
}
}
https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/1626/
@ronag I tried it, and the problem has been fixed. 👍
CI: https://ci.nodejs.org/job/node-test-pull-request/61707/
Commit Queue failed
- Loading data for nodejs/node/pull/54526 ✔ Done loading data for nodejs/node/pull/54526 ----------------------------------- PR info ------------------------------------ Title buffer: validate UTF8 on fast path (#54526) Author Robert Nagy <[email protected]> (@ronag) Branch ronag:fast-utf8-2 -> nodejs:main Labels buffer, c++, author ready, needs-ci Commits 1 - buffer: validate UTF8 on fast path Committers 1 - Robert Nagy <[email protected]> PR-URL: https://github.com/nodejs/node/pull/54526 Fixes: https://github.com/nodejs/node/issues/54521 Reviewed-By: Daniel Lemire <[email protected]> Reviewed-By: Benjamin Gruenbaum <[email protected]> Reviewed-By: James M Snell <[email protected]> Reviewed-By: Anna Henningsen <[email protected]> Reviewed-By: Yagiz Nizipli <[email protected]> Reviewed-By: Paolo Insogna <[email protected]> ------------------------------ Generated metadata ------------------------------ PR-URL: https://github.com/nodejs/node/pull/54526 Fixes: https://github.com/nodejs/node/issues/54521 Reviewed-By: Daniel Lemire <[email protected]> Reviewed-By: Benjamin Gruenbaum <[email protected]> Reviewed-By: James M Snell <[email protected]> Reviewed-By: Anna Henningsen <[email protected]> Reviewed-By: Yagiz Nizipli <[email protected]> Reviewed-By: Paolo Insogna <[email protected]> -------------------------------------------------------------------------------- ℹ This PR was created on Fri, 23 Aug 2024 14:34:31 GMT ✔ Approvals: 6 ✔ - Daniel Lemire (@lemire): https://github.com/nodejs/node/pull/54526#pullrequestreview-2257470621 ✔ - Benjamin Gruenbaum (@benjamingr) (TSC): https://github.com/nodejs/node/pull/54526#pullrequestreview-2273156047 ✔ - James M Snell (@jasnell) (TSC): https://github.com/nodejs/node/pull/54526#pullrequestreview-2258976699 ✔ - Anna Henningsen (@addaleax): https://github.com/nodejs/node/pull/54526#pullrequestreview-2258958504 ✔ - Yagiz Nizipli (@anonrig) (TSC): https://github.com/nodejs/node/pull/54526#pullrequestreview-2272343097 ✔ - Paolo Insogna (@ShogunPanda) (TSC): https://github.com/nodejs/node/pull/54526#pullrequestreview-2260113366 ✘ Last GitHub CI failed ℹ Last Full PR CI on 2024-08-30T13:34:03Z: https://ci.nodejs.org/job/node-test-pull-request/61707/ ℹ Last Benchmark CI on 2024-08-30T10:28:48Z: https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/1626/ - Querying data for job/node-test-pull-request/61707/ ✔ Last Jenkins CI successful -------------------------------------------------------------------------------- ✔ Aborted `git node land` session in /home/runner/work/node/node/.ncuhttps://github.com/nodejs/node/actions/runs/10637813307
CI: https://ci.nodejs.org/job/node-test-pull-request/61722/
CI: https://ci.nodejs.org/job/node-test-pull-request/61734/