UTF-8 encodings are broken
Version
22.7.0
Platform
Linux api-deployment-694785c9f5-8dd8j 5.10.223-211.872.amzn2.x86_64 #1 SMP Mon Jul 29 19:52:29 UTC 2024 x86_64 GNU/Linux
Subsystem
No response
What steps will reproduce the bug?
Hey everyone, I'm not sure how to reproduce this, but the latest Node can't parse UTF-8 anymore. It works for the first minute or two (or a couple of hours if I remove Datadog APM instrumentation) but then returns garbage for the same request. I'm just using postgres.js to fetch and nest.js for the HTTP server. No fancy buffer manipulation.
curl --location 'https://api.vapi.ai/assistant/205deb59-755c-489c-8879-7523b1318ed8' \
--header 'Accept: application/json' \
--header 'Authorization: Bearer XXXXXX'
{"id":"205deb59-755c-489c-8879-7523b1318ed8","orgId":"7616920b-4696-458b-a2aa-3453fd13ace4","name":"éñüç߯","createdAt":"2024-08-24T08:58:16.110Z","updatedAt":"2024-08-24T08:58:16.110Z","isServerUrlSecretSet":false}%
## 2 minutes later
curl --location 'https://api.vapi.ai/assistant/205deb59-755c-489c-8879-7523b1318ed8' \
--header 'Accept: application/json' \
--header 'Authorization: Bearer XXXXXX'
{"id":"205deb59-755c-489c-8879-7523b1318ed8","orgId":"7616920b-4696-458b-a2aa-3453fd13ace4","name":"������","createdAt":"2024-08-24T08:58:16.110Z","updatedAt":"2024-08-24T08:58:16.110Z","isServerUrlSecretSet":false}%
Note how éñüç߯ gets corrupted.
How often does it reproduce? Is there a required condition?
Restart the process; it works for some time and then corrupts itself.
What is the expected behavior? Why is that the expected behavior?
It should keep returning the correct text.
What do you see instead?
Garbage text
Additional information
No response
Update: UTF-8 in general is fine. The problem is specific to the extended ASCII set.
I tried other versions:
"name": "aa" # works fine
"name": "дмитрий" # works fine
"name": "💩" # works fine
"name": "¿" # doesn't work
"name": "éñüç߯" # doesn't work
Our best guess is that UTF-8-encoded extended ASCII characters and plain single-byte extended ASCII characters are getting mixed up and corrupted.
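To make that guess concrete, here is a minimal sketch (not taken from the affected application; the helper name `hexBytes` is hypothetical) comparing how the same character is laid out as UTF-8 and as single-byte Latin-1, and what a UTF-8 decoder produces when it is handed the single-byte form:

```javascript
// Hypothetical helper: hex dump of a string in a given encoding.
function hexBytes(str, encoding) {
  return Buffer.from(str, encoding).toString('hex');
}

const utf8Hex = hexBytes('¿', 'utf8');     // two bytes in UTF-8
const latin1Hex = hexBytes('¿', 'latin1'); // one byte in Latin-1

// A lone Latin-1 byte in the 0x80-0xFF range is not valid UTF-8, so the
// decoder substitutes U+FFFD (the replacement character) for it.
const misread = Buffer.from('¿', 'latin1').toString('utf8');

console.log({ utf8Hex, latin1Hex, misread });
```

If the wrong byte layout ends up in a response body, clients that decode it as UTF-8 would show replacement characters like the ones in the second curl output.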
cc @ronag if you have any ideas.
Found 3 commits related to encoding:
- https://github.com/nodejs/node/commit/f7f7b0c4988cf83044ff94e7efc8b0e3fdeaef94
- https://github.com/nodejs/node/commit/28ca678f81e3579800d7201fbdad498d16cc0995
- https://github.com/nodejs/node/commit/8ba53ae7b71899c0ba3fc6826a2eb27c88f9da2a
Hi! v22.7.0 has a few known buffer issues, so could you provide a minimal reproduction so the issue can be narrowed down?
Additionally, could you self-moderate your comment containing curse-words, as it may be offensive to some viewers?
Edit: Thanks!
Possibly a duplicate of: #54521
Can you check if this fixes it? https://github.com/nodejs/node/pull/54526
@RedYetiDev
- Ah gotcha, likely too hard to narrow down to minimal reproduction. If you wanna get on a call, I can show you a reproduction that happens within ~8 minutes. Happy to just test when the patch comes out though.
- Done
@ronag
- Similarly, happy to test when the patch comes out. I can set up the build if you really want.
Thanks for confirming both. I'll roll back to 22.6.0 for now.
We also ran into this issue, took us forever to find the culprit
We reproduced it by having a simple express http handler deployed on amazon app runner:
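// The opening line was missing from the original post; the name `handler` is assumed.
const handler = (req, res) => {
  res.status(200).json({
    umlaute: 'äöü',
  });
};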
Further Info: node 22.6 seems not to be affected, but 22.7 is
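For anyone who wants a safety net while pinning versions, a startup guard along these lines may help. This is only a sketch: the names `AFFECTED_VERSIONS` and `isAffectedNodeVersion` are made up for illustration, and the version list reflects only what has been reported in this thread.

```javascript
// Assumption: only v22.7.0 is affected, as reported in this thread.
const AFFECTED_VERSIONS = new Set(['v22.7.0']);

// Hypothetical helper: true when the given version string is on the list.
function isAffectedNodeVersion(version = process.version) {
  return AFFECTED_VERSIONS.has(version);
}

if (isAffectedNodeVersion()) {
  console.warn(
    `Node.js ${process.version} may corrupt non-ASCII Latin-1 strings ` +
    'when writing them to buffers; consider pinning a different version.'
  );
}
```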
I'm seeing this show up as failed PostgreSQL queries that contain an umlaut as a parameter, with a cryptic-looking error:
Unable to execute query: "invalid byte sequence for encoding "UTF8"
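That error is consistent with single-byte (Latin-1) data being sent where UTF-8 is expected. As a rough illustration (this is not how PostgreSQL itself validates input, and `isValidUtf8` is a hypothetical helper), a strict decoder rejects such bytes:

```javascript
// Hypothetical helper: returns true only if `bytes` is well-formed UTF-8.
// TextDecoder with { fatal: true } throws instead of inserting U+FFFD.
function isValidUtf8(bytes) {
  try {
    new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return true;
  } catch {
    return false;
  }
}

const asUtf8 = Buffer.from('äöü', 'utf8');     // what the server expects
const asLatin1 = Buffer.from('äöü', 'latin1'); // one byte per character

console.log(isValidUtf8(asUtf8), isValidUtf8(asLatin1));
```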
~~Further, I couldn't reproduce it on Apple silicon, but it fails reliably on Linux.~~
edit: the test case provided by @blexrob fails reliably in 22.7.0 on both Apple Silicon and Linux (22.6.0 works as expected)
let i = 0;
const testStr = "jürge";
const expected = Buffer.from(testStr).toString("hex");
for (; i < 1_000_000; i++) {
  const buf = Buffer.from(testStr);
  const ashex = buf.toString("hex");
  if (ashex !== expected) {
    console.log(`Decoding changed in iteration ${i} when changing to FastWriteStringUTF8, got ${ashex}, expected ${expected}`);
    break;
  }
}
if (i < 1_000_000) {
  console.error("FAILED after %d iterations", i);
} else {
  console.log("PASSED after %d iterations", i);
}
I'd like to remind everyone that "me too" comments only add noise to this already noisy topic. Please refrain from commenting until you have something to add to the conversation
Edit: this isn't directed at any comments. This is meant to deter future "me too" comments, as they occur often with issues like this.
I feel like one of the patches in v22.8.0 (#54560), when it lands, will resolve this issue. Once that lands, please post a comment on whether it resolves this issue. Given its current state, that could be a few days.
I assume you have already tracked this down, but I believe the issue is basically:
- After enough calls to some string to buffer writing API, V8 optimizes the call to use the Fast API path.
- Node.js' `FastWriteString` (I'm only guessing this is the API in question, but it is likely the one) gets the call and incorrectly assumes that the `v8::FastOneByteString` it is given is ASCII. This is not the case: `OneByteString` is Latin-1 encoded.
- The function then directly copies the data into the destination buffer. The buffer is now assumed to contain the string's data as UTF-8, but instead contains the data as Latin-1.
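The steps above can be approximated in plain JavaScript. This is a simulation under stated assumptions, not Node.js's actual C++ code: `writeUtf8` and `writeAssumingAscii` are hypothetical names, and the faulty path is modeled by encoding as Latin-1, which copies one byte per character the way a raw copy of a one-byte string would.

```javascript
// Correct path: transcode the string to UTF-8.
function writeUtf8(str) {
  return Buffer.from(str, 'utf8');
}

// Simulated faulty path: copy one byte per character, which is only
// equivalent to UTF-8 when every character is ASCII.
function writeAssumingAscii(str) {
  return Buffer.from(str, 'latin1');
}

for (const s of ['aa', 'jürge']) {
  const good = writeUtf8(s);
  const bad = writeAssumingAscii(s);
  console.log(s, {
    utf8: good.toString('hex'),
    rawCopy: bad.toString('hex'),
    identical: good.equals(bad),
    rawCopyReadAsUtf8: bad.toString('utf8'),
  });
}
```

Pure ASCII input comes out identical on both paths, which would explain why `"aa"` was unaffected. Strings with characters outside Latin-1 (Cyrillic, emoji) are not one-byte strings and so are not modeled here.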
#54565 will fix this issue for the v22.8.0 release. Then, #54526 (and similar) will be evaluated for a future release.
When #54565 lands, I'll close this issue
This kept us up a couple of nights. Thank you for fixing it!
I am not familiar with the internals of node, but I just lost a few days because of this. I could not replicate efficiently because it takes a while to happen.
What gave it away was that the same request lifecycle returned an intact UTF-8 string to the browser and a corrupted one to the logger service, which spins up a new worker for log transport.
> When #54565 lands, I'll close this issue

This PR has landed. Expect the release to follow shortly.
I hope that the test coverage gets improved with the fix in 22.8
Did this land in 22.8.0? Could not find it in the release notes:
https://nodejs.org/en/blog/release/v22.8.0
[e071651bb2] - src: disable fast methods for `buffer.write` (Michaël Zasso) #54565
If this broke your data when saving to Mongo, here is a script to help fix it: https://github.com/nicholas-long/mongo-node-fix-54543