util: add a method to get string encoding info
Currently, we must check the encoding of a string before using latin1WriteStatic or ucs2Write to write it to a buffer.
This PR adds a method to return the encoding information from V8.
Refs: #56090
Codecov Report
Attention: Patch coverage is 89.65517% with 3 lines in your changes missing coverage. Please review.
Project coverage is 88.54%. Comparing base (3c2da4b) to head (bd086ce). Report is 592 commits behind head on main.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/node_v8.cc | 81.25% | 0 Missing and 3 partials :warning: |
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##             main   #56147      +/-   ##
==========================================
+ Coverage   87.99%   88.54%   +0.55%
==========================================
  Files         656      657       +1
  Lines      188999   190794    +1795
  Branches    35981    36611     +630
==========================================
+ Hits       166301   168933    +2632
+ Misses      15865    15044     -821
+ Partials     6833     6817      -16
```
| Files with missing lines | Coverage Δ | |
|---|---|---|
| lib/v8.js | 99.34% <100.00%> (+0.70%) | :arrow_up: |
| src/node_external_reference.h | 100.00% <ø> (ø) | |
| src/node_v8.cc | 90.67% <81.25%> (-0.52%) | :arrow_down: |
I've been waiting for some feedback on this PR, could someone take a look at this? I'd appreciate any feedback you can provide. Thanks! @nodejs-github-bot
CI: https://ci.nodejs.org/job/node-test-pull-request/64079/
Actually, I see that @joyeecheung already left a number of helpful comments on the original ticket. Moving this to the v8 API makes a bit more sense, because it really is engine-specific, and she suggests a better name for it that reflects the subtleties involved here (isOneByteRepresentation), plus points out that utf16le isn't accurate because we'd need to take platform endianness into account.
So, Joyee made a lot of good suggestions here already, and you'll probably just want to incorporate them.
Understood, I'll correct the API placement. Thanks for pointing it out.
Removing the 'changes requested' marker, but this should have better documentation and, ideally, a use case that explains why we're adding this to the API.
Could you please review the documentation changes and trigger the code CI?
CI: https://ci.nodejs.org/job/node-test-pull-request/64224/
Looks like a flaky test. Needs a re-run I think. @nodejs-github-bot
CI: https://ci.nodejs.org/job/node-test-pull-request/64228/
Needs a re-run again, another test has failed this time. @nodejs-github-bot
CI: https://ci.nodejs.org/job/node-test-pull-request/64231/
The CI checks have passed, and your review has been addressed. Would it be possible to merge this PR now? Thank you for your great support!
@nodejs-github-bot Needs a re-run please, I have corrected the documentation.
CI: https://ci.nodejs.org/job/node-test-pull-request/64253/
@addaleax I've updated the PR based on your feedback. Could you please take a moment to check if it's ready to merge or if there's anything else that needs attention?
There is still no clear use case for this. The OP mentions `ucs2Write` and `latin1WriteStatic`, but these are internal methods, so they cannot be used as an argument for adding this method publicly.
Sorry, the phrasing in the OP was misleading. I didn't intend to use those private APIs; I used buffer.write in the documentation. I have corrected the OP.
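For reference, `buf.write` is the public counterpart to those internal writers; a small sketch of it with explicit encodings:

```javascript
// Buffer.prototype.write(string, offset, encoding) returns the number of
// bytes written, which differs per encoding for the same string.
const buf = Buffer.alloc(16);
const latin1Bytes = buf.write('abc', 0, 'latin1');  // one byte per character
const utf16Bytes = buf.write('abc', 0, 'utf16le');  // two bytes per character
console.log(latin1Bytes, utf16Bytes); // 3 6
```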
Can you please answer @addaleax's question? https://github.com/nodejs/node/pull/56147#pullrequestreview-2524315487
Following her suggestion in the comment https://github.com/nodejs/node/pull/56147#discussion_r1898670399, I have written an example in the documentation to explain why we need this API. https://github.com/nodejs/node/pull/56147/files#diff-fc79b2d1ad702cfaf107d5880b73e8360b36273edda73f128c00641637435c3cR1368
@theweipeng Sure, but to be clear, I wasn't asking for a complex example in the documentation (because complex documentation can easily distract from the important bit: what the method actually does). I'd say that earlier versions of the documentation in this PR were better in that regard.
Based on the documentation example here, I can measure about a 0.7–1.4% runtime performance increase with this method. I guess that's a reason to add this method; it's not a particularly significant difference, and the benefits of using less space will likely significantly outweigh the benefits of faster data copying (so if this is the reason to add this, and if we keep this full-featured example in the documentation, we should at least be honest about it and not refer to significant performance benefits).
Either way, I'm not blocking anything here. If this is good with @jasnell it's good with me.
Thank you for your patient guidance, I've learned a lot from it. I will simplify that document. Regarding your test, may I take a look at your test code? The improvement is quite noticeable on my machine. Here is my test code:
```js
const { isStringOneByteRepresentation } = require("v8");
const bf = Buffer.alloc(1000);
function benchmark(input, topic) {
  console.time("before " + topic);
  for (let index = 0; index < 999999; index++) {
    // Length prefix at offset 0, string payload at offset 4.
    bf.writeUint32LE(Buffer.byteLength(input, 'utf8'), 0);
    bf.write(input, 4, 'utf8');
  }
  console.timeEnd("before " + topic);
  console.time("after " + topic);
  for (let index = 0; index < 999999; index++) {
    if (isStringOneByteRepresentation(input)) {
      bf.writeUint32LE(input.length, 0);
      bf.write(input, 4, 'latin1');
    } else {
      bf.writeUint32LE(input.length * 2, 0);
      bf.write(input, 4, 'utf16le');
    }
  }
  console.timeEnd("after " + topic);
  console.log("\n");
}
benchmark(new Array(1).fill(0).map(x => "qwertyuiopasdfghjklzxcvbnm").join(''), "short latin1");
benchmark(new Array(1).fill(0).map(x => "πππππππππππππππππ").join(''), "short utf16");
benchmark(new Array(5).fill(0).map(x => "qwertyuiopasdfghjklzxcvbnm").join(''), "long latin1");
benchmark(new Array(5).fill(0).map(x => "πππππππππππππππππ").join(''), "long utf16");
```
And here are the results:
```text
before short latin1: 52.94ms
after short latin1: 21.978ms
before short utf16: 123.66ms
after short utf16: 55.009ms
before long latin1: 35.731ms
after long latin1: 24.655ms
before long utf16: 353.766ms
after long utf16: 58.19ms
```
I think the improvement comes from two aspects: one is the calculation of the byte length of strings, and the other is copying.
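To illustrate the byte-length half of that claim (a sketch, not part of the PR): `Buffer.byteLength(s, 'utf8')` has to inspect the string, while `.length` is a stored code-unit count, and the two diverge for non-Latin-1 input:

```javascript
// For pure-ASCII input the UTF-8 byte length equals the code-unit length,
// but it still has to be computed; for non-Latin-1 characters they diverge.
const ascii = 'qwertyuiopasdfghjklzxcvbnm';
console.log(Buffer.byteLength(ascii, 'utf8'), ascii.length); // 26 26
console.log(Buffer.byteLength('\u03c0', 'utf8'), '\u03c0'.length); // 2 1
```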
CI: https://ci.nodejs.org/job/node-test-pull-request/64311/
@theweipeng I was comparing the current version against always using UTF16-LE, to be clear, not against UTF-8. Otherwise I don't think you end up with a fair comparison (UTF-8 mainly has space saving advantages, but this seems to be about runtime performance instead).
@addaleax
I compared the current version against always using UTF-16LE, and there is still a noticeable improvement when the string is in Latin-1. I think this improvement comes from avoiding the Latin-1 to UTF-16 encoding step.
My machine: Apple M2 Pro 16GB
Here is my code:
```js
const { isStringOneByteRepresentation } = require("v8");
const bf = Buffer.alloc(1000);
function benchmark(input, topic) {
  console.time("before " + topic);
  for (let index = 0; index < 999999; index++) {
    // Length prefix at offset 0, string payload at offset 4.
    bf.writeUint32LE(input.length * 2, 0);
    bf.write(input, 4, 'utf16le');
  }
  console.timeEnd("before " + topic);
  console.time("after " + topic);
  for (let index = 0; index < 999999; index++) {
    if (isStringOneByteRepresentation(input)) {
      bf.writeUint32LE(input.length, 0);
      bf.write(input, 4, 'latin1');
    } else {
      bf.writeUint32LE(input.length * 2, 0);
      bf.write(input, 4, 'utf16le');
    }
  }
  console.timeEnd("after " + topic);
  console.log("\n");
}
benchmark(new Array(1).fill(0).map(x => "qwertyuiopasdfghjklzxcvbnm").join(''), "short latin1");
benchmark(new Array(1).fill(0).map(x => "πππππππππππππππππ").join(''), "short utf16");
benchmark(new Array(5).fill(0).map(x => "qwertyuiopasdfghjklzxcvbnm").join(''), "long latin1");
benchmark(new Array(5).fill(0).map(x => "πππππππππππππππππ").join(''), "long utf16");
```
Here are the results:
```text
before short latin1: 73.69ms
after short latin1: 22.572ms
before short utf16: 60.322ms
after short utf16: 57.602ms
before long latin1: 62.184ms
after long latin1: 25.765ms
before long utf16: 61.797ms
after long utf16: 56.201ms
```
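The Latin-1 numbers above line up with the raw byte counts involved: writing a one-byte-representable string as utf16le copies twice as many bytes. A quick check (not part of the PR):

```javascript
// The same 26-character string costs twice the bytes when encoded as UTF-16LE.
const s = 'qwertyuiopasdfghjklzxcvbnm';
console.log(Buffer.byteLength(s, 'latin1'), Buffer.byteLength(s, 'utf16le')); // 26 52
```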
@addaleax My benchmark results are quite good. Could you please help check them?
CI: https://ci.nodejs.org/job/node-test-pull-request/65361/
CI: https://ci.nodejs.org/job/node-test-pull-request/65366/
If thereβs anything else I can do to help move this forward, please let me know. Iβm more than happy to make any additional changes or provide further details.
Landed in 6cb0690fccd6