node icon indicating copy to clipboard operation
node copied to clipboard

util: add a method to get string encoding info

Open theweipeng opened this issue 11 months ago β€’ 1 comments

util: Add a method to get the encoding information of a string.

Currently, we should check the encoding of the string before we use latin1WriteStatic and ucs2Write to write to the buffer. This PR adds a method to return the encoding information from V8.

Refs: #56090

theweipeng avatar Dec 05 '24 16:12 theweipeng

Codecov Report

Attention: Patch coverage is 89.65517% with 3 lines in your changes missing coverage. Please review.

Project coverage is 88.54%. Comparing base (3c2da4b) to head (bd086ce). Report is 592 commits behind head on main.

Files with missing lines Patch % Lines
src/node_v8.cc 81.25% 0 Missing and 3 partials :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #56147      +/-   ##
==========================================
+ Coverage   87.99%   88.54%   +0.55%     
==========================================
  Files         656      657       +1     
  Lines      188999   190794    +1795     
  Branches    35981    36611     +630     
==========================================
+ Hits       166301   168933    +2632     
+ Misses      15865    15044     -821     
+ Partials     6833     6817      -16     
Files with missing lines Coverage Ξ”
lib/v8.js 99.34% <100.00%> (+0.70%) :arrow_up:
src/node_external_reference.h 100.00% <ΓΈ> (ΓΈ)
src/node_v8.cc 90.67% <81.25%> (-0.52%) :arrow_down:

... and 138 files with indirect coverage changes

codecov[bot] avatar Dec 05 '24 22:12 codecov[bot]

I've been waiting for some feedback on this PR, could someone take a look at this ? I'd appreciate any feedback you can provide. Thanks! @nodejs-github-bot

theweipeng avatar Dec 17 '24 07:12 theweipeng

CI: https://ci.nodejs.org/job/node-test-pull-request/64079/

nodejs-github-bot avatar Dec 17 '24 10:12 nodejs-github-bot

Actually, I see that @joyeecheung already left a number of helpful comments on the original ticket. Moving this to the v8 API makes a bit more sense, because it really is engine-specific, and she suggests a better name for it that reflects the subtleties involved here (isOneByteRepresentation), plus points out that utf16le isn't accurate because we'd need to take platform endianness into account.

So – Joyee made a lot of good suggestions here already, and you'll probably just want to incorporate them.

addaleax avatar Dec 17 '24 20:12 addaleax

Actually, I see that @joyeecheung already left a number of helpful comments on the original ticket. Moving this to the v8 API makes a bit more sense, because it really is engine-specific, and she suggests a better name for it that reflects the subtleties involved here (isOneByteRepresentation), plus points out that utf16le isn't accurate because we'd need to take platform endianness into account.

So – Joyee made a lot of good suggestions here already, and you'll probably just want to incorporate them.

Understood, I'll correct the API placement. Thanks for pointing it out.

theweipeng avatar Dec 18 '24 02:12 theweipeng

Removing the 'changes requested' marker but this should have better documentation and, ideally, a use case that explains why we're adding this to the API

Could you please review the document changes and trigger the code CI?

theweipeng avatar Dec 26 '24 06:12 theweipeng

CI: https://ci.nodejs.org/job/node-test-pull-request/64224/

nodejs-github-bot avatar Dec 26 '24 16:12 nodejs-github-bot

Looks like a flaky test. Needs a re-run I think. @nodejs-github-bot

theweipeng avatar Dec 27 '24 01:12 theweipeng

CI: https://ci.nodejs.org/job/node-test-pull-request/64228/

nodejs-github-bot avatar Dec 27 '24 03:12 nodejs-github-bot

Needs a re-run again, another test has failed this time. @nodejs-github-bot

theweipeng avatar Dec 27 '24 05:12 theweipeng

CI: https://ci.nodejs.org/job/node-test-pull-request/64231/

nodejs-github-bot avatar Dec 27 '24 08:12 nodejs-github-bot

The CI checks have passed, and your review has been addressed. Would it be possible to merge this PR now? Thank you for your great support!

theweipeng avatar Dec 27 '24 09:12 theweipeng

@nodejs-github-bot Needs a re-run please, I have corrected the documentation.

theweipeng avatar Dec 28 '24 11:12 theweipeng

CI: https://ci.nodejs.org/job/node-test-pull-request/64253/

nodejs-github-bot avatar Dec 29 '24 16:12 nodejs-github-bot

@addaleax I've updated the PR based on your feedback. Could you please take a moment to check if it's ready to merge or if there's anything else that needs attention?

theweipeng avatar Dec 31 '24 06:12 theweipeng

There is still no clear use case for this. The OP mentions ucs2Write and latin1WriteStatic but these are internal methods so cannot be used as arguments to add this method publicly.

Sorry, I had a problem with the expression in the OP. I didn't think to use these private APIs; I used buffer.write in the documentation. I have corrected the OP.

theweipeng avatar Dec 31 '24 07:12 theweipeng

Can you please answer @addaleax's question ? https://github.com/nodejs/node/pull/56147#pullrequestreview-2524315487

targos avatar Dec 31 '24 08:12 targos

Can you please answer @addaleax's question ? #56147 (review)

Following her suggestion in the comment https://github.com/nodejs/node/pull/56147#discussion_r1898670399, I have written an example in the documentation to explain why we need this API. https://github.com/nodejs/node/pull/56147/files#diff-fc79b2d1ad702cfaf107d5880b73e8360b36273edda73f128c00641637435c3cR1368

theweipeng avatar Dec 31 '24 08:12 theweipeng

@theweipeng Sure, but to be clear, I wasn't asking for a complex example in the documentation (because complex documentation can easily distract from the important bit – what the method actually does). I'd say that earlier versions of the documentation in this PR were better in that regard.

Based on the documentation example here, I can measure about a 0.7–1.4% runtime performance increase with this method. I guess that's a reason to add this method – it's not a particularly significant difference, and the benefits of using less space will likely significantly outweigh the benefits of faster data copying (so if this is the reason to add this, and if we keep this full-featured example in the documentation, we should at least be honest about it and not refer to significant performance benefits).

Either way, I'm not blocking anything here. If this is good with @jasnell it's good with me.

addaleax avatar Jan 03 '25 00:01 addaleax

@theweipeng Sure, but to be clear, I wasn't asking for a complex example in the documentation (because complex documentation can easily distract from the important bit – what the method actually does). I'd say that earlier versions of the documentation in this PR were better in that regard.

Based on the documentation example here, I can measure about a 0.7–1.4% runtime performance increase with this method. I guess that's a reason to add this method – it's not a particularly significant difference, and the benefits of using less space will likely significantly outweigh the benefits of faster data copying (so if this is the reason to add this, and if we keep this full-featured example in the documentation, we should at least be honest about it and not refer to significant performance benefits).

Either way, I'm not blocking anything here. If this is good with @jasnell it's good with me.

Thank you for your patient guidance, I've learned a lot from it. I will simplify that document. Regarding your test, may I take a look at your test code? Because the improvement is quite noticeable on my machine. Here is my test code:

const { isStringOneByteRepresentation } = require("v8");

const bf = Buffer.alloc(1000);

function benchmark(input, topic) {
    console.time("before " + topic);
    for (let index = 0; index < 999999; index++) {
        bf.writeUint32LE(0, Buffer.byteLength(input, 'utf8'));
        bf.write(input, 4, 'utf8');
    }
    console.timeEnd("before " + topic);
    
    console.time("after " + topic);
    for (let index = 0; index < 999999; index++) {
        if (isStringOneByteRepresentation(input)) {
            bf.writeUint32LE(0, input.length);
            bf.write(input, 4, 'latin1');
        } else {
            bf.writeUint32LE(0, input.length * 2);
            bf.write(input, 4, 'utf16le');
        }
    }
    console.timeEnd("after " + topic);
    console.log("\n");
}

benchmark(new Array(1).fill(0).map(x => "qwertyuiopasdfghjklzxcvbnm").join(''), "short latin1");
benchmark(new Array(1).fill(0).map(x => "😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁").join(''), "short utf16")


benchmark(new Array(5).fill(0).map(x => "qwertyuiopasdfghjklzxcvbnm").join(''), "long latin1");
benchmark(new Array(5).fill(0).map(x => "😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁").join(''), "long utf16")

And here are the results:

before short latin1: 52.94ms
after short latin1: 21.978ms


before short utf16: 123.66ms
after short utf16: 55.009ms


before long latin1: 35.731ms
after long latin1: 24.655ms


before long utf16: 353.766ms
after long utf16: 58.19ms

I think the improvement comes from two aspects: one is the calculation of the byte length of strings, and the other is copying.

theweipeng avatar Jan 03 '25 02:01 theweipeng

CI: https://ci.nodejs.org/job/node-test-pull-request/64311/

nodejs-github-bot avatar Jan 03 '25 17:01 nodejs-github-bot

@theweipeng I was comparing the current version against always using UTF16-LE, to be clear, not against UTF-8. Otherwise I don't think you end up with a fair comparison (UTF-8 mainly has space saving advantages, but this seems to be about runtime performance instead).

addaleax avatar Jan 03 '25 20:01 addaleax

@theweipeng I was comparing the current version against always using UTF16-LE, to be clear, not against UTF-8. Otherwise I don't think you end up with a fair comparison (UTF-8 mainly has space saving advantages, but this seems to be about runtime performance instead).

@addaleax

I compared the current version against always uses UTF-16LE, and there is still a noticeable improvement when the string is in Latin1. I think this imporvement comes from encoding Latin1 to utf16

My machine: Apple M2 Pro 16GB

Here is my code:

const { isStringOneByteRepresentation } = require("v8");

const bf = Buffer.alloc(1000);

function benchmark(input, topic) {
    console.time("before " + topic);
    for (let index = 0; index < 999999; index++) {
        bf.writeUint32LE(0, input.length * 2);
        bf.write(input, 4, 'utf16le');
    }
    console.timeEnd("before " + topic);
    
    console.time("after " + topic);
    for (let index = 0; index < 999999; index++) {
        if (isStringOneByteRepresentation(input)) {
            bf.writeUint32LE(0, input.length);
            bf.write(input, 4, 'latin1');
        } else {
            bf.writeUint32LE(0, input.length * 2);
            bf.write(input, 4, 'utf16le');
        }
    }
    console.timeEnd("after " + topic);
    console.log("\n");
}

benchmark(new Array(1).fill(0).map(x => "qwertyuiopasdfghjklzxcvbnm").join(''), "short latin1");
benchmark(new Array(1).fill(0).map(x => "😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁").join(''), "short utf16")


benchmark(new Array(5).fill(0).map(x => "qwertyuiopasdfghjklzxcvbnm").join(''), "long latin1");
benchmark(new Array(5).fill(0).map(x => "😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁").join(''), "long utf16")

Here are the results:

before short latin1: 73.69ms
after short latin1: 22.572ms


before short utf16: 60.322ms
after short utf16: 57.602ms


before long latin1: 62.184ms
after long latin1: 25.765ms


before long utf16: 61.797ms
after long utf16: 56.201ms

theweipeng avatar Jan 04 '25 02:01 theweipeng

@theweipeng I was comparing the current version against always using UTF16-LE, to be clear, not against UTF-8. Otherwise I don't think you end up with a fair comparison (UTF-8 mainly has space saving advantages, but this seems to be about runtime performance instead).

@addaleax

I compared the current version against always uses UTF-16LE, and there is still a noticeable improvement when the string is in Latin1. I think this imporvement comes from encoding Latin1 to utf16

My machine: Apple M2 Pro 16GB

Here is my code:

const { isStringOneByteRepresentation } = require("v8");

const bf = Buffer.alloc(1000);

function benchmark(input, topic) {
    console.time("before " + topic);
    for (let index = 0; index < 999999; index++) {
        bf.writeUint32LE(0, input.length * 2);
        bf.write(input, 4, 'utf16le');
    }
    console.timeEnd("before " + topic);
    
    console.time("after " + topic);
    for (let index = 0; index < 999999; index++) {
        if (isStringOneByteRepresentation(input)) {
            bf.writeUint32LE(0, input.length);
            bf.write(input, 4, 'latin1');
        } else {
            bf.writeUint32LE(0, input.length * 2);
            bf.write(input, 4, 'utf16le');
        }
    }
    console.timeEnd("after " + topic);
    console.log("\n");
}

benchmark(new Array(1).fill(0).map(x => "qwertyuiopasdfghjklzxcvbnm").join(''), "short latin1");
benchmark(new Array(1).fill(0).map(x => "😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁").join(''), "short utf16")


benchmark(new Array(5).fill(0).map(x => "qwertyuiopasdfghjklzxcvbnm").join(''), "long latin1");
benchmark(new Array(5).fill(0).map(x => "😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁").join(''), "long utf16")

Here are the results:

before short latin1: 73.69ms
after short latin1: 22.572ms


before short utf16: 60.322ms
after short utf16: 57.602ms


before long latin1: 62.184ms
after long latin1: 25.765ms


before long utf16: 61.797ms
after long utf16: 56.201ms

@addaleax My benchmark results are quite good. Could you please help check them?

theweipeng avatar Jan 16 '25 07:01 theweipeng

CI: https://ci.nodejs.org/job/node-test-pull-request/65361/

nodejs-github-bot avatar Feb 21 '25 15:02 nodejs-github-bot

CI: https://ci.nodejs.org/job/node-test-pull-request/65366/ πŸ’š

nodejs-github-bot avatar Feb 21 '25 20:02 nodejs-github-bot

If there’s anything else I can do to help move this forward, please let me know. I’m more than happy to make any additional changes or provide further details.

theweipeng avatar Feb 24 '25 09:02 theweipeng

Landed in 6cb0690fccd6

jasnell avatar Feb 25 '25 16:02 jasnell