node icon indicating copy to clipboard operation
node copied to clipboard

Avoid match looping in sourceMappingURL tag detection by using negative lookahead

Open Llorx opened this issue 6 months ago • 7 comments

I think that a single regex is better than looping a regex multiple times. The code is smaller, using const and such.

Llorx avatar Jun 13 '25 21:06 Llorx

I can't disagree with code simplicity, do you have benchmarks? I mean, regexp is not the nicest thing to maintain

juanarbol avatar Jun 13 '25 22:06 juanarbol

I can't disagree with code simplicity, do you have benchmarks?

I didn't run any benchmark. Going to do a PoC with iso-bench and post the result.

I mean, regexp is not the nicest thing to maintain

Yeah, that's true... 😔

Llorx avatar Jun 13 '25 22:06 Llorx

EDIT: Invalid test. Rewriting...

Llorx avatar Jun 13 '25 22:06 Llorx

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 90.07%. Comparing base (5fe7800) to head (78040e3). Report is 37 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #58704      +/-   ##
==========================================
- Coverage   90.16%   90.07%   -0.09%     
==========================================
  Files         637      640       +3     
  Lines      188001   188287     +286     
  Branches    36881    36915      +34     
==========================================
+ Hits       169509   169602      +93     
- Misses      11238    11400     +162     
- Partials     7254     7285      +31     

see 72 files with indirect coverage changes

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • :package: JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

codecov[bot] avatar Jun 13 '25 22:06 codecov[bot]

Done. Summary:

  • A bit of improvement on single comments, and equalizing the more lines the file has (from 1.16x to 1.00x).
  • A lot of worsening on multiple comments, but getting a bit better the more lines the file has (from 17.82x to 14.49x).

I'm not sure the amount of files with multiple comments that we are going to find. Still, can process from thousands to millions of files per second.

I agree that this is making the regex more difficult to read, though. Negative lookaheads are tricky.

I don't know how do you feel about this. I don't event know how I feel about this. In non-performance-critical places I prefer simpler code, and I'm not sure if processing thousands of sourcemaps per second is considered low performance in this context. I don't know how many projects with files with 2000+ lines and multiple sourcemap comments are around there. Still, if the project has 500 files like that (wich are a lot in my opinion), the performance difference is not going to be noticeable I think.

Benchmark code:

const { IsoBench } = require("iso-bench");

const negativeLookaheadRegex = /(\/[*/]#\s+sourceMappingURL=)(?<sourceMappingURL>[^\s]+)(?![\s\S]*\1)/;
const multipleMatchRegex = /\/[*/]#\s+sourceMappingURL=(?<sourceMappingURL>[^\s]+)/g;

const singleComment = {
    "2lines": `test string yay ok
//# sourceMappingURL=erfgasdfgasdfgasdgasdg`,

    "20lines": `${`test string yay ok\n`.repeat(19)}
//# sourceMappingURL=erfgasdfgasdfgasdgasdg`,

    "200lines": `${`test string yay ok\n`.repeat(199)}
//# sourceMappingURL=erfgasdfgasdfgasdgasdg`,

    "2000lines": `${`test string yay ok\n`.repeat(1990)}
//# sourceMappingURL=erfgasdfgasdfgasdgasdg`
};
const doubleComment = {
    "2lines": `//# sourceMappingURL=g44tgbh463h643h6
//# sourceMappingURL=erfgasdfgasdfgasdgasdg`,

    "20lines": `${new Array(19).fill(0).map((_, i) => {
        return i % 10 === 9 ? `//# sourceMappingURL=g44tgbh463h643h6${i}` : `test string yay ok ${i}`;
    }).join("\n")}}
//# sourceMappingURL=erfgasdfgasdfgasdgasdg`,

    "200lines": `${new Array(199).fill(0).map((_, i) => {
        return i % 100 === 99 ? `//# sourceMappingURL=g44tgbh463h643h6${i}` : `test string yay ok ${i}`;
    }).join("\n")}}
//# sourceMappingURL=erfgasdfgasdfgasdgasdg`,

    "2000lines": `${new Array(1999).fill(0).map((_, i) => {
        return i % 1000 === 999 ? `//# sourceMappingURL=g44tgbh463h643h6${i}` : `test string yay ok ${i}`;
    }).join("\n")}}
//# sourceMappingURL=erfgasdfgasdfgasdgasdg`
};

function matchNegativeLookahead(code) {
    const match = negativeLookaheadRegex.exec(code);
    if (match == null || !match.groups) {
        return null;
    }
    return match.groups.sourceMappingURL;
}
function matchLoop(code) {
    let match;
    let lastMatch;
    while ((match = multipleMatchRegex.exec(code))) {
        lastMatch = match;
    }
    if (lastMatch == null || !lastMatch.groups) {
        return null;
    }
    return lastMatch.groups.sourceMappingURL;
}
function add(bench, code) {
    return bench.add("negative lookahead", () => {
        return matchNegativeLookahead(code);
    }).add("match loop", () => {
        return matchLoop(code);
    });
}
const bench = new IsoBench();
add(bench, singleComment["2lines"]).endGroup("Single comment: 2 lines");
add(bench, singleComment["20lines"]).endGroup("Single comment: 20 lines");
add(bench, singleComment["200lines"]).endGroup("Single comment: 200 lines");
add(bench, singleComment["2000lines"]).endGroup("Single comment: 2000 lines");
add(bench, doubleComment["2lines"]).endGroup("Multiple comment: 2 lines");
add(bench, doubleComment["20lines"]).endGroup("Multiple comment: 20 lines");
add(bench, doubleComment["200lines"]).endGroup("Multiple comment: 200 lines");
add(bench, doubleComment["2000lines"]).endGroup("Multiple comment: 2000 lines");
bench.consoleLog().run();

Result: image

Llorx avatar Jun 13 '25 23:06 Llorx

Ok, I tried with different approaches, and I think that I found a good balance between performance an code, as performance is good (even better with single and multiple comment lines), has less code, uses const and avoids explicit loops.

I tested:

  • matchNegativeLookahead: Using a negative lookahead with const.
  • matchLoop: The original one using exec loops and let.
  • matchReduce: Using matchAll and reduce to get the last element with const.
  • matchArray: Using matchAll, Array.from and at to get the last element with const.
  • matchGlobal: Using a global match and at to get the last element with const, and then indexOf to substring the URL part.

Personally, matchGlobal is the one I like the most, as has no explicit loops, less code, uses const and has a good performance.

If the reviewer is OK with this, I can push a commit with the matchGlobal code.

EDIT: Added "no comment" benchmarks.

Benchmark code:

const { IsoBench } = require("iso-bench");

const negativeLookaheadRegex = /(\/[*/]#\s+sourceMappingURL=)(?<sourceMappingURL>[^\s]+)(?![\s\S]*\1)/;
const globalMatch = /\/[*/]#\s+sourceMappingURL=[^\s]+/g;
const multipleMatchRegex = /\/[*/]#\s+sourceMappingURL=(?<sourceMappingURL>[^\s]+)/g;

const noComment = {
    "2lines": "test string yay ok\n".repeat(2),
    "20lines": "test string yay ok\n".repeat(20),
    "200lines": "test string yay ok\n".repeat(200),
    "2000lines": "test string yay ok\n".repeat(2000)
};
const singleComment = {
    "2lines": `test string yay ok
//# sourceMappingURL=erfgasdfgasdfgasdgasdg`,

    "20lines": `${"test string yay ok\n".repeat(19)}
//# sourceMappingURL=erfgasdfgasdfgasdgasdg`,

    "200lines": `${"test string yay ok\n".repeat(199)}
//# sourceMappingURL=erfgasdfgasdfgasdgasdg`,

    "2000lines": `${"test string yay ok\n".repeat(1990)}
//# sourceMappingURL=erfgasdfgasdfgasdgasdg`
};
const doubleComment = {
    "2lines": `//# sourceMappingURL=g44tgbh463h643h6
//# sourceMappingURL=erfgasdfgasdfgasdgasdg`,

    "20lines": `${new Array(19).fill(0).map((_, i) => {
        return i === 10 ? `//# sourceMappingURL=g44tgbh463h643h6${i}` : `test string yay ok ${i}`;
    }).join("\n")}}
//# sourceMappingURL=erfgasdfgasdfgasdgasdg`,

    "200lines": `${new Array(199).fill(0).map((_, i) => {
        return i === 100 ? `//# sourceMappingURL=g44tgbh463h643h6${i}` : `test string yay ok ${i}`;
    }).join("\n")}}
//# sourceMappingURL=erfgasdfgasdfgasdgasdg`,

    "2000lines": `${new Array(1999).fill(0).map((_, i) => {
        return i === 1000 ? `//# sourceMappingURL=g44tgbh463h643h6${i}` : `test string yay ok ${i}`;
    }).join("\n")}}
//# sourceMappingURL=erfgasdfgasdfgasdgasdg`
};

// Test functions
function matchNegativeLookahead(code) {
    const match = negativeLookaheadRegex.exec(code);
    if (match == null || !match.groups) {
        return null;
    }
    return match.groups.sourceMappingURL;
}
function matchLoop(code) {
    let match;
    let lastMatch;
    while ((match = multipleMatchRegex.exec(code))) {
        lastMatch = match;
    }
    if (lastMatch == null || !lastMatch.groups) {
        return null;
    }
    return lastMatch.groups.sourceMappingURL;
}
function matchReduce(code) {
    const match = code.matchAll(multipleMatchRegex).reduce((_, match) => match, null);
    if (match == null || !match.groups) {
        return null;
    }
    return match.groups.sourceMappingURL;
}
function matchArray(code) {
    const match = Array.from(code.matchAll(multipleMatchRegex)).at(-1);
    if (match == null || !match.groups) {
        return null;
    }
    return match.groups.sourceMappingURL;
}
function matchGlobal(code) {
    const match = code.match(globalMatch)?.at(-1);
    if (match == null) {
        return null;
    }
    return match.substring(match.indexOf("=") + 1);
}
// End test functions

function add(bench, code) {
    return bench.add("match loop (original)", () => {
        return matchLoop(code);
    }).add("match global (new one I like)", () => {
        return matchGlobal(code);
    }).add("negative lookahead (meh on multiline)", () => {
        return matchNegativeLookahead(code);
    }).add("match reduce (meh)", () => {
        return matchReduce(code);
    }).add("match array (meh)", () => {
        return matchArray(code);
    });
}
const bench = new IsoBench();
add(bench, singleComment["2lines"]).endGroup("Single comment: 2 lines");
add(bench, singleComment["20lines"]).endGroup("Single comment: 20 lines");
add(bench, singleComment["200lines"]).endGroup("Single comment: 200 lines");
add(bench, singleComment["2000lines"]).endGroup("Single comment: 2000 lines");
add(bench, doubleComment["2lines"]).endGroup("Multiple comment: 2 lines");
add(bench, doubleComment["20lines"]).endGroup("Multiple comment: 20 lines");
add(bench, doubleComment["200lines"]).endGroup("Multiple comment: 200 lines");
add(bench, doubleComment["2000lines"]).endGroup("Multiple comment: 2000 lines");
add(bench, noComment["2lines"]).endGroup("No comment: 2 lines");
add(bench, noComment["20lines"]).endGroup("No comment: 20 lines");
add(bench, noComment["200lines"]).endGroup("No comment: 200 lines");
add(bench, noComment["2000lines"]).endGroup("No comment: 2000 lines");
bench.consoleLog().run();

Benchmark results: image

Llorx avatar Jun 14 '25 08:06 Llorx

I'm not sure what subsystem this applies to...

Llorx avatar Jun 14 '25 18:06 Llorx

I think maybe src: is fine.

Ethan-Arrowood avatar Jun 16 '25 15:06 Ethan-Arrowood

Could also use process:

If you ever are unsure, i recommend looking at what other PRs used that previously modified this code.

Ethan-Arrowood avatar Jun 16 '25 15:06 Ethan-Arrowood

Ok, I'm going with lib as it was the very previous commit for these lines that I'm changing.

Llorx avatar Jun 16 '25 19:06 Llorx

@legendecas I've modified the eval test. Do you think that's enough or you prefer a single file specific for this?

Llorx avatar Jun 19 '25 18:06 Llorx

@legendecas I noticed a problem with the negative lookahead regex. If you have mixed //# sourceMappingURL= and /* sourceMappingURL= comments it will fail, so instead of explaining the regexes, I went by simplifying the regexes and using the matchGlobal explained here https://github.com/nodejs/node/pull/58704#issuecomment-2972484345 which doesn't have that problem.

It also has more performance than the negative lookahead and simplified both the negative lookahead regex and the original exec loop code, so I think that is a better approach. What do you think?

Llorx avatar Jun 21 '25 07:06 Llorx

Ok, at this point I don't know what is happening. A linter about prototype pollution fails with StringPrototypeMatch and I'm not sure why as I'm using the primordial like all the other StringPrototypeX.

Llorx avatar Jun 21 '25 07:06 Llorx

Ok, at this point I don't know what is happening. A linter about prototype pollution fails with StringPrototypeMatch and I'm not sure why as I'm using the primordial like all the other StringPrototypeX.

Checkout https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/match#description for explaination about the prototype pollution (i.e. RegExp.prototype[Symbol.match]() in this case). An alternative to StringPrototypeMatch is using RegExpPrototypeSymbolMatch directly

legendecas avatar Jun 21 '25 17:06 legendecas

Checkout https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/match#description for explaination about the prototype pollution (i.e. RegExp.prototype[Symbol.match]() in this case). An alternative to StringPrototypeMatch is using RegExpPrototypeSymbolMatch directly

Ahhh makes sense! Thank you.

Llorx avatar Jun 22 '25 18:06 Llorx

Now it complains that RegExpPrototypeSymbolMatch looks up the exec property. I'm going to close the PR as I don't have time for this :-(

Llorx avatar Jun 22 '25 18:06 Llorx