opentelemetry-js-contrib icon indicating copy to clipboard operation
opentelemetry-js-contrib copied to clipboard

fix(instrumentation-mysql2)!: Missing masking of sql queries

Open gflachs opened this issue 9 months ago • 15 comments

Which problem is this PR solving?

Based on Issue #1758 and my comment in the PR #1857 this PR is a quickfix for missing masking from sql queries.

Since OpenTelemetry instrumentation should remain as lightweight as possible, introducing a full SQL parser would be too heavy. As a result, the default masking approach is intentionally simplistic. However, users can define their own maskStatementHook, allowing them to leverage a SQL parser or other custom logic for more advanced masking if needed.

Additionally, some use cases may require no masking at all. To accommodate this, the maskStatement setting allows users to disable masking entirely, preserving the original behavior of the MySQL2Instrumentation as it was before.

Short description of the changes

This PR extends MySQL2InstrumentationConfig with two new options:

  • maskStatementHook → A customizable hook for masking SQL statements (optional). Default:
 return query.replace(/\b\d+\b/g, '?').replace(/(["'])(?:(?=(\\?))\2.)*?\1/g, '?')
  • maskStatement → A boolean flag (default: true) that determines whether maskStatementHook is applied at all.

gflachs avatar Feb 25 '25 11:02 gflachs

CLA Signed

The committers listed above are authorized under a signed CLA.

  • :white_check_mark: login: gflachs (ea64bdcf81404d153dc48b7cf359e6d7c04d1f08, ee9d4dd9693b5a851f83b2b3ff3cbbcedd1f327a, 907b1045bb92fe612101674f489dc4d2745e564b, 1c3dc7c198c86add70f0c7a76112acd08ffddc25, a369681e6358372a17d4526ff31b510bbe31e38f, 0a2aad89e30e113c3c645fd5be3b6f2f541139fc, 8a9c0e83cb7632fbe35125299c21611ccba35fba)
  • :white_check_mark: login: maryliag / name: Marylia Gutierrez (963c343432969028630d6f842b5b92f940ee8265, f47d8d457827fbca41828900ac4fe9966d081c2a)

This package does not have an assigned component owner and is considered unmaintained. As such this package is in feature-freeze and this PR will be closed with 14 days unless a new owner or a sponsor (a member of @open-telemetry/javascript-approvers) for the feature is found. It is the responsibility of the author to find a sponsor for this feature. Are you familiar with this package? Consider becoming a component owner.

github-actions[bot] avatar Feb 26 '25 06:02 github-actions[bot]

Please disregard the previous comment by the GH action - this component actually maintained by @raphael-theriault-swi now. I forgot to update the workflow.

I'll open a PR to make sure this comment does not pop up again for this instrumentation.

pichlermarc avatar Feb 26 '25 09:02 pichlermarc

Looks good, left some very minor suggestions but I'm happy with this either way !

Thanks :) Github still claims, that a Review from a Maintainer ist missing :thinking: Do I have to do soemthing else, or is it still the Bug mentioned by @pichlermarc ?

gflachs avatar Mar 04 '25 18:03 gflachs

Approvers are still the ones doing the workflow approval and merging, so this is still waiting on Marc or someone else to merge yes.

raphael-theriault-swi avatar Mar 04 '25 18:03 raphael-theriault-swi

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 89.07%. Comparing base (8d05684) to head (963c343). Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2732      +/-   ##
==========================================
+ Coverage   89.05%   89.07%   +0.02%     
==========================================
  Files         188      188              
  Lines        9189     9197       +8     
  Branches     1891     1890       -1     
==========================================
+ Hits         8183     8192       +9     
+ Misses       1006     1005       -1     
Files with missing lines Coverage Δ
...ages/instrumentation-mysql2/src/instrumentation.ts 95.65% <100.00%> (ø)
packages/instrumentation-mysql2/src/utils.ts 94.54% <100.00%> (+3.05%) :arrow_up:
:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • :package: JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

codecov[bot] avatar Mar 05 '25 16:03 codecov[bot]

Hmm, looks like the lint step is failing. @gflachs, please run npm run lint:fix and commit the results. Thanks! 🙂

pichlermarc avatar Mar 05 '25 16:03 pichlermarc

Would appreciate the changes to hasValues at the same time since another commit is needed.

raphael-theriault-swi avatar Mar 06 '25 18:03 raphael-theriault-swi

Would appreciate the changes to hasValues at the same time since another commit is needed.

No problem, I'll do both tomorrow morning (MEZ, UTC +1)

gflachs avatar Mar 06 '25 19:03 gflachs

Hmm, looks like the lint step is failing. @gflachs, please run npm run lint:fix and commit the results. Thanks! 🙂

Done, it was complaining about an unnecessary type annotation in maskStatement : boolean = true, since it could already infer the type from the assigned value true.

gflachs avatar Mar 07 '25 10:03 gflachs

Would appreciate the changes to hasValues at the same time since another commit is needed.

I added it in the current commit :)

gflachs avatar Mar 07 '25 10:03 gflachs

Hi @gflachs , thank you for working on this. I have concerns with the performance of the masking itself. Using regex and replacements can be very costly, specially in large queries (I have seen queries with over 20k characters). Can you provide some benchmark comparing with and without your changes, considering a heavy workload with big queries?

Also, since this could cause a performance issue, I don't think we should have maskStatement as default true

maryliag avatar Mar 10 '25 14:03 maryliag

Hi @gflachs , thank you for working on this. I have concerns with the performance of the masking itself. Using regex and replacements can be very costly, specially in large queries (I have seen queries with over 20k characters). Can you provide some benchmark comparing with and without your changes, considering a heavy workload with big queries?

Also, since this could cause a performance issue, I don't think we should have maskStatement as default true

Hey @maryliag, Thanks for your reply and for raising your concerns about the performance of the masking itself—I totally get where you’re coming from, though I don’t fully share the same worry since it’s always a tradeoff between security/privacy and performance.

I’ve put together a quick benchmark to dig into this:

const { performance } = require('perf_hooks');
const fs = require('fs');
const os = require('os');

function getSystemInfo() {
    let cpuLimit = "Unlimited";
    let memLimit = "Unlimited";

    try {
        const cpuData = fs.readFileSync('/sys/fs/cgroup/cpu.max', 'utf8').trim().split(' ');
        if (cpuData[0] !== 'max') {
            cpuLimit = (parseInt(cpuData[0]) / 1000) + " ms CPU time per 100ms";
        }
    } catch (err) {
        console.warn("⚠️  Could not retrieve CPU limit.");
    }

    try {
        const memData = fs.readFileSync('/sys/fs/cgroup/memory.max', 'utf8').trim();
        if (memData !== 'max') {
            memLimit = (parseInt(memData) / (1024 ** 2)).toFixed(2) + " MB";
        }
    } catch (err) {
        console.warn("⚠️  Could not retrieve memory limit.");
    }

    return {
        cpu: os.cpus()[0].model,
        cores: os.cpus().length,
        totalMemGB: (os.totalmem() / (1024 ** 3)).toFixed(2) + " GB",
        platform: os.platform(),
        release: os.release(),
        containerCPU: cpuLimit,
        containerMemory: memLimit
    };
}

function regexMasking(query) {
    return query.replace(/\b\d+\b/g, '?').replace(/(["'])(?:(?=(\\?))\2.)*?\1/g, '?');
}

function generateLargeSQLQuery(size) {
    let query = "SELECT id, name, age FROM users WHERE age = 25 AND name = 'John Doe' ";
    while (query.length < size) {
        query += "UNION SELECT id, 'Random Name', 30 FROM users WHERE id = " + Math.floor(Math.random() * 1000) + " ";
    }
    return query.slice(0, size);
}

function benchmarkFunction(fn, query, iterations = 3) {
    let totalTime = 0;
    for (let i = 0; i < iterations; i++) {
        let start = performance.now();
        fn(query);
        totalTime += (performance.now() - start);
    }
    return (totalTime / iterations).toFixed(2);
}

function benchmark() {
    const sizes = [100, 1000, 10000, 50000, 100000, 1000000];

    console.log("\n🔍 **System Specifications:**\n", JSON.stringify(getSystemInfo(), null, 2));
    console.log("\n🚀 **Benchmark: SQL Masking Performance**");
    console.log("| SQL Size | Regex (ms) |");
    console.log("|----------|------------|");

    for (let size of sizes) {
        const query = generateLargeSQLQuery(size);

        let regexTime = benchmarkFunction(regexMasking, query);

        console.log(`| ${size.toLocaleString()} | ${regexTime} |`);
    }
}

benchmark();

Benchmarking within a docker-container to simulate different server-enviroments:

docker run --rm -it --cpus="0.25" --memory="128m" benchmark_sql:latest

🔍 System Specifications: { "cpu": "AMD Ryzen 7 PRO 7840U w/ Radeon 780M Graphics", "cores": 16, "totalMemGB": "30.00 GB", "platform": "linux", "release": "6.13.5-200.fc41.x86_64", "containerCPU": "25 ms CPU time per 100ms", "containerMemory": "128.00 MB" }

🚀 Benchmark: SQL Masking Performance

SQL Size Regex (ms)
100 0.07
1,000 0.01
10,000 0.08
50,000 1.11
100,000 0.41
1,000,000 7.56

docker run --rm -it --cpus="0.1" --memory="50m" benchmark_sql:latest

🔍 System Specifications: { "cpu": "AMD Ryzen 7 PRO 7840U w/ Radeon 780M Graphics", "cores": 16, "totalMemGB": "30.00 GB", "platform": "linux", "release": "6.13.5-200.fc41.x86_64", "containerCPU": "10 ms CPU time per 100ms", "containerMemory": "50.00 MB" }

🚀 Benchmark: SQL Masking Performance

SQL Size Regex (ms)
100 0.06
1,000 0.01
10,000 0.07
50,000 0.48
100,000 0.38
1,000,000 67.37

docker run --rm -it --cpus="0.5" --memory="1g" benchmark_sql:latest

🔍 System Specifications: { "cpu": "AMD Ryzen 7 PRO 7840U w/ Radeon 780M Graphics", "cores": 16, "totalMemGB": "30.00 GB", "platform": "linux", "release": "6.13.5-200.fc41.x86_64", "containerCPU": "50 ms CPU time per 100ms", "containerMemory": "1024.00 MB" }

🚀 Benchmark: SQL Masking Performance

SQL Size Regex (ms)
100 0.03
1,000 0.01
10,000 0.10
50,000 0.56
100,000 0.43
1,000,000 5.67

My Take

I hear your concern about regex and replacements being costly, especially with large queries like the 20k+ character ones you mentioned. The benchmarks show that for smaller queries (up to 10k characters), the impact is negligible (<0.1 ms). For larger queries (e.g., 1M characters), it can climb to 5–67 ms depending on resources, though a 20k-character query would likely fall in the 0.1–0.2 ms range—still fast, but potentially noticeable under heavy load.

That said, I’d like to push back a bit on setting maskStatement to false by default. Data security and privacy are critical, and masking sensitive data (like numbers and strings) in outputs is a key safeguard. The performance hit, even in extreme cases, seems manageable in most realistic scenarios, and I’d argue the security benefits outweigh the cost. Turning it off by default could expose sensitive data unnecessarily, which feels riskier than a slight performance dip.

gflachs avatar Mar 10 '25 14:03 gflachs

Thank you for performing the tests. Let me clarify my suggestion, I think the default should be:

  • not send this data, since I agree that the security risk is a top priority

then they have the option to:

  • enable to send data and with that we would have the mask of data

in the future we want to take advantage of the config file, so there people will be able to select one of the options:

  1. I'm already sending data without the values, so you can send as is
  2. I'm sending confidential data, so I want to remove everything and not mask, because my system cannot afford performance hits
  3. I'm sending confidential data and I want it to be masked, since my system can afford performance hits

we can even have other performance improvements, such as if query is up to X characters mask it, otherwise remove completely, or even cut to X characters and then mask it

My main concern at the moment is to make this behaviour the default, and a lot of people use ORMs and those tend to create really huge queries

maryliag avatar Mar 10 '25 15:03 maryliag

I'd be curious to see the delta between the masking and sqlstring's format (which is currently used by default), given the latter also uses regex on top of some other non trivial logic. Don't disagree with the performance concerns but I have a feeling masking might not make it worse than it currently is.

raphael-theriault-swi avatar Mar 10 '25 16:03 raphael-theriault-swi

Hi @maryliag @raphael-theriault-swi - just double-checking since I've passed by this PR multiple times on my ready-to-merge list:

is this PR ready to merge, or is some additional input/changes needed from @gflachs? (to match semconv guidance or other things) 🙂

pichlermarc avatar Apr 28 '25 11:04 pichlermarc

It's not ready to merge. Current behaviour: the value os maskStatement is set to default true and sanitization is being made by default.

Changes needed: the default value should be false and no information should be sent on that case. Then IF the value is true, sanitization should be done and the query text can be sent

maryliag avatar Apr 28 '25 14:04 maryliag

Please read my comment for the purpose of the maskStatement flag :)

gflachs avatar Apr 28 '25 14:04 gflachs

Turning it off by default could expose sensitive data unnecessarily

This is why I said that the it needs to be aligned, when the value is false, nothing it sent (so we don't potentially send sensitive information), and when the value is true, you can do the masking + sending the data.

maryliag avatar Apr 28 '25 17:04 maryliag

@gflachs I think this is quite close, just needs the off-by-default comment addressed

dyladan avatar Jun 25 '25 16:06 dyladan

heads up @gflachs

A recent PR that changes the folder structure has been merged. This affects your PR so you need to sync with the main branch.

Happy to help if you need it :)

david-luna avatar Jul 08 '25 17:07 david-luna

I updated the default value of maskStatement to false, and made the correspondent changes in the doc and PR description. That was the only remaining change for this PR, and it should be good to merge now

maryliag avatar Jul 09 '25 20:07 maryliag