fix(instrumentation-mysql2)!: Missing masking of sql queries
Which problem is this PR solving?
Based on Issue #1758 and my comment in the PR #1857 this PR is a quickfix for missing masking from sql queries.
Since OpenTelemetry instrumentation should remain as lightweight as possible, introducing a full SQL parser would be too heavy. As a result, the default masking approach is intentionally simplistic. However, users can define their own maskStatementHook, allowing them to leverage a SQL parser or other custom logic for more advanced masking if needed.
Additionally, some use cases may require no masking at all. To accommodate this, the maskStatement setting allows users to disable masking entirely, preserving the original behavior of the MySQL2Instrumentation as it was before.
Short description of the changes
This PR extends MySQL2InstrumentationConfig with two new options:
maskStatementHook→ A customizable hook for masking SQL statements (optional). Default:
return query.replace(/\b\d+\b/g, '?').replace(/(["'])(?:(?=(\\?))\2.)*?\1/g, '?')
maskStatement→ A boolean flag (default: true) that determines whether maskStatementHook is applied at all.
The committers listed above are authorized under a signed CLA.
- :white_check_mark: login: gflachs (ea64bdcf81404d153dc48b7cf359e6d7c04d1f08, ee9d4dd9693b5a851f83b2b3ff3cbbcedd1f327a, 907b1045bb92fe612101674f489dc4d2745e564b, 1c3dc7c198c86add70f0c7a76112acd08ffddc25, a369681e6358372a17d4526ff31b510bbe31e38f, 0a2aad89e30e113c3c645fd5be3b6f2f541139fc, 8a9c0e83cb7632fbe35125299c21611ccba35fba)
- :white_check_mark: login: maryliag / name: Marylia Gutierrez (963c343432969028630d6f842b5b92f940ee8265, f47d8d457827fbca41828900ac4fe9966d081c2a)
This package does not have an assigned component owner and is considered unmaintained. As such this package is in feature-freeze and this PR will be closed with 14 days unless a new owner or a sponsor (a member of @open-telemetry/javascript-approvers) for the feature is found. It is the responsibility of the author to find a sponsor for this feature. Are you familiar with this package? Consider becoming a component owner.
Please disregard the previous comment by the GH action - this component actually maintained by @raphael-theriault-swi now. I forgot to update the workflow.
I'll open a PR to make sure this comment does not pop up again for this instrumentation.
Looks good, left some very minor suggestions but I'm happy with this either way !
Thanks :) Github still claims, that a Review from a Maintainer ist missing :thinking: Do I have to do soemthing else, or is it still the Bug mentioned by @pichlermarc ?
Approvers are still the ones doing the workflow approval and merging, so this is still waiting on Marc or someone else to merge yes.
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 89.07%. Comparing base (
8d05684) to head (963c343). Report is 1 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #2732 +/- ##
==========================================
+ Coverage 89.05% 89.07% +0.02%
==========================================
Files 188 188
Lines 9189 9197 +8
Branches 1891 1890 -1
==========================================
+ Hits 8183 8192 +9
+ Misses 1006 1005 -1
| Files with missing lines | Coverage Δ | |
|---|---|---|
| ...ages/instrumentation-mysql2/src/instrumentation.ts | 95.65% <100.00%> (ø) |
|
| packages/instrumentation-mysql2/src/utils.ts | 94.54% <100.00%> (+3.05%) |
:arrow_up: |
:rocket: New features to boost your workflow:
- :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
- :package: JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.
Hmm, looks like the lint step is failing.
@gflachs, please run npm run lint:fix and commit the results. Thanks! 🙂
Would appreciate the changes to hasValues at the same time since another commit is needed.
Would appreciate the changes to
hasValuesat the same time since another commit is needed.
No problem, I'll do both tomorrow morning (MEZ, UTC +1)
Hmm, looks like the lint step is failing. @gflachs, please run
npm run lint:fixand commit the results. Thanks! 🙂
Done, it was complaining about an unnecessary type annotation in maskStatement : boolean = true, since it could already infer the type from the assigned value true.
Would appreciate the changes to
hasValuesat the same time since another commit is needed.
I added it in the current commit :)
Hi @gflachs , thank you for working on this. I have concerns with the performance of the masking itself. Using regex and replacements can be very costly, specially in large queries (I have seen queries with over 20k characters). Can you provide some benchmark comparing with and without your changes, considering a heavy workload with big queries?
Also, since this could cause a performance issue, I don't think we should have maskStatement as default true
Hi @gflachs , thank you for working on this. I have concerns with the performance of the masking itself. Using regex and replacements can be very costly, specially in large queries (I have seen queries with over 20k characters). Can you provide some benchmark comparing with and without your changes, considering a heavy workload with big queries?
Also, since this could cause a performance issue, I don't think we should have
maskStatementas defaulttrue
Hey @maryliag, Thanks for your reply and for raising your concerns about the performance of the masking itself—I totally get where you’re coming from, though I don’t fully share the same worry since it’s always a tradeoff between security/privacy and performance.
I’ve put together a quick benchmark to dig into this:
const { performance } = require('perf_hooks');
const fs = require('fs');
const os = require('os');
function getSystemInfo() {
let cpuLimit = "Unlimited";
let memLimit = "Unlimited";
try {
const cpuData = fs.readFileSync('/sys/fs/cgroup/cpu.max', 'utf8').trim().split(' ');
if (cpuData[0] !== 'max') {
cpuLimit = (parseInt(cpuData[0]) / 1000) + " ms CPU time per 100ms";
}
} catch (err) {
console.warn("⚠️ Could not retrieve CPU limit.");
}
try {
const memData = fs.readFileSync('/sys/fs/cgroup/memory.max', 'utf8').trim();
if (memData !== 'max') {
memLimit = (parseInt(memData) / (1024 ** 2)).toFixed(2) + " MB";
}
} catch (err) {
console.warn("⚠️ Could not retrieve memory limit.");
}
return {
cpu: os.cpus()[0].model,
cores: os.cpus().length,
totalMemGB: (os.totalmem() / (1024 ** 3)).toFixed(2) + " GB",
platform: os.platform(),
release: os.release(),
containerCPU: cpuLimit,
containerMemory: memLimit
};
}
function regexMasking(query) {
return query.replace(/\b\d+\b/g, '?').replace(/(["'])(?:(?=(\\?))\2.)*?\1/g, '?');
}
function generateLargeSQLQuery(size) {
let query = "SELECT id, name, age FROM users WHERE age = 25 AND name = 'John Doe' ";
while (query.length < size) {
query += "UNION SELECT id, 'Random Name', 30 FROM users WHERE id = " + Math.floor(Math.random() * 1000) + " ";
}
return query.slice(0, size);
}
function benchmarkFunction(fn, query, iterations = 3) {
let totalTime = 0;
for (let i = 0; i < iterations; i++) {
let start = performance.now();
fn(query);
totalTime += (performance.now() - start);
}
return (totalTime / iterations).toFixed(2);
}
function benchmark() {
const sizes = [100, 1000, 10000, 50000, 100000, 1000000];
console.log("\n🔍 **System Specifications:**\n", JSON.stringify(getSystemInfo(), null, 2));
console.log("\n🚀 **Benchmark: SQL Masking Performance**");
console.log("| SQL Size | Regex (ms) |");
console.log("|----------|------------|");
for (let size of sizes) {
const query = generateLargeSQLQuery(size);
let regexTime = benchmarkFunction(regexMasking, query);
console.log(`| ${size.toLocaleString()} | ${regexTime} |`);
}
}
benchmark();
Benchmarking within a docker-container to simulate different server-enviroments:
docker run --rm -it --cpus="0.25" --memory="128m" benchmark_sql:latest
🔍 System Specifications: { "cpu": "AMD Ryzen 7 PRO 7840U w/ Radeon 780M Graphics", "cores": 16, "totalMemGB": "30.00 GB", "platform": "linux", "release": "6.13.5-200.fc41.x86_64", "containerCPU": "25 ms CPU time per 100ms", "containerMemory": "128.00 MB" }
🚀 Benchmark: SQL Masking Performance
| SQL Size | Regex (ms) |
|---|---|
| 100 | 0.07 |
| 1,000 | 0.01 |
| 10,000 | 0.08 |
| 50,000 | 1.11 |
| 100,000 | 0.41 |
| 1,000,000 | 7.56 |
docker run --rm -it --cpus="0.1" --memory="50m" benchmark_sql:latest
🔍 System Specifications: { "cpu": "AMD Ryzen 7 PRO 7840U w/ Radeon 780M Graphics", "cores": 16, "totalMemGB": "30.00 GB", "platform": "linux", "release": "6.13.5-200.fc41.x86_64", "containerCPU": "10 ms CPU time per 100ms", "containerMemory": "50.00 MB" }
🚀 Benchmark: SQL Masking Performance
| SQL Size | Regex (ms) |
|---|---|
| 100 | 0.06 |
| 1,000 | 0.01 |
| 10,000 | 0.07 |
| 50,000 | 0.48 |
| 100,000 | 0.38 |
| 1,000,000 | 67.37 |
docker run --rm -it --cpus="0.5" --memory="1g" benchmark_sql:latest
🔍 System Specifications: { "cpu": "AMD Ryzen 7 PRO 7840U w/ Radeon 780M Graphics", "cores": 16, "totalMemGB": "30.00 GB", "platform": "linux", "release": "6.13.5-200.fc41.x86_64", "containerCPU": "50 ms CPU time per 100ms", "containerMemory": "1024.00 MB" }
🚀 Benchmark: SQL Masking Performance
| SQL Size | Regex (ms) |
|---|---|
| 100 | 0.03 |
| 1,000 | 0.01 |
| 10,000 | 0.10 |
| 50,000 | 0.56 |
| 100,000 | 0.43 |
| 1,000,000 | 5.67 |
My Take
I hear your concern about regex and replacements being costly, especially with large queries like the 20k+ character ones you mentioned. The benchmarks show that for smaller queries (up to 10k characters), the impact is negligible (<0.1 ms). For larger queries (e.g., 1M characters), it can climb to 5–67 ms depending on resources, though a 20k-character query would likely fall in the 0.1–0.2 ms range—still fast, but potentially noticeable under heavy load.
That said, I’d like to push back a bit on setting maskStatement to false by default. Data security and privacy are critical, and masking sensitive data (like numbers and strings) in outputs is a key safeguard. The performance hit, even in extreme cases, seems manageable in most realistic scenarios, and I’d argue the security benefits outweigh the cost. Turning it off by default could expose sensitive data unnecessarily, which feels riskier than a slight performance dip.
Thank you for performing the tests. Let me clarify my suggestion, I think the default should be:
- not send this data, since I agree that the security risk is a top priority
then they have the option to:
- enable to send data and with that we would have the mask of data
in the future we want to take advantage of the config file, so there people will be able to select one of the options:
- I'm already sending data without the values, so you can send as is
- I'm sending confidential data, so I want to remove everything and not mask, because my system cannot afford performance hits
- I'm sending confidential data and I want it to be masked, since my system can afford performance hits
we can even have other performance improvements, such as if query is up to X characters mask it, otherwise remove completely, or even cut to X characters and then mask it
My main concern at the moment is to make this behaviour the default, and a lot of people use ORMs and those tend to create really huge queries
I'd be curious to see the delta between the masking and sqlstring's format (which is currently used by default), given the latter also uses regex on top of some other non trivial logic. Don't disagree with the performance concerns but I have a feeling masking might not make it worse than it currently is.
Hi @maryliag @raphael-theriault-swi - just double-checking since I've passed by this PR multiple times on my ready-to-merge list:
is this PR ready to merge, or is some additional input/changes needed from @gflachs? (to match semconv guidance or other things) 🙂
It's not ready to merge.
Current behaviour: the value os maskStatement is set to default true and sanitization is being made by default.
Changes needed: the default value should be false and no information should be sent on that case. Then IF the value is true, sanitization should be done and the query text can be sent
Please read my comment for the purpose of the maskStatement flag :)
Turning it off by default could expose sensitive data unnecessarily
This is why I said that the it needs to be aligned, when the value is false, nothing it sent (so we don't potentially send sensitive information), and when the value is true, you can do the masking + sending the data.
@gflachs I think this is quite close, just needs the off-by-default comment addressed
heads up @gflachs
A recent PR that changes the folder structure has been merged. This affects your PR so you need to sync with the main branch.
Happy to help if you need it :)
I updated the default value of maskStatement to false, and made the correspondent changes in the doc and PR description. That was the only remaining change for this PR, and it should be good to merge now