Update the list of supported ClickHouse versions and the corresponding metrics
### What does this PR do?
Updates the list of supported ClickHouse versions and the corresponding metrics: the list can now be generated automatically for all relevant versions and metrics.
### Motivation
- only very old, outdated versions of ClickHouse are currently supported
- as a result, the list of available metrics is very limited
- the list was always updated manually, so it is also incomplete
### Review checklist (to be filled by reviewers)
- [ ] Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
- [ ] Add the `qa/skip-qa` label if the PR doesn't need to be tested during QA.
- [ ] If you need to backport this PR to another branch, you can add the `backport/<branch-name>` label to the PR and it will automatically open a backport PR once this one is merged
⚠️ Major version bump
The changelog type `changed` or `removed` was used in this pull request, so the next release will bump the major version. Please make sure this is a breaking change, or use the `fixed` or `added` type instead.
Codecov Report
:x: Patch coverage is 98.13084% with 2 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 89.03%. Comparing base (dd2d491) to head (99441e5).
:warning: Report is 6 commits behind head on master.
Thanks @ekopanev for your contributions to this integration.
@sangeetashivaji @dkirov-dd thank you for the review!
I share your main concern about the backward incompatibility of these changes. If we can't break compatibility and drop the truly outdated, unsupported ClickHouse versions, let me prepare another update that splits the queries/tests between the old metrics (what the integration has right now, maintained manually) and the new ones (which we can easily generate from the sources). That way we keep supporting the existing metrics/versions, and in the future we only add new metrics from LTS versions.
Regarding metric naming, we followed your recommendations from https://docs.datadoghq.com/developers/guide/what-best-practices-are-recommended-for-naming-metrics-and-tags/#rules-and-best-practices-for-naming-metrics, which say nothing about the case of metric names; rather, you support case-sensitive names. Should they be renamed to lowercase only?
@DataDog/codex review
@codex review
Hey @ekopanev, this PR bumps the number of metrics potentially submitted by the ClickHouse integration from 1.2k to 3.6k. This is a large increase, and I wonder how it will impact the number of metrics that hit our backend.
I have a few questions:
- What percentage of these metrics are always available? (in other words, how many of these metrics require a specific condition in order to be emitted by ClickHouse)
- How many metrics would a single instance be expected to submit at each check run? For example, is it expected that `clickhouse.asynchronous_metrics.FilesystemCacheBytes` will be submitted multiple times with different tags at each run?
- Would it make sense to "lock" a portion of the newly added metrics behind an instance flag? This can be useful if there are metrics that are more developer-oriented or too niche for most users. I'm looking at metrics like `ClickHouseProfileEvents_CoordinatedMergesMergeCoordinatorLockStateForShareMicroseconds`.
Hi @dkirov-dd,
> What percentage of these metrics are always available? (in other words, how many of these metrics require a specific condition in order to be emitted by ClickHouse)
Only metrics from the `system.metrics` table are guaranteed to always exist; the rest are optional and depend on specific conditions (e.g. running a special query, triggering an error, running a query in a particular CSP, etc.).
Currently `system_metrics.py` contains 431 distinct metrics (keep in mind this is the union of all metrics across all supported versions).
According to the generator script's stats, the number of unique metrics per table is:
```
➜ clickhouse git:(clickhouse_integration) ✗ hatch run data:generate
The number of metrics:
- async_metrics: 124
- events: 1012
- metrics: 431
- errors: 631
Total: 2198
```
So only ~19.6% are always available.
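To sanity-check that figure, here is a quick Python sketch using the per-table counts copied verbatim from the generator output above:

```python
# Unique metric counts per system table, taken from the generator stats above
counts = {"async_metrics": 124, "events": 1012, "metrics": 431, "errors": 631}

total = sum(counts.values())          # 2198
always_available = counts["metrics"]  # only system.metrics is guaranteed to exist
share = always_available / total * 100

print(f"{always_available}/{total} -> {share:.1f}% always available")
# -> 431/2198 -> 19.6% always available
```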
> How many metrics would a single instance be expected to submit at each check run? For example, is it expected that `clickhouse.asynchronous_metrics.FilesystemCacheBytes` will be submitted multiple times with different tags at each run?
As I mentioned above, at least 431 metrics; the rest depend on particular conditions or errors.
For example, I've just started a test env:
```
ddev env start --dev clickhouse "py3.13-25.8"
```
and ran a single check right after, which reported the following number of metrics:
```
> ddev env agent clickhouse py3.13-25.8 check

clickhouse (5.3.1)
------------------
  Instance ID: clickhouse:21251c2b93536980 [OK]
  Configuration Source: file:/etc/datadog-agent/conf.d/clickhouse.d/clickhouse.yaml[0]
  Total Runs: 1
  Metric Samples: Last Run: 730, Total: 730
  Events: Last Run: 0, Total: 0
  Service Checks: Last Run: 1, Total: 1
  Average Execution Time : 157ms
  Last Execution Date : 2025-10-31 16:56:48.624 UTC (1761929808624)
  Last Successful Execution Date : 2025-10-31 16:56:48 UTC (1761929808000)
```
After ~10 minutes it's much the same: `Metric Samples: Last Run: 739, Total: 739`.
And after several consecutive queries like:
```
> docker exec -it clickhouse-01 clickhouse-client
...
clickhouse-01 :) select 1;
clickhouse-01 :) select 1;
clickhouse-01 :) select now();
```
it was only slightly higher and stayed constant: `Metric Samples: Last Run: 761, Total: 761`.
And regarding `clickhouse.asynchronous_metrics.FilesystemCacheBytes` - yes, it will be submitted (as will other system-resource metrics, e.g. cpu/memory/etc.):
```
> ddev env agent clickhouse py3.13-25.8 check | grep FilesystemCacheBytes
    "metric": "clickhouse.ClickHouseAsyncMetrics_FilesystemCacheBytes",
```
> Would it make sense to "lock" a portion of the newly added metrics behind an instance flag? This can be useful if there are metrics that are more developer-oriented or too niche for most users. I'm looking at metrics like `ClickHouseProfileEvents_CoordinatedMergesMergeCoordinatorLockStateForShareMicroseconds`.
Do you mean adding an extra parameter to the instance config to enable/disable them, so we can manage the list of queries run by the QueryManager?
If you think it's necessary, we can "lock" some of them behind instance flags.
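To make the idea concrete, here is a minimal, purely illustrative Python sketch of what such gating could look like. The flag names follow the `use_legacy_queries`/`use_advanced_queries` instance settings mentioned later in this thread; the query lists are placeholders, not the integration's actual queries:

```python
# Illustrative sketch only: gate query groups behind instance-level flags.
# LEGACY_QUERIES / ADVANCED_QUERIES are placeholders, not the real queries.
LEGACY_QUERIES = ["SELECT metric, value FROM system.metrics"]
ADVANCED_QUERIES = ["SELECT name, value FROM system.errors"]

def build_query_list(instance):
    """Pick the query groups to run based on instance config flags."""
    queries = []
    if instance.get("use_legacy_queries", True):    # default: enabled
        queries += LEGACY_QUERIES
    if instance.get("use_advanced_queries", True):  # default: enabled
        queries += ADVANCED_QUERIES
    return queries
```

With both flags at their defaults all query groups run; setting either flag to `false` in the instance config drops that group before the queries are handed to the query runner.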
Hello @ekopanev,
Thank you for the thorough reply! This looks good to me in terms of metric volume; we won't need to create an extra parameter for enabling/disabling the collection of certain metrics.
Hey @ekopanev, I'm just checking in to know if you're still interested/working on this feature.
hey @dkirov-dd ,
Sure, and I'm already working on what we've discussed above.
Could you please also give me some recommendations regarding https://github.com/DataDog/integrations-core/pull/21294#discussion_r2482232958? Currently I'm using the same approach you already have for "deprecated" metrics (a second query to the same table, but with different naming) - link
@ekopanev good to know! I responded to your previous comment, apologies for the oversight.
Hey @dkirov-dd, I've updated the PR recently and hope I've addressed everything we discussed above, PTAL.
And here is a detailed list of changes:
- added the latest 3 LTS ClickHouse versions (24.8, 25.3, 25.8):
  - added all possible metrics from 4 system tables: `asynchronous_metrics`, `errors`, `events` and `metrics`, based on the ClickHouse source code via a code-generation script
  - new metric prefixes: names now contain the corresponding system table name, i.e. `clickhouse.<system-table>.<original-metric-name>`
- added the 24.8, 25.3, 25.8 versions to the integration and e2e tests
  - the queries were split: the new ones are named "advanced" and the rest "legacy" (since the latter are a mix of old/new metrics across the different ClickHouse versions currently tested on the old list of versions 18/19/20/21.8/22.7)
  - this can be configured via the `use_legacy_queries` and `use_advanced_queries` instance settings (both `true` by default)
  - added "conditional" queries for the agent, because the `system.errors` table was introduced only in 20.11 and can't be used with old ClickHouse versions
- the default dashboard was updated based on the "advanced" metrics, with lots of new panels/metrics added
- the `logs` template was updated with the JSON logs format; it now supports both `plain` and `JSON` log formats
- the docker compose setup was simplified for integration/e2e tests: it now uses only 1 ClickHouse node and 1 Keeper/Zookeeper node, so the Keeper setup should work faster and be more stable
- the metadata file was also split into 2: `metadata-legacy.csv` holds the original legacy metrics (added and maintained manually) and will be merged into `metadata.csv` together with all the new metrics from the metrics generator
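For reference, a hedged sketch of how the new settings might look in `clickhouse.d/conf.yaml` (the connection fields are illustrative placeholders; only the `use_legacy_queries`/`use_advanced_queries` names and their `true` defaults come from the changes described above):

```yaml
init_config:

instances:
  - server: localhost
    port: 9000
    username: datadog
    password: <PASSWORD>
    # Both query sets default to true; disable one to collect only the
    # manually maintained legacy metrics or only the generated "advanced" ones.
    use_legacy_queries: true
    use_advanced_queries: false
```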