
Update the list of supported ClickHouse versions and relevant metrics to the latest

ekopanev opened this pull request 3 months ago • 13 comments

What does this PR do?

Updates the list of supported ClickHouse versions and the corresponding metrics; the list can now be generated automatically for all relevant versions and metrics.

Motivation

  • the integration currently supports only very old, outdated ClickHouse versions
  • as a result, the list of available metrics is very limited
  • the list has always been updated manually, so it is also incomplete

Review checklist (to be filled by reviewers)

  • [ ] Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • [ ] Add the qa/skip-qa label if the PR doesn't need to be tested during QA.
  • [ ] If you need to backport this PR to another branch, you can add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged

ekopanev · Sep 08 '25 16:09

⚠️ Major version bump

The changelog type "changed" or "removed" was used in this Pull Request, so the next release will bump the major version. Please make sure this is a breaking change, or use the "fixed" or "added" type instead.

github-actions[bot] · Sep 08 '25 16:09

Codecov Report

❌ Patch coverage is 98.13084% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.03%. Comparing base (dd2d491) to head (99441e5).
⚠️ Report is 6 commits behind head on master.


codecov[bot] · Sep 09 '25 14:09

Thanks @ekopanev for your contributions to this integration.

sangeetashivaji · Oct 16 '25 18:10

@sangeetashivaji @dkirov-dd thank you for the review!

I share your main concern about the backward incompatibility of these changes. Since we can't break compatibility and drop the really outdated, unsupported ClickHouse versions, let me make another update and split the queries/tests between the old metrics (what the integration has right now, maintained manually) and the new ones (which we can easily generate from the sources). That way we keep supporting the existing metrics/versions, and only add new metrics from LTS versions in the future.

Regarding metric naming, we followed your recommendations from https://docs.datadoghq.com/developers/guide/what-best-practices-are-recommended-for-naming-metrics-and-tags/#rules-and-best-practices-for-naming-metrics, which say nothing about the casing of metric names; rather, you support case-sensitive names. Should they be renamed to lowercase-only?

ekopanev · Oct 27 '25 17:10

@DataDog/codex review

dkirov-dd · Oct 28 '25 15:10

@codex review

steveny91 · Oct 28 '25 18:10

Hey @ekopanev, this PR bumps the number of metrics potentially submitted by the ClickHouse integration from 1.2k to 3.6k. That's a large increase, and I wonder how it will impact the number of metrics hitting our backend.

I have a few questions:

  1. What percentage of these metrics are always available? (in other words, how many of these metrics require a specific condition in order to be emitted by ClickHouse)
  2. How many metrics would a single instance be expected to submit at each check run? For example, is it expected that clickhouse.asynchronous_metrics.FilesystemCacheBytes will be submitted multiple times with different tags at each run?
  3. Would it make sense to "lock" a portion of the newly added metrics behind an instance flag? This can be useful if there are metrics that are more developer-oriented or too niche for most users. I'm looking at metrics like ClickHouseProfileEvents_CoordinatedMergesMergeCoordinatorLockStateForShareMicroseconds.

dkirov-dd · Oct 29 '25 16:10

hi @dkirov-dd

  1. What percentage of these metrics are always available? (in other words, how many of these metrics require a specific condition in order to be emitted by ClickHouse)

Only metrics from the system.metrics table are guaranteed to exist 100% of the time; the rest are optional and depend on certain conditions (e.g. running a special query, hitting an error, running a query in a particular CSP, etc.).

Currently in system_metrics.py we have 431 different metrics (also keep in mind that this is a combination of all the metrics across all supported versions).

According to the generator script's stats, we have the following number of unique metrics per table:

➜  clickhouse git:(clickhouse_integration) ✗ hatch run data:generate
The number of metrics:
- async_metrics:  124
- events:  1012
- metrics:  431
- errors:  631
Total: 2198

So only ~19.6% are always available.
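
To illustrate, that unconditional set can be inspected directly; here's a minimal sketch using the clickhouse-driver package (assuming a local server on the default native port):

    from clickhouse_driver import Client

    # Connect to a local ClickHouse server (default native port 9000)
    client = Client('localhost')

    # Rows in system.metrics exist unconditionally on every server, unlike
    # events/errors, which only appear after the matching condition occurs
    rows = client.execute('SELECT metric, value FROM system.metrics')
    print(len(rows))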

  2. How many metrics would a single instance be expected to submit at each check run? For example, is it expected that clickhouse.asynchronous_metrics.FilesystemCacheBytes will be submitted multiple times with different tags at each run?

As I mentioned above: at least 431 metrics; the rest depend on some particular condition or error.

For example, I've just started a test env:

ddev env start --dev clickhouse "py3.13-25.8"

and ran a single check right after, which reported the following number of metrics:

> ddev env agent clickhouse py3.13-25.8 check

    clickhouse (5.3.1)
    ------------------
      Instance ID: clickhouse:21251c2b93536980 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/clickhouse.d/clickhouse.yaml[0]
      Total Runs: 1
      Metric Samples: Last Run: 730, Total: 730
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 1
      Average Execution Time : 157ms
      Last Execution Date : 2025-10-31 16:56:48.624 UTC (1761929808624)
      Last Successful Execution Date : 2025-10-31 16:56:48 UTC (1761929808000)

After ~10 minutes it's pretty much the same: Metric Samples: Last Run: 739, Total: 739

And after running several consecutive queries like:

> docker exec -it clickhouse-01 clickhouse-client
...
clickhouse-01 :) select 1;
clickhouse-01 :) select 1;
clickhouse-01 :) select now();

It was just a bit higher and then constant: Metric Samples: Last Run: 761, Total: 761

And regarding clickhouse.asynchronous_metrics.FilesystemCacheBytes: yes, it will be submitted (as will other system-resource metrics, e.g. CPU/memory/etc.):

> ddev env agent clickhouse py3.13-25.8 check | grep FilesystemCacheBytes
    "metric": "clickhouse.ClickHouseAsyncMetrics_FilesystemCacheBytes",

  3. Would it make sense to "lock" a portion of the newly added metrics behind an instance flag? This can be useful if there are metrics that are more developer-oriented or too niche for most users. I'm looking at metrics like ClickHouseProfileEvents_CoordinatedMergesMergeCoordinatorLockStateForShareMicroseconds.

Do you mean just adding an extra parameter to the instance config to enable/disable them, so we can manage the list of queries executed by the QueryManager?

If you think it's necessary, we can "lock" some of them behind instance flags.
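
If we go that route, here's a minimal sketch of how such a flag could be wired into the QueryManager; the flag name, query lists, and executor below are placeholders for illustration, not the integration's actual names:

    from datadog_checks.base import AgentCheck
    from datadog_checks.base.utils.db import QueryManager

    # Placeholder query lists for this sketch; the real check would define
    # actual query definitions here
    CORE_QUERIES = []
    NICHE_QUERIES = []

    class ClickhouseCheck(AgentCheck):
        def __init__(self, name, init_config, instances):
            super().__init__(name, init_config, instances)
            queries = list(CORE_QUERIES)
            if self.instance.get('collect_niche_metrics', False):
                # developer-oriented metrics are collected only when enabled
                queries.extend(NICHE_QUERIES)
            self._query_manager = QueryManager(self, self._execute_query, queries=queries)
            self.check_initializations.append(self._query_manager.compile_queries)

        def _execute_query(self, query):
            # placeholder executor; the real check runs the query against
            # ClickHouse and returns rows
            return []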

ekopanev · Oct 31 '25 17:10

Hello @ekopanev,

Thank you for the thorough reply! This looks good to me in terms of metric volume; we won't need to create any extra parameter for enabling/disabling the collection of certain metrics.

dkirov-dd · Nov 03 '25 12:11

Hey @ekopanev, I'm just checking in to know if you're still interested/working on this feature.

dkirov-dd · Nov 24 '25 16:11

> Hey @ekopanev, I'm just checking in to know if you're still interested/working on this feature.

Hey @dkirov-dd,

Sure, and I'm already working on what we've discussed above.

Could you please also give me some recommendations regarding https://github.com/DataDog/integrations-core/pull/21294#discussion_r2482232958 ? Currently I'm using the same approach as you already have for "deprecated" metrics (with a second query to the same table, but with different naming) - link

ekopanev · Nov 24 '25 16:11

@ekopanev good to know! I responded to your previous comment, apologies for the oversight.

dkirov-dd · Nov 24 '25 17:11

Hey @dkirov-dd, I've updated the PR recently and hope I've addressed everything we discussed above, PTAL.

And here is a detailed list of changes:

  • added the latest 3 LTS ClickHouse versions (24.8, 25.3, 25.8):
    • added all possible metrics from 4 system tables (asynchronous_metrics, errors, events and metrics), based on the ClickHouse source code, via a code generation script
    • new metric prefixes: names now contain the corresponding system table, i.e. clickhouse.<system-table>.<original-metric-name> (illustrated in the sketch after this list)
    • added the 24.8, 25.3, 25.8 versions for integration and e2e tests
    • queries were split: the new ones are named "advanced" and the current ones "legacy" (since the latter are a mix of old/new metrics across different ClickHouse versions, currently tested against the old version list 18/19/20/21.8/22.7); both sets can be toggled via the use_legacy_queries and use_advanced_queries instance settings (both true by default)
    • added "conditional" queries for the agent, because the system.errors table was introduced only in 20.11, so we can't use it with older ClickHouse versions
  • the default dashboard was updated based on the "advanced" metrics, with lots of new panels/metrics added
  • the logs template was updated with a JSON log format; it now supports both plain and JSON log formats
  • the docker compose setup for integration/e2e tests was simplified: it now uses only 1 ClickHouse and 1 Keeper/Zookeeper node, so the Keeper setup should be faster and more stable
  • the metadata file was also split in two: metadata-legacy.csv holds the original legacy metrics (added and maintained manually), and it is merged into metadata.csv together with all the new metrics from the metrics generator
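
To make the naming scheme and the version gating above concrete, here's a minimal hedged sketch; the helper names are hypothetical, not the PR's actual code:

    # Hypothetical helpers illustrating two points from the list above; the
    # real integration may structure this differently.
    SYSTEM_TABLES = ('asynchronous_metrics', 'errors', 'events', 'metrics')

    def new_metric_name(table, raw_name):
        # New prefix scheme: clickhouse.<system-table>.<original-metric-name>
        return 'clickhouse.{}.{}'.format(table, raw_name)

    def tables_for(server_version):
        # system.errors only exists on ClickHouse 20.11+, so it is queried
        # conditionally depending on the server version
        if server_version < (20, 11):
            return tuple(t for t in SYSTEM_TABLES if t != 'errors')
        return SYSTEM_TABLES

    # e.g. new_metric_name('asynchronous_metrics', 'FilesystemCacheBytes')
    # -> 'clickhouse.asynchronous_metrics.FilesystemCacheBytes'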

ekopanev · Dec 09 '25 13:12