[DocDB] Prometheus Scrape Timeout Exceeds 15 Seconds with 18,000 Tablets
Jira Link: DB-13314
Description
We have observed that Prometheus metric scraping takes more than 10 seconds when there are 3,000 tables and 18,000 tablets, which causes the Prometheus target to frequently go down. A sample output includes:
"scrapePool": "yugabyte",
"scrapeUrl": "http://<address>:9000/prometheus-metrics?priority_regex=......",
"lastError": "Get \"http://<address>:9000/prometheus-metrics?priority_regex=......&show_help=false\": context deadline exceeded",
"lastScrapeDuration": 10.00115163,
"health": "down",
"scrapeInterval": "10s",
"scrapeTimeout": "10s"
The lastError shows a "context deadline exceeded" message, indicating a timeout when scraping metrics, and "health=down" means the Prometheus target is down.
Issue Type
kind/enhancement
Warning: Please confirm that this issue does not contain any sensitive information
- [X] I confirm this issue does not contain any sensitive information.
Tested on a TServer with 4,000 tables and 14,000 tablets.
Each scrape took approximately 15 seconds when using the default metric scraping URL parameters:
```
/prometheus-metrics?show_help=false&priority_regex=rocksdb_(number_db_(next|seek|prev)|db_iter_bytes_read|block_cache_(add|single_touch_add|multi_touch_add)|current_version_(sst_files_size|num_sst_files)|db_([^_]+_micros_[^_]+|mutex_wait_micros)|block_cache_(hit|miss)|bloom_filter_(checked|useful)|stall_micros|flush_write_bytes|compact_[^_]+_bytes|compaction_times_micros_[^_]+|numfiles_in_singlecompaction_[^_]+)|mem_tracker_(RegularDB_MemTable|IntentsDB_MemTable)|mem_tracker_server_PerTablet_(RegularDB_MemTable|IntentsDB_MemTable)|mem_tracker_server_Tablets_overhead_PerTablet_(RegularDB_MemTable|IntentsDB_MemTable)|async_replication_[^_]+_lag_micros|consumer_safe_time_[^_]+|transaction_conflicts|majority_sst_files_rejections|expired_transactions|log_(sync_latency_[^_]+|group_commit_latency_[^_]+|append_latency_[^_]+|bytes_logged|reader_bytes_read|cache_size|cache_num_ops)|follower_lag_ms|[^_]+_memory_pressure_rejections|log_wal_size|ql_read_latency_[^_]+|(all|write)_operations_inflight|ql_write_latency_[^_]+|write_lock_latency_[^_]+|is_raft_leader|ts_live_tablet_peers
```
During scraping, perf record was executed on the TServer and a flamegraph was generated (flamegraph link).
The hash map allocator stack from the flamegraph may be contributing to the slow scraping performance.
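To illustrate the suspected cost pattern, here is a standalone micro-benchmark sketch (not YugabyteDB code; the attribute keys other than the four erased ones and the entry count are assumptions, chosen to roughly match a scrape over ~18,000 tablets). It compares copying and pruning a small attribute map per entry against reading the map in place:

```cpp
// Standalone sketch, not YugabyteDB code. Build: g++ -O2 -std=c++17 bench.cc
#include <chrono>
#include <cstdio>
#include <string>
#include <unordered_map>

using AttributeMap = std::unordered_map<std::string, std::string>;

int main() {
  // Hypothetical attribute set; only the four erased keys come from the issue.
  const AttributeMap attr = {
      {"metric_id", "tablet-0000"},   {"table_id", "t1"},
      {"table_name", "foo"},          {"table_type", "PGSQL"},
      {"namespace_name", "yugabyte"}, {"exported_instance", "ts-1"}};
  // Rough order of magnitude: tens of metrics per tablet times ~18,000 tablets.
  constexpr int kEntries = 1'000'000;
  std::size_t sink = 0;

  auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < kEntries; ++i) {
    AttributeMap new_attr = attr;   // allocates buckets, nodes and strings
    new_attr.erase("table_id");     // ...only to discard four of them
    new_attr.erase("table_name");
    new_attr.erase("table_type");
    new_attr.erase("namespace_name");
    sink += new_attr.size();
  }
  auto copy_ms = std::chrono::duration_cast<std::chrono::milliseconds>(
      std::chrono::steady_clock::now() - t0).count();

  t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < kEntries; ++i) {
    for (const auto& [key, value] : attr) {  // read in place, no allocation
      if (key == "table_id" || key == "table_name" ||
          key == "table_type" || key == "namespace_name") {
        continue;
      }
      sink += value.size();
    }
  }
  auto skip_ms = std::chrono::duration_cast<std::chrono::milliseconds>(
      std::chrono::steady_clock::now() - t0).count();

  std::printf("copy+erase: %lld ms, skip in place: %lld ms (sink=%zu)\n",
              static_cast<long long>(copy_ms),
              static_cast<long long>(skip_ms), sink);
  return 0;
}
```

The copy-and-erase variant performs a hash-map allocation plus several string allocations for every entry, which is consistent with an allocator-dominated stack in the flamegraph.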
Blocking metric export at both the server and table levels (producing empty output) took approximately 3 seconds to complete with the following URL parameters:
```
/prometheus-metrics?show_help=false&version=v2&table_blocklist=ALL&server_blocklist=ALL
```
The PrometheusWriter::WriteSingleEntry map allocation stack from the above flamegraph originates from this code:
```cpp
// For every exported entry, the attribute map is copied in full and the
// per-table labels are then erased from the copy.
MetricEntity::AttributeMap new_attr = attr;
new_attr.erase("table_id");
new_attr.erase("table_name");
new_attr.erase("table_type");
new_attr.erase("namespace_name");
```
This GitHub issue will track the commit that addresses this specific performance issue. Further improvements to metric scraping performance are being tracked in https://github.com/yugabyte/yugabyte-db/issues/24565.
With this optimization, on a 4-core node with 4,000 tables and 18,000 tablets, the scrape time in normal mode was reduced from 18 seconds to 13 seconds.