
Latency aggregates produce confusing values for percentiles

Open vladzcloudius opened this issue 1 year ago • 16 comments

Installation details
Panel Name: Latencies
Dashboard Name: Detailed
Scylla-Monitoring Version: 4.3.4
Scylla-Version: 2021.1.12 (this is most likely irrelevant)

Description: The rlatencypXX/wlatencypXX aggregates sometimes produce very confusing percentile values due to internal calculation errors. As a result, the values are sometimes shifted and sometimes dramatically different from what the histogram_quantile function over the raw metric returns:

These are the aggregate values: [screenshot]

And here are the values calculated by histogram_quantile alongside the values above (notice the shift):

[screenshot]

Here is a different example, where the percentile values were significantly lower with the aggregates than with the raw metric (which gives the correct value). Notice the difference of a few orders of magnitude!

(aggregates only) [screenshot]

(aggregates + raw values): [screenshot]

These "artifacts" make the debugging using Monitoring much harder than needed. It's impossible to rely on graphs given such huge errors.

I suggest we stop using those aggregates and go back to using histogram_quantile directly.

The main reason we started using those aggregates was the huge load histogram_quantile was creating while calculating values for the internal scheduling groups. Since we are filtering those out now, there is a good chance we no longer need to aggregate on the Prometheus side.
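
For reference, this is the kind of direct query that would mean, as a minimal sketch only; the metric, label, and scheduling-group names below are assumptions and may differ between ScyllaDB versions:

# 95th percentile write latency straight from the raw histogram, keeping only
# user-facing scheduling groups (label and group names are assumed, not verified):
histogram_quantile(0.95,
  sum by (instance, le) (
    rate(scylla_storage_proxy_coordinator_write_latency_bucket{scheduling_group_name=~"statement|sl:.*"}[1m])))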

vladzcloudius avatar May 24 '23 18:05 vladzcloudius

cc @mykaul

vladzcloudius avatar May 24 '23 18:05 vladzcloudius

cc @tomer-sandler @harel-z @gcarmin

vladzcloudius avatar May 24 '23 19:05 vladzcloudius

@amnonh - what's the next step here?

mykaul avatar Jun 13 '23 18:06 mykaul

Wait for 5.3 and 2023.1, which should change how latency is calculated, and then decide.

amnonh avatar Jun 13 '23 19:06 amnonh

Wait for 5.3 and 2023.1, which should change how latency is calculated, and then decide.

We need a solution for 2022.x, @amnonh. We can't wait for 2023.1.

Wouldn't filtering out the "trash scheduling classes" histograms at the Monitoring level give us enough headroom to avoid these aggregates altogether?

vladzcloudius avatar Jun 13 '23 19:06 vladzcloudius

@vladzcloudius let me try to explain. Calculating quantiles over many histograms and a long duration is enough to crash Prometheus; we've seen that in the past. To overcome this, we use recording rules to calculate smaller chunks incrementally.

That moved the problem to the recording-rule calculation. For 2022.x we address it in two ways: one, reduce the recording-rule calculation interval, and two, in 4.4.x, drop the internal latency histograms from the calculation.
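
As a minimal sketch of what "calculate smaller chunks incrementally" means in practice; the group name, rule name, and expression here are hypothetical and do not match the actual rules shipped with scylla-monitoring:

groups:
  - name: scylla_latency_aggregates   # hypothetical group name
    interval: 20s                     # the recording/evaluation interval discussed below
    rules:
      # Pre-aggregate the per-bucket write-latency rate so that
      # histogram_quantile only runs over the much smaller recorded series.
      - record: wlatency_bucket:rate1m
        expr: sum by (instance, le) (rate(scylla_storage_proxy_coordinator_write_latency_bucket[1m]))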

In 5.3 and 2023.1, most of those calculations will be moved to Scylla itself, so I expect a great improvement, but we will need to see it live to be sure.

Right now, if your cluster doesn't have millions of metrics (you can verify that at http://{ip}:9090/tsdb-status) and the recording-rule calculation does not take too long (you can verify that at http://{ip}:9090/rules), you can reduce the evaluation_interval to 20s; that would make the graphs look nicer.
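
The same checks can also be done from the command line, as a rough sketch (assuming curl and jq are available; exact field names may vary with the Prometheus version):

# Approximate number of series in the TSDB head block:
curl -s http://{ip}:9090/api/v1/status/tsdb | jq '.data.headStats.numSeries'
# How long each recording-rule group takes to evaluate (in seconds):
curl -s http://{ip}:9090/api/v1/rules | jq '.data.groups[] | {name, evaluationTime}'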

There are two ways of changing that. One (I think the simpler one) is to edit prometheus/prometheus.yml.template and change evaluation_interval to 20s.

Alternatively, you can pass a command-line option to start-all.sh: --evaluation-interval 20s
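
For example (a sketch only; keep whatever other start-all.sh options you normally use):

./start-all.sh --evaluation-interval 20s <your usual start-all.sh options>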

That would, of course, increase the load on the Prometheus server, so if you do change it, monitor the server to make sure it can handle the load.

I would suggest dropping cas and cdc if possible. And of course, I'm available to try it together

amnonh avatar Jun 13 '23 20:06 amnonh

@amnonh there must be a typo in your patch or something. I don't see how #2022 fixes this issue. Reopening

vladzcloudius avatar Jul 21 '23 22:07 vladzcloudius

@vladzcloudius Can you verify that the issue is solved with ScyllaDB > 2023.1 and setting the recording rule calculation to 20s?

amnonh avatar Oct 26 '23 14:10 amnonh

@vladzcloudius ping

amnonh avatar Feb 08 '24 09:02 amnonh

@vladzcloudius ping

amnonh avatar Feb 20 '24 09:02 amnonh

@vladzcloudius Can you verify that the issue is solved with ScyllaDB > 2023.1 and setting the recording rule calculation to 20s?

Why do we need to increase recording rules calculation to 20s with 2023.1, @amnonh?

vladzcloudius avatar Feb 20 '24 21:02 vladzcloudius

@vladzcloudius It's a decrease, not an increase. If it's 20s or lower, it's fine; before 2023.1 it used to be 1m, and that is the root cause of the issue.

Post 2023.1, with a low recording-rule calculation interval, I expect the issue to be solved.

amnonh avatar Feb 21 '24 07:02 amnonh

@vladzcloudius It's a decrease, not an increase. If it's 20s or lower, it's fine; before 2023.1 it used to be 1m, and that is the root cause of the issue.

Post 2023.1, with a low recording-rule calculation interval, I expect the issue to be solved.

So no need to change the recording rule calculation with 2023.1, right?

vladzcloudius avatar Feb 21 '24 15:02 vladzcloudius

You do need to change evaluation_interval to 20s. Either edit prometheus/prometheus.yml.template:

$ head -4 prometheus/prometheus.yml.template
global:
  scrape_interval: 20s # By default, scrape targets every 20 seconds.
  scrape_timeout: 15s # Timeout before trying to scrape a target again
  evaluation_interval: 20s # <--- this used to be 60s; it should be 20s

or use the --evaluation-interval command-line parameter to start-all.sh.

The next scylla-monitoring release will have it set to 20s by default.

amnonh avatar Feb 21 '24 16:02 amnonh

@vladzcloudius ping

amnonh avatar Mar 06 '24 08:03 amnonh

@vladzcloudius ping, can you check whether things work as expected with 2024.1 and the latest monitoring release?

amnonh avatar May 12 '24 12:05 amnonh

@vladzcloudius I'm closing this as completed; please re-open if there's still an issue.

amnonh avatar May 15 '24 14:05 amnonh