
Timer/distribution summary snapshot values not self-consistent

Open john-karp opened this issue 6 years ago • 15 comments

While the name might suggest otherwise, the timer/distribution summary takeSnapshot() implementations do not atomically capture statistics, so count(), max(), and histogramCounts() are not necessarily self-consistent. This can lead to situations where the total counts of the histogram buckets do not equal snapshot.count(). A downstream system consuming the metrics might also compute incorrect derived values (e.g. total / count) because the time periods the individual statistics cover are effectively different.
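To make this concrete, a minimal sketch along these lines (not taken from the report) can surface the mismatch: one thread records a fixed 100ms duration in a tight loop while the main thread takes snapshots and compares snapshot.count() with the last SLO bucket's count. Since every recording is below the smallest SLO, any short-run mismatch comes from the non-atomic capture.

import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.distribution.CountAtBucket;
import io.micrometer.core.instrument.distribution.HistogramSnapshot;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

import java.time.Duration;

public class SnapshotConsistencyDemo {
    public static void main(String[] args) throws InterruptedException {
        SimpleMeterRegistry registry = new SimpleMeterRegistry();
        Timer timer = Timer.builder("demo.timer")
                .serviceLevelObjectives(Duration.ofSeconds(1), Duration.ofSeconds(5))
                .register(registry);

        // Record continuously from a second thread while snapshots are taken below.
        Thread recorder = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                timer.record(Duration.ofMillis(100)); // always below the 1s SLO
            }
        });
        recorder.start();

        for (int i = 0; i < 10_000; i++) {
            HistogramSnapshot snapshot = timer.takeSnapshot();
            CountAtBucket[] buckets = snapshot.histogramCounts();
            double lastBucket = buckets[buckets.length - 1].count();
            if (snapshot.count() != (long) lastBucket) {
                System.out.printf("inconsistent: count=%d, last bucket=%.0f%n",
                        snapshot.count(), lastBucket);
            }
        }
        recorder.interrupt();
        recorder.join();
    }
}

The count and the histogram bucket counts are maintained by separate structures and read one after the other in takeSnapshot(), which is why the comparison above can transiently disagree.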

john-karp avatar Nov 09 '18 22:11 john-karp

For us this is not a rare case; it happens more or less all the time, so at the moment this is a blocker for us. Are there any thoughts about whether and how this could be solved? It would be great if you could point me in the right direction.

jrask avatar Dec 15 '18 08:12 jrask

Without thinking too much about it, it seems like in the presence of histograms you could just take the histogram sum as the count. Neither value is really more correct.

jkschneider avatar Dec 15 '18 22:12 jkschneider

Thanks, so there is no plan or proposal for how to solve this for "real"? My guess was that this is a concurrency issue. I could look into it; I was just hoping to get some initial introduction to the problem.

What we are considering is that there are two scenarios. One of them is fine: when we have durations that are longer than our max SLA. In that case it would be OK for Count to be higher than CountAtBucket[lastValue].count(), right?

So would a workaround like this be good enough?

timer.count > CountAtBucket[lastValue].count && timer.max <= SLA.max
=> rely on CountAtBucket[lastValue].count as the count; do nothing else.

timer.count > CountAtBucket[lastValue].count && timer.max > SLA.max
=> rely on timer.count as the count, and add an extra bucket, from SLA.max to +Inf,
   holding timer.count - CountAtBucket[lastValue].count.
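Roughly, in code, that heuristic would look something like the sketch below (hypothetical class and method names; it assumes the timer was built with serviceLevelObjectives(...), so the last bucket boundary is the SLA max in nanoseconds, the same unit snapshot.max() uses):

import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.distribution.CountAtBucket;
import io.micrometer.core.instrument.distribution.HistogramSnapshot;

final class SlaWorkaround {
    // Hypothetical sketch of the two scenarios described above.
    static long countForPublish(Timer timer) {
        HistogramSnapshot snapshot = timer.takeSnapshot();
        CountAtBucket[] buckets = snapshot.histogramCounts();
        double lastBucketCount = buckets[buckets.length - 1].count();
        double slaMaxNanos = buckets[buckets.length - 1].bucket();
        long count = snapshot.count();

        if (count > lastBucketCount && snapshot.max() <= slaMaxNanos) {
            // Scenario 1: nothing exceeded the SLA max, so the mismatch is only
            // the snapshot race; report the last bucket's count as the count.
            return (long) lastBucketCount;
        }
        if (count > lastBucketCount && snapshot.max() > slaMaxNanos) {
            // Scenario 2: some recordings exceeded the SLA max; keep the timer
            // count, and when publishing, emit an extra (SLA max, +Inf) bucket
            // holding count - lastBucketCount.
            return count;
        }
        return count;
    }
}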

jrask avatar Dec 17 '18 08:12 jrask

I think we could always rely on the count on a +Inf bucket to be the count when histograms are being recorded. They can only diverge now because count is maintained independently of histogram counts, so if a snapshot is taken between when count has been updated and the histogram is updated, they can appear inconsistent. Neither value is "more correct" in absolute terms, but consistency within a snapshot is preferable.
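In other words, something along these lines when publishing a snapshot (a sketch only, not the actual implementation; whether the bucket list ends in a +Inf bucket depends on the registry and distribution configuration, hence the guard):

import io.micrometer.core.instrument.distribution.CountAtBucket;
import io.micrometer.core.instrument.distribution.HistogramSnapshot;

final class InfBucketCount {
    // Prefer the top bucket's count over the independently maintained count
    // when the histogram ends in an effectively unbounded (+Inf) bucket, so
    // count and bucket counts agree within one snapshot.
    static long consistentCount(HistogramSnapshot snapshot) {
        CountAtBucket[] buckets = snapshot.histogramCounts();
        if (buckets.length > 0 && Double.isInfinite(buckets[buckets.length - 1].bucket())) {
            return (long) buckets[buckets.length - 1].count();
        }
        return snapshot.count();
    }
}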

jkschneider avatar Dec 18 '18 15:12 jkschneider

Hi, I just have to leave a comment on this again since I really do not understand what happens. The output below is the result of a Timer being invoked with record() every 100ms, with step=2s (I have tried different intervals, same result). Everything just runs in a while-true loop.

The result shows a big difference between the timer/snapshot count and the histogram's cumulative bucket counts. Is this really just an "atomic capture" issue? Anyway, if we are not able to get this to work properly I do not really know what to do. This is not a rare situation; it happens all the time.

It is not really useful to use count=13 when the reality is 20, is it?
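Roughly what that setup looks like, as a guess (assuming a SimpleMeterRegistry in STEP mode with a 2s step and the 1s/5s/7s SLAs visible in the output below; the actual code may differ):

import io.micrometer.core.instrument.Clock;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.CountingMode;
import io.micrometer.core.instrument.simple.SimpleConfig;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

public class StepSnapshotDemo {
    public static void main(String[] args) throws InterruptedException {
        SimpleConfig stepConfig = new SimpleConfig() {
            @Override public String get(String key) { return null; }
            @Override public CountingMode mode() { return CountingMode.STEP; }
            @Override public Duration step() { return Duration.ofSeconds(2); }
        };
        SimpleMeterRegistry registry = new SimpleMeterRegistry(stepConfig, Clock.SYSTEM);
        Timer timer = Timer.builder("demo.timer")
                .serviceLevelObjectives(Duration.ofSeconds(1), Duration.ofSeconds(5), Duration.ofSeconds(7))
                .register(registry);

        while (true) {
            // record() is called every ~100ms with a random duration, as described above
            timer.record(Duration.ofMillis(ThreadLocalRandom.current().nextLong(5000)));
            Thread.sleep(100);
            System.out.println("Timer.cnt=" + timer.count() + ", " + timer.takeSnapshot());
        }
    }
}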

Start

The histogram max value jumps between 20 and 13 and then stabilizes around 13-14.

Timer.cnt=20, Histogram.cnt=13, HistogramSnapshot{count=20, total=4.667E10, mean=2.3335E9, max=4.866E9, percentileValues=[(2.277507072E9 at 50.0%), (2.948595712E9 at 75.0%), (4.559208448E9 at 95.0%), (4.827643904E9 at 99.0%)], histogramCounts=[(4.0 at 1000000000), (13.0 at 5000000000), (13.0 at 7000000000)]}

Timer.cnt=20, HistogramSnapshot{count=20, total=4.667E10, mean=2.3335E9, max=4.866E9, percentileValues=[(2.07618048E9 at 50.0%), (2.545942528E9 at 75.0%), (4.290772992E9 at 95.0%), (4.290772992E9 at 99.0%)], histogramCounts=[(4.0 at 1000000000), (19.0 at 5000000000), (19.0 at 7000000000)]}

Timer.count=19, HistogramSnapshot{count=19, total=5.8642E10, mean=3.086421052631579E9, max=4.772E9, percentileValues=[(2.948595712E9 at 50.0%), (4.559208448E9 at 75.0%), (4.827643904E9 at 95.0%), (4.827643904E9 at 99.0%)], histogramCounts=[(1.0 at 1000000000), (19.0 at 5000000000), (19.0 at 7000000000)]}

Timer.cnt=19, HistogramSnapshot{count=19, total=4.2482E10, mean=2.2358947368421054E9, max=4.606E9, percentileValues=[(2.277507072E9 at 50.0%), (3.08281344E9 at 75.0%), (4.290772992E9 at 95.0%), (4.827643904E9 at 99.0%)], histogramCounts=[(3.0 at 1000000000), (19.0 at 5000000000), (19.0 at 7000000000)]}

Timer.cnt=20, HistogramSnapshot{count=20, total=5.2304E10, mean=2.6152E9, max=4.926E9, percentileValues=[(2.948595712E9 at 50.0%), (3.619684352E9 at 75.0%), (4.156555264E9 at 95.0%), (5.09607936E9 at 99.0%)], histogramCounts=[(1.0 at 1000000000), (13.0 at 5000000000), (13.0 at 7000000000)]}

Timer.cnt=19, HistogramSnapshot{count=19, total=4.3376E10, mean=2.2829473684210525E9, max=4.933E9, percentileValues=[(2.950692864E9 at 50.0%), (3.621781504E9 at 75.0%), (4.5613056E9 at 95.0%), (5.098176512E9 at 99.0%)], histogramCounts=[(4.0 at 1000000000), (13.0 at 5000000000), (13.0 at 7000000000)]}

However, one time I forgot to stop the program, and when I came back I noticed that the histogram max count was slowly but constantly increasing and reached the following, which looks promising; it almost reaches the total count, sitting at 19 or 20.

Timer.cnt=20, HistogramSnapshot{count=20, total=4.3065E10, mean=2.15325E9, max=4.673E9, percentileValues=[(2.415886336E9 at 50.0%), (3.355410432E9 at 75.0%), (4.1607168E9 at 95.0%), (4.83180544E9 at 99.0%)], histogramCounts=[(5.0 at 1000000000), (19.0 at 5000000000), (19.0 at 7000000000)]}

Timer.cnt=20, HistogramSnapshot{count=20, total=4.4401E10, mean=2.22005E9, max=4.943E9, percentileValues=[(1.946124288E9 at 50.0%), (4.026499072E9 at 75.0%), (4.83180544E9 at 95.0%), (5.100240896E9 at 99.0%)], histogramCounts=[(8.0 at 1000000000), (20.0 at 5000000000), (20.0 at 7000000000)]}

Timer.cnt=20, HistogramSnapshot{count=20, total=5.1766E10, mean=2.5883E9, max=4.94E9, percentileValues=[(2.550104064E9 at 50.0%), (3.48962816E9 at 75.0%), (5.100240896E9 at 95.0%), (5.100240896E9 at 99.0%)], histogramCounts=[(4.0 at 1000000000), (20.0 at 5000000000), (20.0 at 7000000000)]}

Timer.cnt=19, HistogramSnapshot{count=19, total=4.4872E10, mean=2.3616842105263157E9, max=4.977E9, percentileValues=[(2.013233152E9 at 50.0%), (4.026499072E9 at 75.0%), (4.83180544E9 at 95.0%), (5.100240896E9 at 99.0%)], histogramCounts=[(5.0 at 1000000000), (19.0 at 5000000000), (19.0 at 7000000000)]}

And then, all of a sudden, it dropped to 13-14 again.

jrask avatar Jan 21 '19 14:01 jrask

@jkschneider Hi again, it really seems like timing is everything here. I changed this gist https://gist.github.com/jrask/52c70475ac744af56f88e43fab80a7c9 and just added

for (Meter m : getMeters()) {
    if (m instanceof Timer) {
        Thread.sleep(20); // artificial delay before taking the snapshot
        Timer t = (Timer) m;
        HistogramSnapshot histogramSnapshot = t.takeSnapshot();
        System.out.println(t.count() + ", " + histogramSnapshot);
        ...
}

before handling each meter (I only have one). This changed the output from correct results to a noticeable discrepancy. In my code, since I do things like custom JSON generation, I can imagine that 20ms can pass from the time that publish() is invoked until I use the snapshot. (Note, this is not related to my comment above; that is another test case, this is "raw" Micrometer.)

Increasing the sleep does not seem to change much.

Anyway, perhaps you already knew this, but for me this proves that it is broken. I am not really sure what to do next, or whether this is solvable. It is "too" wrong to be useful; I cannot rely on the milliseconds being on my side.

No sleep

29, HistogramSnapshot{count=29, total=4.35E9, mean=1.5E8, max=1.5E8, histogramCounts=[(0.0 at 100000000), (29.0 at 200000000), (29.0 at 500000000)]}
30, HistogramSnapshot{count=30, total=4.5E9, mean=1.5E8, max=1.5E8, histogramCounts=[(0.0 at 100000000), (30.0 at 200000000), (30.0 at 500000000)]}

To this, where we lose around 10 in the histogram buckets' total count.

With sleep(20)

30, HistogramSnapshot{count=30, total=4.5E9, mean=1.5E8, max=1.5E8, histogramCounts=[(0.0 at 100000000), (20.0 at 200000000), (20.0 at 500000000)]}
29, HistogramSnapshot{count=29, total=4.35E9, mean=1.5E8, max=1.5E8, histogramCounts=[(0.0 at 100000000), (29.0 at 200000000), (29.0 at 500000000)]}

jrask avatar Jan 21 '19 15:01 jrask

Hi, I've faced this issue as well. I found it is not self-consistent not only between the SLO values and the total counter (the attached screenshot showed a query returning the percentage in a specific bucket) but between bucket boundaries as well (second screenshot).

I've played with aggregations in my queries and found that it is not because of that, and kept the most "self-descriptive" screenshots.

From what I see on my dashboards, most of the fluctuation is near the last two buckets.

I even created a "middle" bucket (which should contain no data) and don't see such behavior there.

Fetsivalen avatar Dec 04 '20 16:12 Fetsivalen

Also, I've played with the heatmap visualization for latency and realized that the same thing happens with percentiles as well as with bucket values.

Fetsivalen avatar Dec 04 '20 16:12 Fetsivalen

Hello, I noticed a problem in 2 of our applications using StackdriverMeterRegistry from Micrometer 1.6.6. The errors look like this: com.google.api.gax.rpc.InvalidArgumentException: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Field timeSeries[6].points[0].distributionValue had an invalid value: Distribution value has |bucket_counts| fields that sum to 1 which does not equal the |count| field value of 0.

I believe it is related to the summary snapshot inconsistency. Many systems may not validate this, but Stackdriver rejects such a metric.

I can see that this issue is not of high priority since it has been open for 3 years - can we count on a higher priority?

rafal-dudek avatar Oct 29 '21 11:10 rafal-dudek

Hello, we got this error too in our Grafana dashboard: we get negative values in the +Inf bucket, leading us to think that we are getting very high response times for our requests (see the attached heatmap screenshot).


The associated heatmap query is:

sum by (le)(rate(http_server_requests_seconds_bucket{namespace="$namespace", job="$job",method=~"$method",uri=~"$uri",status=~"$status",le=~"0.001|0.003|0.03|0.1|0.3|0.5|1.0|1.5|3.0|5.0|10.0|.Inf"}[$__rate_interval]))

cschockaert avatar Oct 31 '21 20:10 cschockaert

Interestingly, running on Micrometer 1.10.x we are not troubled by the Stackdriver error reported above (i.e. there are no complaints in the logs), but as soon as we upgrade to 1.11.x or 1.12.x we see similar warnings in our logs.

marthursson avatar Jan 24 '24 12:01 marthursson

We're seeing the same problem with Micrometer 1.11 - not sure if consistency is the problem here, but we noticed that the |count| field is pretty much always 0 while the bucket counts aren't.

Trying to debug the code, I'm consistently seeing that the count in the underlying StepTimer is always 0 when the summary is calculated. A potential workaround would be to disregard the count when creating the distribution for Stackdriver (although, looking at the issue, this may not be related just to Stackdriver).

gmucha avatar Feb 06 '24 11:02 gmucha

Interestingly, quickly looking at the code, there is a lot of work involved in calculating the bucket counts. I may be missing some important aspect here (I've basically just looked at this particular function), but since the individual counts are supposed (by the API) to sum up to the total count, it should be possible to just calculate the total count in the same function rather than accepting it as a parameter.
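For what it is worth, a sketch of that idea (not the actual StackdriverMeterRegistry code; class and method names here are made up) would derive both the per-bucket counts and the total from the snapshot's cumulative histogram counts, so the two can never disagree:

import io.micrometer.core.instrument.distribution.CountAtBucket;
import io.micrometer.core.instrument.distribution.HistogramSnapshot;

final class ConsistentDistribution {
    // Micrometer's histogramCounts() are cumulative; convert them to
    // per-bucket counts. Their sum equals the last cumulative value, so using
    // that value as the distribution's total count keeps the invariant that
    // Stackdriver enforces (bucket_counts must sum to count).
    static long[] bucketDeltas(HistogramSnapshot snapshot) {
        CountAtBucket[] cumulative = snapshot.histogramCounts();
        long[] deltas = new long[cumulative.length];
        long previous = 0;
        for (int i = 0; i < cumulative.length; i++) {
            long current = (long) cumulative[i].count();
            deltas[i] = current - previous;
            previous = current;
        }
        return deltas;
    }

    static long totalCount(HistogramSnapshot snapshot) {
        CountAtBucket[] cumulative = snapshot.histogramCounts();
        return cumulative.length == 0
                ? snapshot.count()
                : (long) cumulative[cumulative.length - 1].count();
    }
}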

marthursson avatar Feb 06 '24 11:02 marthursson

We have been observing the same issue ever since we migrated to Spring Boot 3.2.2 (Micrometer 1.12).

com.google.api.gax.rpc.InvalidArgumentException: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: One or more TimeSeries could not be written: Field timeSeries[31].points[0].distributionValue had an invalid value: Distribution value has |bucket_counts| fields that sum to 4 which does not equal the |count| field value of 2.

Any update on this issue?

altunkan avatar Mar 21 '24 18:03 altunkan

Hello,

More and more applications in our company have problems with this. For example, with Micrometer 1.12.0:

failed to send metrics to Stackdriver
com.google.api.gax.rpc.InvalidArgumentException: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: One or more TimeSeries could not be written: Field timeSeries[82].points[0].distributionValue had an invalid value: Distribution value has |bucket_counts| fields that sum to 2 which does not equal the |count| field value of 0.;
 Field timeSeries[97].points[0].distributionValue had an invalid value: Distribution value has |bucket_counts| fields that sum to 2 which does not equal the |count| field value of 0.;
 Field timeSeries[39].points[0].distributionValue had an invalid value: Distribution value has |bucket_counts| fields that sum to 1 which does not equal the |count| field value of 0.;
 Field timeSeries[5].points[0].distributionValue had an invalid value: Distribution value has |bucket_counts| fields that sum to 2 which does not equal the |count| field value of 0.;
 Field timeSeries[131].points[0].distributionValue had an invalid value: Distribution value has |bucket_counts| fields that sum to 1 which does not equal the |count| field value of 0.;
 Field timeSeries[162].points[0].distributionValue had an invalid value: Distribution value has |bucket_counts| fields that sum to 2 which does not equal the |count| field value of 0.;
 Field timeSeries[196].points[0].distributionValue had an invalid value: Distribution value has |bucket_counts| fields that sum to 2 which does not equal the |count| field value of 0.

rafal-dudek avatar May 13 '24 06:05 rafal-dudek