[receiver/prometheusreceiver] implement append native histogram

Open krajorama opened this issue 1 year ago • 20 comments

Description:

Implement native histogram append MVP. Very similar to appending a float sample.

Limitations:

  • Only integer counter histograms are fully supported.
  • ZeroThreshold currently cannot be set in the OTEL model, so it always remains at the default. (Needs work in OTEL.)
  • If a histogram has both classic and native buckets, we only store one of them, governed by the scrape_classic_histograms scrape option. The reason is that in the OTEL model the metric family is identified by the normalized name (without the _count, _sum, _bucket suffixes of the classic histogram), meaning that the classic and native histograms would map to the same metric family in the OTEL model, which cannot have both Histogram and ExponentialHistogram types at the same time.
  • Gauge histograms are dropped with a warning, as that temporality is unsupported; see https://github.com/open-telemetry/opentelemetry-specification/issues/2714
  • The NoRecordedValue attribute might be unreliable. Prometheus marks all stale series with float NaN values, but transactions in prometheusreceiver are stateless, meaning we have to use heuristics to figure out whether to add a NoRecordedValue data point to an ExponentialHistogram metric. (Needs work in Prometheus.)
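The name collision behind the third limitation can be illustrated with a small sketch (normalizeName is a hypothetical helper, not the receiver's actual code, which also handles other suffixes such as _created):

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeName strips the classic-histogram series suffixes, illustrating
// why the classic and native versions of a histogram collide on the same
// metric family name in the OTEL model.
func normalizeName(name string) string {
	for _, suffix := range []string{"_count", "_sum", "_bucket"} {
		if strings.HasSuffix(name, suffix) {
			return strings.TrimSuffix(name, suffix)
		}
	}
	return name
}

func main() {
	// The classic series and the native histogram all map to the same family.
	fmt.Println(normalizeName("http_request_duration_seconds_bucket"))
	fmt.Println(normalizeName("http_request_duration_seconds_sum"))
	fmt.Println(normalizeName("http_request_duration_seconds")) // native histogram name
}
```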

Additionally:

  • Created timestamp is supported.
  • Float counter histograms are not fully tested and lose precision, but we don't expect instrumentation to expose these anyway.

Link to tracking Issue:

Fixes: #26555

Testing:

Added unit tests and e2e tests.

Documentation:

TBD: will have to call out that protobuf negotiation is required while there is no text format support. #27030

krajorama avatar Oct 27 '23 12:10 krajorama

It seems like I need to create a new metricFamily that has type MetricTypeExponentialHistogram.

The only issue is that Prometheus and Grafana Agent support scraping both the classic and native (exponential) histogram versions of the same metric, making the key (which is the normalized metric name) the same for both. So I'm not totally sure what to do here. Maybe reuse the metric family, but make it possible to have a metricGroup inside it that holds the exponential histogram. That would be possible to route for the prometheus remote write exporter, but I doubt it would work for otel.

Or for the MVP: do not enable scraping classic histograms when there is a native histogram. This will not cause an issue for OTEL. The only people that would get bitten by this are those trying to switch from prometheus to otel while using classic + native histograms at the same time (i.e. migrating).

krajorama avatar Oct 30 '23 17:10 krajorama

do not enable scraping classic histograms when there is a native histogram. This will not cause an issue for OTEL. The only people that would get bitten by this are those trying to switch from prometheus to otel while using classic + native histograms at the same time (i.e. migrating).

Could we add config to control whether we keep the native vs the fixed bucket histogram?

Is there any existing configuration in the prometheus server that controls whether native histograms are enabled or not?

dashpole avatar Oct 30 '23 18:10 dashpole

do not enable scraping classic histograms when there is a native histogram. This will not cause an issue for OTEL. The only people that would get bitten by this are those trying to switch from prometheus to otel while using classic + native histograms at the same time (i.e. migrating).

Could we add config to control whether we keep the native vs the fixed bucket histogram?

Is there any existing configuration in the prometheus server that controls whether native histograms are enabled or not?

Prometheus will only scrape native histograms if you enable the feature. You can also ask it to scrape the classic versions of native histograms (since the metric names don't conflict); see scrape_classic_histograms in the scrape config.
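As a sketch, a scrape config using that option could look like this (job name and target are placeholders; native histograms additionally require running the server with --enable-feature=native-histograms):

```yaml
scrape_configs:
  - job_name: example-app            # placeholder job name
    static_configs:
      - targets: ["localhost:9090"]  # placeholder target
    # Also scrape the classic bucket/sum/count series of metrics that
    # expose a native histogram (names don't conflict, so both coexist).
    scrape_classic_histograms: true
```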

krajorama avatar Nov 03 '23 15:11 krajorama

Ideally we would only scrape native histograms if the feature flag is enabled, and only scrape classic histograms if scrape_classic_histograms is set.

dashpole avatar Nov 03 '23 17:11 dashpole

@dashpole I've opened another PR #29153 to set up for testing native histogram scrape here e2e, ptal.

krajorama avatar Nov 13 '23 22:11 krajorama

Rebased onto #29153 and added first e2e test.

krajorama avatar Nov 14 '23 08:11 krajorama

@dashpole @bogdandrutu I'm currently looking at func (a *initialPointAdjuster) adjustMetricHistogram with a view to adding func (a *initialPointAdjuster) adjustMetricExponentialHistogram, but I'm a little unclear on how good this has to be.

I mean, it's not possible to tell with 100% certainty from a histogram whether it was reset between two scrapes.

A possible bug in the current code is checking the Sum, since negative observations will reduce the sum. On the other hand, it is possible that the Count increases while individual buckets decrease, so in theory the code should be looking at bucket counts.

krajorama avatar Nov 14 '23 16:11 krajorama

IIUC, the sum should not be reported if it includes negative observations (at least for "standard" histograms, since sum is a counter). https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md#histogram. Is sum not required to be monotonic for native histograms?

On the other hand, it is possible that the Count increases while individual buckets decrease, so in theory the code should be looking at bucket counts.

Can you clarify? Are you talking about buckets being merged? Or about the histogram resetting, but receiving more observations than it previously had since the reset?

dashpole avatar Nov 14 '23 16:11 dashpole

IIUC, the sum should not be reported if it includes negative observations (at least for "standard" histograms, since sum is a counter). https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md#histogram. Is sum not required to be monotonic for native histograms?

I made a simple test with observing some negative values and checked /metrics:

# HELP golang_manual_histogram This is a histogram with manually selected parameters
# TYPE golang_manual_histogram histogram
golang_manual_histogram_bucket{address="0.0.0.0",generation="20",port="5001",le="-5"} 1
golang_manual_histogram_bucket{address="0.0.0.0",generation="20",port="5001",le="-1"} 2
golang_manual_histogram_bucket{address="0.0.0.0",generation="20",port="5001",le="1"} 3
golang_manual_histogram_bucket{address="0.0.0.0",generation="20",port="5001",le="5"} 3
golang_manual_histogram_bucket{address="0.0.0.0",generation="20",port="5001",le="+Inf"} 3
golang_manual_histogram_sum{address="0.0.0.0",generation="20",port="5001"} -12
golang_manual_histogram_count{address="0.0.0.0",generation="20",port="5001"} 3

To quote the documentation: "The sum of observations (showing up as a time series with a _sum suffix) behaves like a counter, too, as long as there are no negative observations."

So there is no guarantee that the sum actually behaves like a counter.

On the other hand, it is possible that the Count increases while individual buckets decrease, so in theory the code should be looking at bucket counts.

Can you clarify? Are you talking about buckets being merged? Or about the histogram resetting, but receiving more observations than it previously had since the reset?

The second case. Suppose you have 4 observations in the positive bucket at offset 0, and that's the only bucket in use, so your count is 4. In the next scrape the schema doesn't change, and you see 2 observations in each of the positive buckets at offsets 0, 1, and 2, so your overall count is 6, but there was obviously a reset, since the bucket count at offset 0 went down.

See code starting from https://github.com/prometheus/prometheus/blob/1bfb3ed062e99bd3c74e05d9ff9a7fa4e30bbe21/tsdb/chunkenc/histogram.go#L272 .
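That bucket-level check can be sketched as follows (a simplified model that assumes both scrapes use the same schema and an aligned, contiguous bucket layout; the real Prometheus code linked above also handles schema changes and span layouts):

```go
package main

import "fmt"

// bucketCountsDecreased reports whether any bucket count went down between
// two scrapes, which indicates a counter reset even if the overall Count
// increased. Sketch only: assumes identical schema and bucket alignment.
func bucketCountsDecreased(prev, curr []uint64) bool {
	for i, p := range prev {
		// A bucket that existed before but is missing now also counts
		// as a decrease.
		if i >= len(curr) || curr[i] < p {
			return true
		}
	}
	return false
}

func main() {
	// The example from the discussion: one bucket with 4 observations,
	// then three buckets with 2 observations each. Count goes 4 -> 6,
	// but bucket 0 dropped from 4 to 2, so a reset must have happened.
	prev := []uint64{4}
	curr := []uint64{2, 2, 2}
	fmt.Println(bucketCountsDecreased(prev, curr)) // true
}
```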

But this is all academic if it's not actually that important to detect these cases :)

krajorama avatar Nov 14 '23 16:11 krajorama

The second case. Suppose you have 4 observations in the positive bucket at offset 0, and that's the only bucket in use, so your count is 4. In the next scrape the schema doesn't change, and you see 2 observations in each of the positive buckets at offsets 0, 1, and 2, so your overall count is 6, but there was obviously a reset, since the bucket count at offset 0 went down.

Those cases seem rare in practice, and we've accepted those thus far as unsolvable. I'm OK with not properly handling those.

It sounds like using count to detect resets is probably our best bet.

IIRC, prom protobuf is going to begin including start timestamps soon, so hopefully these problems are obsoleted over time.

dashpole avatar Nov 14 '23 16:11 dashpole

The second case. Suppose you have 4 observations in the positive bucket at offset 0, and that's the only bucket in use, so your count is 4. In the next scrape the schema doesn't change, and you see 2 observations in each of the positive buckets at offsets 0, 1, and 2, so your overall count is 6, but there was obviously a reset, since the bucket count at offset 0 went down.

Those cases seem rare in practice, and we've accepted those thus far as unsolvable. I'm OK with not properly handling those.

It sounds like using count to detect resets is probably our best bet.

IIRC, prom protobuf is going to begin including start timestamps soon, so hopefully these problems are obsoleted over time.

Understood. The one thing we have to make sure of is that the exporter/prometheusremotewrite MUST NOT set the ResetHint to "NO" unless it's absolutely certain that there was no reset (as defined by Prometheus). Currently the hint is left at "UNKNOWN", which is fine.

So basically it would only be possible to set "NO" with the created timestamp, since the current algorithm above is not compatible with the Prometheus definition of a counter reset. cc @Aneurysm9 (as maintainer of the exporter).
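For illustration, the rule can be expressed against a locally mirrored version of the remote-write reset hint values (the real enum lives in Prometheus's prompb package; this sketch only encodes the decision logic discussed here, and chooseResetHint is a hypothetical helper):

```go
package main

import "fmt"

// ResetHint mirrors the remote-write histogram reset hint values
// (UNKNOWN/YES/NO/GAUGE in Prometheus's prompb); local copy for
// illustration only.
type ResetHint int

const (
	ResetHintUnknown ResetHint = iota // reset status unknown (safe default)
	ResetHintYes                      // a reset definitely happened
	ResetHintNo                       // definitely no reset (only safe with a created/start timestamp)
	ResetHintGauge                    // gauge histogram, resets don't apply
)

// chooseResetHint encodes the rule from the discussion: "NO" may only be
// set when a created/start timestamp proves there was no reset; otherwise
// the exporter must leave the hint at "UNKNOWN".
func chooseResetHint(hasCreatedTimestamp, provenNoReset bool) ResetHint {
	if hasCreatedTimestamp && provenNoReset {
		return ResetHintNo
	}
	return ResetHintUnknown
}

func main() {
	fmt.Println(chooseResetHint(false, true)) // 0 (UNKNOWN): no created timestamp
	fmt.Println(chooseResetHint(true, true))  // 2 (NO)
}
```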

krajorama avatar Nov 15 '23 08:11 krajorama

@dashpole this is mostly finished; I want to add a reference to a Prometheus issue we need to create, and also test it in real life with a couple thousand native histogram metrics.

krajorama avatar Nov 15 '23 13:11 krajorama

Understood. The one thing we have to make sure of is that the exporter/prometheusremotewrite MUST NOT set the ResetHint to "NO" unless it's absolutely certain that there was no reset (as defined by Prometheus). Currently the hint is left at "UNKNOWN", which is fine.

So basically it would only be possible to set "NO" with the created timestamp, since the current algorithm above is not compatible with the Prometheus definition of a counter reset.

Would you mind adding a comment in the exporter(s), referencing prometheus' definition of reset to ensure that isn't changed? We could explicitly set it to unknown to make it clear that it is the correct value for us. It can be done separately from this PR.

dashpole avatar Nov 15 '23 14:11 dashpole

Would you mind adding a comment in the exporter(s), referencing prometheus' definition of reset to ensure that isn't changed? We could explicitly set it to unknown to make it clear that it is the correct value for us. It can be done separately from this PR.

Hi, created PR: https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/29326

Also this PR is now ready for review.

krajorama avatar Nov 17 '23 13:11 krajorama

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions[bot] avatar Dec 06 '23 05:12 github-actions[bot]

I've talked to the potential tester of this PR and they said February 2024 at the earliest. On the other hand, the use case will be to scrape Grafana Mimir with it, so when I have time I can do the testing myself, or someone from our team can. Hope to get some movement on this soon.

krajorama avatar Jan 12 '24 07:01 krajorama

This is exciting! If this gets merged, would the prometheusremotewrite exporter be able to export native histograms, or would that require additional code changes?

jmcarp avatar Feb 14 '24 15:02 jmcarp

Actually, @krajorama what kind of testing are you looking for? I'm interested in using this feature to scrape native histograms from an application, then forward to mimir as native histograms. I might have time to test this out if it would be helpful.

jmcarp avatar Feb 15 '24 04:02 jmcarp

Actually, @krajorama what kind of testing are you looking for? I'm interested in using this feature to scrape native histograms from an application, then forward to mimir as native histograms. I might have time to test this out if it would be helpful.

Hi @jmcarp, sorry about the long wait, I've been busy with other stuff. The use case is to scrape Mimir itself, as it exposes a native histogram, cortex_request_duration_seconds (I'm actually working on adding more). But any other application that has native histograms would be good as well; the point is to go beyond unit tests, run it for some time, and try restarts and realistic load - basically give it a spin.

The bad news is that it badly needs a rebase/merge from main now.

krajorama avatar Feb 29 '24 07:02 krajorama

This is exciting! If this gets merged, would the prometheusremotewrite exporter be able to export native histograms, or would that require additional code changes?

I have used the prometheusremotewrite exporter to write exponential histograms to Prometheus/Mimir, so that should indeed work.

krajorama avatar Feb 29 '24 07:02 krajorama

If testing will take more time, we can always merge it behind an alpha feature-gate to give time for testing. That way you don't need to keep rebasing it.

dashpole avatar Feb 29 '24 15:02 dashpole

we can always merge it behind an alpha feature-gate to give time for testing. That way you don't need to keep rebasing it.

Sounds great to me!

jmcarp avatar Feb 29 '24 19:02 jmcarp

To my surprise the rebase went smoothly. I'm going to give it a smoke test and try adding a feature-gate.

krajorama avatar Mar 01 '24 18:03 krajorama

I've done some testing and found that I can now set ZeroThreshold in the otel model, so that's now implemented. I also found a bug where the created timestamp was not set correctly from the _created metric.

Also ran into an issue with exemplars:

2024-03-16T20:03:44.340+0100	error	exporterhelper/queue_sender.go:97	Exporting failed.
Dropping data.	{"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite",
"error": "Permanent error: Permanent error: Permanent error: remote write returned HTTP status 400 Bad Request;
err = %!w(<nil>): failed pushing to ingester 5444ebe325cd: user=anonymous: err: out of order exemplar.
timestamp=2024-03-16T19:02:51.834Z, series=cortex_request_duration_seconds{container=\"mimir-2\",
instance=\"mimir-2:8001\", job=\"mimir-2\", method=\"gRPC\", route=\"/cortex.Inges", "dropped_items": 883}

This is similar to https://github.com/prometheus/prometheus/pull/13021, where the fix was to ensure that exemplars have a timestamp (afaik otel already ensures this) and that timestamps are in ascending order.
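A minimal sketch of that kind of fix, using a stand-in Exemplar type with millisecond timestamps (the real exporter works on the remote-write exemplar protobuf type):

```go
package main

import (
	"fmt"
	"sort"
)

// Exemplar is a stand-in for the remote-write exemplar type; only the
// timestamp matters for this sketch.
type Exemplar struct {
	Value       float64
	TimestampMs int64
}

// sortExemplars orders exemplars by timestamp so that a remote-write
// receiver does not reject them as out of order.
func sortExemplars(es []Exemplar) {
	sort.Slice(es, func(i, j int) bool {
		return es[i].TimestampMs < es[j].TimestampMs
	})
}

func main() {
	es := []Exemplar{
		{Value: 1, TimestampMs: 300},
		{Value: 2, TimestampMs: 100},
		{Value: 3, TimestampMs: 200},
	}
	sortExemplars(es)
	for _, e := range es {
		fmt.Println(e.TimestampMs) // 100, 200, 300
	}
}
```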

krajorama avatar Mar 16 '24 19:03 krajorama