opentelemetry-specification
opentelemetry-specification copied to clipboard
Reconsider modifying metric name with OTLP unit
The prometheus conversion currently requires metric names to be modified with the OTLP unit
Additionally, the unit MUST be added as a suffix to the metric name, and SHOULD be converted to base units recommended by OpenMetrics when possible
I think this should be reconsidered, I suspect users don't expect their metric names to not actually be the names, and e.g. when configuring grafana http_server_duration seems expected vs http_server_duration_ms. Adding UNIT to the help text makes sense though.
@open-telemetry/wg-prometheus
The prometheus metric would ideally be named http_server_duration_seconds, which is close to the expected naming for that metric in the prometheus ecosystem (http_request_duration_seconds, which is used in the prometheus metric naming examples).
We could consider making it a "MUST, by default", rather than just a "MUST", to allow users the option of excluding the unit. According to OpenMetrics, if we add a unit comment (which we really should IMO), we MUST add the unit as the metric suffix as well.
Thanks I was looking for the text in OpenMetrics but couldn't find it. It looks like it should be added then, wonder if OTel's default of ms vs seconds will cause issues for users though
Yeah, a metric ending in _ms would be strange to prometheus users. Even spelling out _milliseconds would be better than _ms. Ideally, exporters would convert metrics in ms to seconds in their Prometheus exporter. That is what I intended by SHOULD be converted to base units recommended by OpenMetrics in the spec. Do you think that is feasible?
Do you think that is feasible?
We'd have to maintain a table of origin units, target units, and conversion functions. Could probably have a smallish library of common situations that would cover the majority of use cases. Ideally this would be spec'd.
Histograms are a bit tricky. Suppose you have an instrument aggregated as explicit bucket with the default bucket boundaries 0, 5, 10, 25, 50, 75, 100, 250, 500, 1000. Data was originally recorded in milliseconds and we're trying to convert it to seconds. One problem is that the bucket boundaries were chosen for measurements in milliseconds. Do you divide all the bucket boundaries by 1000? I.e. 0, .005, .01, .025, .05, .075, .1 .25, .5, 1? The alternative is to keep the boundaries the same but shuffle around the bucket counts. However, this isn't possible without access to the original measurements. E.g. values > 1000 could be mapped into any of the lower buckets, or even stay in the > 1000 bucket.
Scaling all sum + bucket boundaries seems like the best in the histogram case. The only case where I don't think it works correctly is for exponential histograms. IIUC, bucket boundaries are powers of 2, so we can change the scale by 1024 (2^10), but not exactly 1000.
It's not clear to me how exponential histograms would be converted to prometheus in general. Are exponential histograms allowed to be exported via prometheus? If yes, would the exponential bucket boundaries just be computed and converted explicit buckets? If yes, can we just scale the boundaries ignoring the fact that they were originally an exponential scale?
One of the prometheus maintainers is working on support for exponential histograms in prometheus, but i'm not sure what the plan is or how far along they are. For now, the Prometheus <-> OTLP spec says to drop exponential histogram points, so you are correct that it isn't an issue right now.
Hi, we expect the metrics to come with an unit: http_server_duration_seconds for example. This is what someone using Prometheus is accustomed to today and we even have a linter to make sure: curl --silent [http://example.com:9090/metrics/] | promtool check metrics
Its a bit unfortunate that OTel is using milliseconds and Prometheus is using seconds, and one of the main reasons we decided to mandate base units is to avoid conversions when doing math.
With these two cases above, I think it makes more sense to include the unit as it will become very obvious what it is when writing the query and obvious if metrics with two different units are being used together.
Regarding exponential histograms, I think we'll have to just live with the inconsistency between milliseconds and seconds.
I've written a proposal related to this, which proposes not adding unit to the metric name when translating from OTel to prometheus: https://docs.google.com/document/d/16Wo-QHZLcKO0uFx97HPUJ-ZMFXSPviQ7Oc4-bShgcto/edit?usp=sharing
Feedback is welcome. It would be especially helpful if you have any feedback from users one way or another.
I'm hoping to discuss it at the next Prometheus dev summit on 2/22.