Add guidance against using metrics API in Web/Mobile client side instrumentation
What are you trying to achieve?
We want to add some language to the spec regarding the use of the Metrics API in client-side instrumentation, specifically recommending against using it, for the reasons outlined below. If it's acceptable for this to be in the spec, we would like some guidance on where it could go.
Update: Client-side instrumentation here refers to instrumentation that runs in the context of apps of a single end user, e.g., web and mobile apps.
1. Client-side instrumentation typically runs only in the context of a single user, so the measurements capture very limited data. The measurements need to be collected and aggregated across multiple clients/users, which can only be done on the server/receiver side. Because of that, we think metrics are a server-side concern and not something to do on the client.
2. Data points are best collected via Spans and Events on the client side, and transformed into metrics on the server side. This enables capturing additional details such as time of occurrence and other attributes, which is not possible with metrics.
3. The current Metrics API is too complex for most client-side use cases, where a much simpler API is often sufficient. Additionally, adding the Metrics API and SDK to client-side agents increases the bundle size, which is generally a concern for client environments.
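To make the second point above more concrete, here is a minimal sketch of what "transformed into metrics on the server side" could look like. It is plain Python and does not use any OpenTelemetry API; the `ClientEvent` type, the `make_event` and `aggregate_counts` helpers, the `app.crash` event name, and the attribute keys are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ClientEvent:
    """Hypothetical event as a client might report it (not an OTel type)."""
    name: str
    timestamp_ms: int
    attributes: tuple  # sorted (key, value) pairs, so the event is hashable


def make_event(name, timestamp_ms, **attributes):
    return ClientEvent(name, timestamp_ms, tuple(sorted(attributes.items())))


def aggregate_counts(events, keep_keys):
    """Server side: count events per (name, retained attributes).

    Attributes not listed in keep_keys are dropped at aggregation time,
    so the choice of dimensions is made by the receiver, not the client.
    """
    counts = Counter()
    for event in events:
        dims = tuple((k, v) for k, v in event.attributes if k in keep_keys)
        counts[(event.name, dims)] += 1
    return counts


# Events from three different (hypothetical) client sessions.
events = [
    make_event("app.crash", 1000, os_version="14", session_id="a"),
    make_event("app.crash", 2000, os_version="14", session_id="b"),
    make_event("app.crash", 3000, os_version="15", session_id="c"),
]

# The receiver decides to keep only os_version as a metric dimension.
crashes_by_os = aggregate_counts(events, keep_keys={"os_version"})
```

In this sketch the raw events still carry their timestamps and the high-cardinality `session_id`, while the derived counter keeps only the dimension the receiver chose.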
This is something we have been following as a principle in client instrumentations already. However, this doesn't seem to be well known and came up recently in https://github.com/open-telemetry/opentelemetry-android/pull/1064 - this is the reason we want to document the guidance somewhere in the spec.
I agree with the sentiment -- it would be nice to have some agreed-upon spec language that dissuades users from attempting to use metrics on the client side, and to have a concise explanation about why it's not a great fit. A couple of notes on the above:
First, I think #1 above should reference high-cardinality. Client apps, by their very nature, run on lots of devices in lots of diverse environments, and the metric dimensions are almost always inherently high-cardinality, and for most timeseries databases this is a problem. If we make them low(er) cardinality (like by dropping dimensions), then metrics also lose some of their value. For example, if you drop an operating system version, then you can't compare performance (or whatever) between operating system versions, which means that a user only has access to a blended/aggregated signal.
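As a rough illustration of the cardinality concern, the upper bound on the number of distinct time series is the product of the number of values each dimension can take. The sketch below uses made-up dimension sizes (they are assumptions for illustration, not measurements), and `max_series` is a hypothetical helper.

```python
from math import prod


def max_series(dimension_sizes):
    """Upper bound on distinct time series: product of per-dimension value counts."""
    return prod(dimension_sizes.values())


# Illustrative, invented numbers of distinct values per dimension.
dims = {"os_version": 30, "device_model": 500, "app_version": 20, "country": 100}

full = max_series(dims)

# Dropping os_version shrinks the bound, but OS versions can no longer be compared.
reduced = max_series({k: v for k, v in dims.items() if k != "os_version"})
```

With these invented sizes the bound drops by a factor of 30 when `os_version` is removed, which is exactly the trade-off described above: fewer series, but only a blended signal across OS versions.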
I don't really agree with #3, that the Metrics API is too complex. I do agree that on the client side, many simple things (like counts of things) are easily reported individually or just as an attribute on an event, which can later be aggregated server-side. Thanks @scheler!
This guidance seems very specific to Android. It does not seem applicable to clients of Golang applications. I question its applicability to OpenTelemetry languages outside of mobile applications.
@MrAlias I updated the original message with this clarification -
Client-side instrumentation here refers to instrumentation that runs in the context of apps of a single end user, e.g., web and mobile apps.
@open-telemetry/spec-sponsors can you take a look please and help to move this forward
OTel Metrics specification is not specifically written for server-side. It's quite general purpose. I don't see why the spec should have this warning at all.
https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/supplementary-guidelines.md can be a good place to call out any warnings.
I think having an explicit warning like this makes it clear that given the current state of the OTel Metrics spec and tooling, for user-facing client applications like mobile or web apps, it should not be used.
The issues of cardinality aside (which is a big issue, but I think most are aware of it), the reason this is useful is that even if you can cut the cardinality down, the type of server-side aggregation that erases the time dimension makes the collected telemetry not very useful most of the time for mobile and web apps.
That is, the total number of occurrences of a thing during an arbitrary time period (one that would make sense for aggregation) doesn't tell us very much; knowing whether a glut of X happened before or after another thing on a particular device would. Basically, the aggregation window needs to be local to the client so we can contextualize it with other things that are happening on the same device. It also needs to be small enough that we can properly correlate with those other things. Knowing that X frames were dropped within 5 minutes isn't useful.
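A small sketch of the difference, using invented data: the event names, the device id, the timestamps, and both helper functions are hypothetical. The first helper gives the kind of count a 5-minute aggregation window would produce; the second uses per-device event ordering to find dropped frames that closely follow a screen load.

```python
def count_in_window(events, name, start_ms, end_ms):
    """What a windowed metric retains: how many, but not when or after what."""
    return sum(
        1 for e in events
        if e["name"] == name and start_ms <= e["ts_ms"] < end_ms
    )


def following_within(events, anchor_name, target_name, max_gap_ms):
    """Target events that occur at most max_gap_ms after an anchor event on the same device."""
    anchors = [e for e in events if e["name"] == anchor_name]
    return [
        e for e in events
        if e["name"] == target_name
        and any(
            a["device"] == e["device"]
            and 0 <= e["ts_ms"] - a["ts_ms"] <= max_gap_ms
            for a in anchors
        )
    ]


# Invented timeline for a single device.
events = [
    {"device": "d1", "name": "screen_load", "ts_ms": 10_000},
    {"device": "d1", "name": "frame_dropped", "ts_ms": 10_050},
    {"device": "d1", "name": "frame_dropped", "ts_ms": 10_120},
    {"device": "d1", "name": "frame_dropped", "ts_ms": 200_000},
]

# Aggregated view over a 5-minute window: just a total.
total_dropped = count_in_window(events, "frame_dropped", 0, 300_000)

# Event view: which dropped frames happened within 500 ms after a screen load.
dropped_after_load = following_within(events, "screen_load", "frame_dropped", 500)
```

The total alone cannot distinguish the two frames dropped right after the screen load from the unrelated one minutes later; the per-device timeline can.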
Nipping in the bud the sensible intention of modelling metrics-like telemetry as an OTel Metric, by explicitly stating its inappropriateness for the mobile and web use case, would be extremely useful.
@scheler thanks for putting this together. I would probably omit the 3rd point - I think the API is workable if it weren't for the cardinality and aggregation window issues. Anything that is elevated to a warning at the spec level should make the most obvious points why this is not workable, not just call out nice to haves like a simplified API.
I don't think the spec is the right place for guidance like that. The target audience of the spec is primarily SDK implementors, and secondarily perhaps some end users deeply interested in Otel. A more typical Otel user will likely interact with language SDK docs and Getting Started pages. I would advise putting such guidance there.
As for the content of the guidance, I don't think blanket advice against using metrics on the client side is the best approach. It would likely be more useful if it were more nuanced and explained why (e.g. cardinality), so that users can reason from first principles and apply the guidance to their particular use case.
I concur. This sort of documentation should be closer to implementers. Just because web/mobile clients don't currently implement otel metrics well doesn't mean we should warn people off in the spec.
The present challenge is that reviewers bounce into client semconv event PRs and see numeric attribute values and ask "Shouldn't this be a metric"?
Where is a better location than the spec then? Something that covers all of mobile and web would be great...but where? Just copy/paste across android/ios/web?
Just because web/mobile clients don't currently implement otel metrics well doesn't mean we should warn people off in the spec.
And we want them to implement it more, so we should serve them.
I am in favor of adding this guidance, it's a value add and as @cijothomas called out we have guidelines in the spec already.
As for the content of the guidance I don't think a blanket advice against using metrics on client side is the best approach.
Isn't that implied by a "guideline"? For me this reads more like in most of the cases (probably yours too!) don't use metrics on client side, here's a list of reasons, and here's also a list of exceptions, e.g. a long running client or a situation where there is only a limited and well defined list of clients, etc. (just making up some things)
I am in favor of adding this guidance, it's a value add and as @cijothomas called out we have guidelines in the spec already.
I don't think what this issue suggests is the same class of guidelines that we have in https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/supplementary-guidelines.md
Supplementary Guidelines contains generic advice, applicable to any domain where you want to use Otel metrics. If you can articulate what this issue is looking for in similarly generic terms, then I don't mind having it in supplementary-guidelines.md. But if the text is going to be very specific about "client metrics with high cardinality", then I don't think it belongs in that document; it will be out of place.
adding this guidance in the semantic conventions repo is another option
@breedx-splk suggested we add a top-level page somewhere on the opentelemetry.io website where we can put guidelines for how client-side apps should and shouldn't use various aspects of OTel.
I'm going to work on that today so we can get at least something documented in public, and we can decide what references to it to include in the spec repo, and where.
I think our ultimate goal is to dissuade instrumentation meant for client-side apps from using OTel Metrics as a signal. If the spec isn't the place to put this (or at least expound on it in detail), then we can find some other places that would serve that purpose.
Supplementary Guidelines contains generic advice, applicable to any domain where you want to use Otel metrics. If you can articulate what this issue is looking for in similarly generic terms, then I don't mind having it in supplementary-guidelines.md.
I think that this can be generalized, since there are non-client side situations where the same arguments apply, i.e. short-running server-side applications like functions as a service or other short-lived or single-user situations.
There are a few scenarios where client-side metrics could provide meaningful operational value:
• Point-of-sale systems: A retail environment may have only a handful of cash registers, each running both client and server components on shared hardware. If one begins exhibiting degraded performance, metrics could help isolate resource contention or configuration issues quickly.
• Limited fleets of low-end devices: I recently assisted an Afghan entrepreneur deploying a Flutter app that empowers women (especially) through mobile top-up sales. In deployments like this, which are common in developing regions, a small number of inexpensive Android devices are used intensively under constrained network and hardware conditions. Basic metrics on CPU, memory, and latency can be critical for determining which devices can handle the workload reliably.
In cases like these, would client-side metrics not be the most appropriate mechanism? Are there preferred alternatives for capturing this kind of localized, small-scale telemetry?
Since implementers can control what data is collected and transmitted, cardinality can remain intentionally low—especially when the device set is fixed and limited.
I’d be happy to explore this further. I’ve been focused recently on the Dart OTel SDK implementation but am trying to get back into contributing to the specification discussions.