[Bug] SDK metrics do not include tags added with "telemetry.global_tags"
What are you really trying to do?
Add custom tags to SDK metrics
Describe the bug
SDK metrics no longer include tags added with telemetry.global_tags, even though that appears to be the expected behavior per https://python.temporal.io/temporalio.runtime.TelemetryConfig.html#global_tags
Running the same code with SDK version 1.3 works as expected.
Minimal Reproduction
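A minimal sketch of the kind of setup being described, assuming Prometheus export; the bind address, server address, and tag values are illustrative:

```python
import asyncio

from temporalio.client import Client
from temporalio.runtime import PrometheusConfig, Runtime, TelemetryConfig


async def main():
    # Configure a runtime whose metrics are exported via Prometheus and
    # tagged with custom global tags.
    runtime = Runtime(
        telemetry=TelemetryConfig(
            metrics=PrometheusConfig(bind_address="127.0.0.1:9464"),
            global_tags={"environment": "staging", "team": "payments"},
        )
    )

    # Any client/worker created with this runtime emits SDK metrics.
    client = await Client.connect("localhost:7233", runtime=runtime)

    # Per the report above: with 1.3 the global tags appear as labels on the
    # metrics scraped from 127.0.0.1:9464; with later versions they do not.
    await asyncio.sleep(60)


if __name__ == "__main__":
    asyncio.run(main())
```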
Environment/Versions
- OS and processor: [e.g. M1 Mac, x86 Windows, Linux]
- Temporal Version: [e.g. 1.14.0?] and/or SDK version
- Are you using Docker or Kubernetes or building Temporal from source?
Additional context
To clarify, issue #380 inadvertently left global_tags unset on the Rust side. We need to make sure it is set, and either add a test or alter test_different_runtimes to confirm this setting (and maybe any others we want coverage for).
The issue here is that we use OTel's Prometheus exporting, which takes the approach described at https://prometheus.io/docs/guides/opentelemetry/#including-resource-attributes-at-query-time. Namely, the global_tags end up on target_info, which you can combine with the rest at query time. I have opened https://github.com/temporalio/sdk-core/issues/882 to have us put the labels on every metric by default (with an option to restore today's behavior).
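For illustration, the query-time combination looks roughly like this (a sketch using the Prometheus HTTP API; the metric name, tag name, and Prometheus URL are assumptions):

```python
import requests  # third-party HTTP client

# Prometheus server location (assumed).
PROM_URL = "http://localhost:9090/api/v1/query"

# Join an SDK metric with target_info so that global_tags (currently exposed
# as target_info labels) show up on the resulting series. "temporal_request"
# and "my_tag" are illustrative names.
QUERY = (
    "temporal_request"
    " * on (job, instance) group_left (my_tag)"
    " target_info"
)

resp = requests.get(PROM_URL, params={"query": QUERY})
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    print(series["metric"], series["value"])
```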
Note, just for linking, this is caused by an upstream OTel Rust issue (our Rust core relies on this): https://github.com/open-telemetry/opentelemetry-rust/issues/2858
@cretz, what was the original intended behaviour of global_tags? It could be a bit dangerous to change the default behaviour to updating metric labels now, as it could catch people off guard if they rely on the current behaviour.
For a long time on my project, I've been using global_tags to set labels like service_name, service_namespace, etc. But I had a downstream Alloy collector that added these labels to all metrics (not just the Temporal SDK's) from the OTEL resource by default, if they weren't already set. So I never noticed the regression in behaviour: when global_tags changed to update the OTEL resource / Prometheus target_info in the SDK, the labels were simply set on the metrics by the collector instead...
Since then, I've added an ECS resource detector to the app so I can attach OTEL resource attributes like aws.ecs.task.id, aws.ecs.cluster.arn, etc. to all telemetry, and I've been using global_tags to do this for the SDK metrics, assuming this was always the way it was intended to work (since there isn't any other option to set an OTEL resource).
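Roughly, that pattern looks like this (a sketch assuming the opentelemetry-sdk-extension-aws ECS detector; the detector and the string coercion are my own usage, not part of the Temporal API):

```python
from opentelemetry.sdk.extension.aws.resource.ecs import AwsEcsResourceDetector
from temporalio.runtime import PrometheusConfig, Runtime, TelemetryConfig

# Detect ECS resource attributes (aws.ecs.task.id, aws.ecs.cluster.arn, ...).
resource = AwsEcsResourceDetector().detect()

# Reuse the detected attributes as SDK metric tags. global_tags expects
# str -> str, so non-string attribute values are coerced here.
runtime = Runtime(
    telemetry=TelemetryConfig(
        metrics=PrometheusConfig(bind_address="0.0.0.0:9464"),
        global_tags={str(k): str(v) for k, v in resource.attributes.items()},
    )
)
```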
So I'd suggest, if possible, rather than having an option that toggles the behaviour of global_tags, it would be better to have two options: one to set resource attributes (maybe directly in OpenTelemetryConfig) and one to set global metric labels. There may be cases where people want both simultaneously.
Cheers 🙏🏻
The original intended behavior is that the tags would apply to all emitted metrics. I agree with you that the two separate options for OTel make good sense; we can add an item to track adding those options.
@gregbrowndev - with global_tags now applying to all metrics (something we didn't intend to undo), is that acceptable? Or are you asking for a new setting for, say, target_info_tags?
@cretz sorry if I was unclear. For me, if we only had one option, it would make more sense to be able to configure the OTEL resource attributes / Prometheus target_info via the TelemetryConfig, rather than adding common attributes to every metric. If we had both options, then that would be better ofc.
(This is probably quite biased from having an OTEL/Alloy collector set up, but...)
The latter is a convenience. It's quite easy to add metric labels (datapoint attributes in OTEL) anywhere outside of the application by reading them from the resource attributes (example). However, it'd be pretty horrible to configure the right Resource attributes if there weren't a direct way to do it in the SDK. You'd basically have to intercept the metrics after they've been emitted by the Rust SDK with another collector sidecar, so you could use a resource detector to fix them before sending them along.
However, maybe the opposite is true for Prometheus scraping job users. The difference is that OTEL is primarily push-based, so it's better to set the resource attributes at the source (I'm using the same Resource object from an ECS detector library within the Python app to configure metrics, logs, traces, and profiles, for example; the metrics are sent to Prometheus as well...). But Prometheus scraping is pull-based, and you'd typically need service discovery anyway to do it at scale. In k8s, you get the resource metadata built in with that (I think).
I might be wrong, but I think you can relabel metrics in the Prometheus scrape config using target_info, in the same way that the Alloy example does. And for me it would be nicer to have a single Resource object in Python that can configure all the telemetry the same way, not just Prometheus metrics.
Hope that helps
> For me, if we only had one option, it would make more sense to be able to configure the OTEL resource attributes / Prometheus target_info via the TelemetryConfig, rather than adding common attributes to every metric.
Unfortunately, many others saw the one option, global_tags, as meaning global and applying to every metric; that is how we had it at one point, it regressed, and we fixed the regression.
Can you be a bit more concrete on the specific ask here? Knowing that we cannot alter the global_tags behavior that people currently rely on, what other option might you need and what should its behavior be?