tempo
Workload Identity Federation only working with uniform bucket level access?
We are running Tempo on GKE with a GCS bucket as the storage backend; the bucket has uniform bucket level access set to 'false'. We have had this setup for quite some time and it has been running without problems. Recently we switched from the "old" Workload Identity Federation (creating both a Kubernetes and an IAM service account, linking them using an annotation on the Kubernetes account, and granting the necessary roles to the IAM account) to the updated form of Workload Identity Federation (see here), which grants Kubernetes principals permissions on GCP resources directly, without also linking them to IAM service accounts.
To allow Tempo to access the bucket we have given the following principal the storage.objectAdmin role on the bucket and also permissions to list all buckets in the project:
principal://iam.googleapis.com/projects/XXX/locations/global/workloadIdentityPools/XXX.svc.id.goog/subject/ns/tracing/sa/tempo
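For readers reconstructing this identifier for their own cluster, here is a minimal sketch of how the string is assembled. The helper name `wif_principal` and the sample values are hypothetical. The issue redacts both `XXX` placeholders, so treating the first as the project number and the second as the project ID is an assumption based on the GKE documentation, not something the issue states.

```python
def wif_principal(project_number: str, project_id: str,
                  namespace: str, service_account: str) -> str:
    """Build the IAM principal identifier for a Kubernetes service account.

    Assumption: the first redacted value is the numeric project number and
    the workload identity pool is named "<project_id>.svc.id.goog".
    """
    pool = f"{project_id}.svc.id.goog"
    return (
        f"principal://iam.googleapis.com/projects/{project_number}"
        f"/locations/global/workloadIdentityPools/{pool}"
        f"/subject/ns/{namespace}/sa/{service_account}"
    )


if __name__ == "__main__":
    # Hypothetical project values; namespace and account name match the issue.
    print(wif_principal("123456789", "my-project", "tracing", "tempo"))
```

The namespace (`tracing`) and service account (`tempo`) segments must match the Kubernetes service account that the Tempo pods actually run as.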
We are now seeing errors such as the following every few minutes in the logs:
level=error ts=2024-04-10T09:27:41.032325318Z caller=tempodb.go:462 msg="failed to poll blocklist. using previously polled lists" err="googleapi: got HTTP response code 412 with body: <?xml version='1.0' encoding='UTF-8'?><Error><Code>PreconditionFailed</Code><Message>The operation requires that Uniform Bucket Level Access be enabled.</Message><Details>The type of authentication token used for this request requires that Uniform Bucket Level Access be enabled.</Details></Error>"
And also:
level=error ts=2024-04-09T16:01:09.385524514Z caller=poller.go:195 msg="failed to write tenant index" tenant=single-tenant err="googleapi: Error 412: The type of authentication token used for this request requires that Uniform Bucket Level Access be enabled., conditionNotMet"
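To see which Tempo operations hit this error, the log lines can be filtered by their `err` field. The following is a rough sketch only: `parse_logfmt` and `is_ubla_error` are hypothetical helpers, and the parser handles just the simple `key=value` and `key="quoted value"` shapes seen in the two lines above, not the full logfmt format.

```python
import re

# key=value pairs; the value is either a double-quoted string (backslash
# escapes allowed) or a bare token containing no whitespace.
_PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_logfmt(line: str) -> dict:
    """Split a logfmt-style line into a dict of field name -> value."""
    fields = {}
    for key, raw in _PAIR.findall(line):
        # Strip the surrounding quotes from quoted values.
        if len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
            raw = raw[1:-1]
        fields[key] = raw
    return fields


def is_ubla_error(line: str) -> bool:
    """True if the line is an error mentioning Uniform Bucket Level Access."""
    fields = parse_logfmt(line)
    return (fields.get("level") == "error"
            and "Uniform Bucket Level Access" in fields.get("err", ""))


if __name__ == "__main__":
    import sys
    # Example: kubectl logs <tempo pod> | python this_script.py
    for log_line in sys.stdin:
        if is_ubla_error(log_line):
            f = parse_logfmt(log_line)
            print(f.get("caller"), f.get("msg"))
```

Grouping the matches by `caller` and `msg` shows whether only the blocklist poller is affected or whether block flushes fail as well.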
We had no issues whatsoever when using the previous form of Workload Identity and it's not documented anywhere that in order to use this type of authorization you need to enable uniform bucket level access. Is this an issue with Tempo? Or a general issue with how the updated Workload Identity Federation works?
Steps to reproduce the behavior:
- Run Tempo (2.0.1; I also tried 2.4.1 and got the same problem) on Kubernetes with a GCS bucket storage backend and Workload Identity enabled on the cluster.
- Give the Tempo service account principal permissions on the GCS bucket.
- Wait for Tempo to start up and view the log
Expected Behaviour: Tempo reports no errors regarding the GCS bucket.
Environment:
- Infrastructure: GKE
- Deployment tool: Custom Helm Chart
Additional Context:
tempo.yaml:
target: scalable-single-binary
multitenancy_enabled: false
server:
  http_listen_port: 3100
  log_level: info
distributor:
  log_received_spans:
    enabled: false # for debugging only, should be set to false on production
  receivers:
    jaeger:
      protocols:
        thrift_compact:
          endpoint: 0.0.0.0:6831
        thrift_binary:
          endpoint: 0.0.0.0:6832
        grpc:
          endpoint: 0.0.0.0:14250
        thrift_http:
          endpoint: 0.0.0.0:14268
ingester:
  lifecycler:
    heartbeat_period: 100ms
    ring:
      kvstore:
        store: memberlist
compactor:
  ring:
    kvstore:
      store: memberlist
  compaction:
    compacted_block_retention: 24h
memberlist:
  abort_if_cluster_join_fails: false
  bind_port: 7946
  join_members:
    - tempo-headless.tracing.svc.cluster.local:7946
storage:
  trace:
    backend: gcs
    gcs:
      bucket_name: XXX-tempo
    local:
      path: /var/tempo/traces
    wal:
      path: /var/tempo/wal
query_frontend:
  search:
    max_duration: 720h1m0s
overrides:
  max_search_bytes_per_trace: 100000
querier:
  frontend_worker:
    frontend_address: tempo-headless.tracing.svc.cluster.local:9095
Thanks for the report. I honestly don't know the details of Uniform Bucket Access to comment directly, but let's dig in a bit.
Let's focus on the second of the two errors. It's logged here:
https://github.com/grafana/tempo/blob/0dd31db8137094ad692e75ba15365e4e0039ae47/tempodb/blocklist/poller.go#L242
Which is passed through here:
https://github.com/grafana/tempo/blob/0dd31db8137094ad692e75ba15365e4e0039ae47/tempodb/backend/raw.go#L108
And ultimately lands on this call:
https://github.com/grafana/tempo/blob/0dd31db8137094ad692e75ba15365e4e0039ae47/tempodb/backend/gcs/gcs.go#L107
I'm not sure what underlying GCS call that maps to, but I believe we're using the standard SDK in an appropriate manner. What's interesting is Tempo writes data all the time using this call for the blocks. Are you seeing any issues with your ingesters or compactors flushing/creating blocks?
The main difference I can think of is where in the object hierarchy the objects are written:
is broken:
gs://<bucket>/<tenant>/index.json.gz
seemingly works:
gs://<bucket>/<tenant>/<block guid>/<various block files>
Google have updated their docs and now explain why this is happening. Apparently setting 'Uniform Bucket Level Access' to true is a requirement when using IAM principals for Workload Identity Federation. The solution they suggest is as I described in the issue - using the 'old' method of linking a k8s account to an IAM account. See here: https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity#kubernetes-sa-to-iam
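For anyone switching back to the linked-account method, here is a minimal sketch of the two pieces involved: the annotated Kubernetes service account, and the IAM member string that needs `roles/iam.workloadIdentityUser` on the IAM service account. The helper names, the IAM account name `tempo-gcs`, and the project ID `my-project` are hypothetical. The annotation key and member format follow the linked GKE documentation.

```python
import json


def linked_service_account(namespace: str, ksa_name: str,
                           gsa_name: str, project_id: str) -> dict:
    """Kubernetes ServiceAccount manifest annotated with its IAM service account."""
    gsa_email = f"{gsa_name}@{project_id}.iam.gserviceaccount.com"
    return {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {
            "name": ksa_name,
            "namespace": namespace,
            # Annotation GKE reads to map the Kubernetes account to the IAM one.
            "annotations": {"iam.gke.io/gcp-service-account": gsa_email},
        },
    }


def impersonation_member(project_id: str, namespace: str, ksa_name: str) -> str:
    """IAM member to grant roles/iam.workloadIdentityUser on the IAM account."""
    return f"serviceAccount:{project_id}.svc.id.goog[{namespace}/{ksa_name}]"


if __name__ == "__main__":
    # Hypothetical IAM account and project; namespace/name match the issue.
    manifest = linked_service_account("tracing", "tempo", "tempo-gcs", "my-project")
    print(json.dumps(manifest, indent=2))  # JSON is accepted by `kubectl apply -f`
    print(impersonation_member("my-project", "tracing", "tempo"))
```

With this method the bucket roles (for example `storage.objectAdmin`) are granted to the IAM service account rather than to the Kubernetes principal, which is why it keeps working on buckets without uniform bucket level access.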
@joe-elliott Thanks for your assistance and time!
(Kind of off-topic, but this was the most relevant issue I could find)
@joe-elliott Is it possible to use Workload Identity Federation together with the tempo-operator? The docs seem to indicate that the required secret must contain the bucket name as well as the service account key.json, which is not needed with Workload Identity Federation. Can I leave that field empty, or how should I approach this?
No idea. I would file an issue or start a discussion here: https://github.com/grafana/tempo-operator