Workload Identity Federation only working with uniform bucket level access?

Open Dadavan opened this issue 1 year ago • 1 comments

We are running Tempo on GKE and using a GCS bucket with uniform bucket-level access set to 'false' as the storage backend. We have had this setup for quite some time and it has been running without problems. Recently we switched from the "old" form of Workload Identity Federation (creating both a Kubernetes and an IAM service account, linking them using an annotation on the Kubernetes account, and granting the necessary roles to the IAM account) to the updated form of Workload Identity Federation (see here), which grants Kubernetes principals permissions on GCP resources directly, without the need to link them to IAM service accounts.

To allow Tempo to access the bucket, we have given the following principal the storage.objectAdmin role on the bucket, as well as permission to list all buckets in the project:

principal://iam.googleapis.com/projects/XXX/locations/global/workloadIdentityPools/XXX.svc.id.goog/subject/ns/tracing/sa/tempo
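
For illustration, a grant like the one described above would appear roughly as follows in the bucket's IAM policy when it is rendered as YAML (for example by `gcloud storage buckets get-iam-policy`). This is only a sketch and not output from the reporter's project: the `XXX` placeholders are kept exactly as they appear in the issue, the `etag` field is omitted, and the separate project-level permission to list buckets is not shown.

```yaml
# Sketch of a bucket-level IAM policy binding (assumed layout, not real output).
# The member is the Kubernetes principal from the issue; XXX values are elided there too.
bindings:
- members:
  - principal://iam.googleapis.com/projects/XXX/locations/global/workloadIdentityPools/XXX.svc.id.goog/subject/ns/tracing/sa/tempo
  role: roles/storage.objectAdmin
```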

We are now seeing errors such as the following every few minutes in the logs:

level=error ts=2024-04-10T09:27:41.032325318Z caller=tempodb.go:462 msg="failed to poll blocklist. using previously polled lists" err="googleapi: got HTTP response code 412 with body: <?xml version='1.0' encoding='UTF-8'?><Error><Code>PreconditionFailed</Code><Message>The operation requires that Uniform Bucket Level Access be enabled.</Message><Details>The type of authentication token used for this request requires that Uniform Bucket Level Access be enabled.</Details></Error>"

And also:

level=error ts=2024-04-09T16:01:09.385524514Z caller=poller.go:195 msg="failed to write tenant index" tenant=single-tenant err="googleapi: Error 412: The type of authentication token used for this request requires that Uniform Bucket Level Access be enabled., conditionNotMet"

We had no issues whatsoever when using the previous form of Workload Identity, and it's not documented anywhere that this type of authorization requires uniform bucket-level access to be enabled. Is this an issue with Tempo? Or a general issue with how the updated Workload Identity Federation works?

Steps to reproduce the behavior:

  1. Run Tempo (2.0.1, but I also tried 2.4.1 and got the same problem) on Kubernetes with a GCS bucket storage backend and Workload Identity enabled on the cluster.
  2. Give the tempo service account principal permissions on the GCS bucket.
  3. Wait for Tempo to start up and view the log.

Expected Behaviour: Tempo reports no errors regarding the GCS bucket.

Environment:

  • Infrastructure: GKE
  • Deployment tool: Custom Helm Chart

Additional Context:

tempo.yaml:

target: scalable-single-binary

multitenancy_enabled: false

server:
  http_listen_port: 3100
  log_level: info

distributor:
  log_received_spans:
    enabled: false # for debugging only, should be set to false on production

  receivers:
    jaeger:
      protocols:
        thrift_compact:
          endpoint: 0.0.0.0:6831
        thrift_binary:
          endpoint: 0.0.0.0:6832
        grpc:
          endpoint: 0.0.0.0:14250
        thrift_http:
          endpoint: 0.0.0.0:14268

ingester:
  lifecycler:
    heartbeat_period: 100ms
    ring:
      kvstore:
        store: memberlist

compactor:
  ring:
    kvstore:
      store: memberlist
  compaction:
    compacted_block_retention: 24h

memberlist:
  abort_if_cluster_join_fails: false
  bind_port: 7946
  join_members:
    - tempo-headless.tracing.svc.cluster.local:7946

storage:
  trace:
    backend: gcs
    gcs:
      bucket_name: XXX-tempo
    local:
      path: /var/tempo/traces
    wal:
      path: /var/tempo/wal

query_frontend:
  search:
    max_duration: 720h1m0s

overrides:
  max_search_bytes_per_trace: 100000

querier:
  frontend_worker:
    frontend_address: tempo-headless.tracing.svc.cluster.local:9095

Dadavan avatar Apr 10 '24 09:04 Dadavan

Thanks for the report. I honestly don't know enough about Uniform Bucket Level Access to comment directly, but let's dig in a bit.

Let's focus on the second of the two errors. It's logged here:

https://github.com/grafana/tempo/blob/0dd31db8137094ad692e75ba15365e4e0039ae47/tempodb/blocklist/poller.go#L242

Which is passed through here:

https://github.com/grafana/tempo/blob/0dd31db8137094ad692e75ba15365e4e0039ae47/tempodb/backend/raw.go#L108

And ultimately lands on this call:

https://github.com/grafana/tempo/blob/0dd31db8137094ad692e75ba15365e4e0039ae47/tempodb/backend/gcs/gcs.go#L107

I'm not sure what underlying GCS call that maps to, but I believe we're using the standard SDK in an appropriate manner. What's interesting is that Tempo uses this same call all the time to write block data. Are you seeing any issues with your ingesters or compactors flushing/creating blocks?

The main difference I can think of is where in the object hierarchy the objects are written:

is broken:

gs://<bucket>/<tenant>/index.json.gz

seemingly works:

gs://<bucket>/<tenant>/<block guid>/<various block files>

joe-elliott avatar Apr 15 '24 19:04 joe-elliott

Google have updated their docs and now explain why this is happening. Apparently, enabling 'Uniform Bucket Level Access' is a requirement when using IAM principals for Workload Identity Federation. The solution they suggest is the one I described in the issue: using the 'old' method of linking a k8s service account to an IAM service account. See here: https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity#kubernetes-sa-to-iam
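
As a rough sketch of that 'old' linking method, the Kubernetes side is a ServiceAccount carrying the `iam.gke.io/gcp-service-account` annotation. The `tempo` name and `tracing` namespace are taken from the principal quoted in the issue; `IAM_SA_NAME` and `PROJECT_ID` are hypothetical placeholders, not values from this thread.

```yaml
# Minimal sketch of the Kubernetes ServiceAccount used for the linked setup.
# IAM_SA_NAME and PROJECT_ID are placeholders and must be replaced.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: tempo
  namespace: tracing
  annotations:
    # Points the Kubernetes service account at the IAM service account to impersonate.
    iam.gke.io/gcp-service-account: IAM_SA_NAME@PROJECT_ID.iam.gserviceaccount.com
```

The annotation alone is not sufficient: the IAM service account also needs a `roles/iam.workloadIdentityUser` binding for the member `serviceAccount:PROJECT_ID.svc.id.goog[tracing/tempo]`, and the bucket roles are then granted to the IAM service account rather than to the Kubernetes principal.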

@joe-elliott Thanks for your assistance and time!

Dadavan avatar May 15 '24 08:05 Dadavan

(Kind of off-topic, but this was the most relevant issue I could find)

@joe-elliott Is it possible to use Workload Identity Federation together with the tempo-operator? The docs seem to indicate that the required secret must contain the bucket name as well as the service account key.json, which is not needed with WIF. Can I leave that field empty, or how should one approach this?

markustoivonen avatar Jun 19 '24 10:06 markustoivonen

No idea. I would file an issue or start a discussion here: https://github.com/grafana/tempo-operator

joe-elliott avatar Jun 20 '24 12:06 joe-elliott