2x memory increase after upgrade
Describe the bug
After upgrading from grafana/tempo:1e8583d9 to grafana/tempo:1.4.1, memory usage increased from 4 GiB to 8 GiB. CPU usage also increased.
To Reproduce
Steps to reproduce the behavior: Helm chart grafana/tempo version 0.15.4 (Grafana Tempo single binary mode)
Helm values:

fullnameOverride: tempo
podAnnotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "3100"
  prometheus.io/path: "/metrics"
tempo:
  retention: 48h
  storage:
    trace:
      backend: local
  resources:
    requests:
      cpu: "2"
      memory: 8Gi
    limits:
      memory: 8Gi
persistence:
  enabled: true
  accessModes:
    - ReadWriteOnce
  size: 80Gi
Expected behavior
No memory or CPU increase after the upgrade.
Environment:
- Infrastructure: Kubernetes 1.22
- Deployment tool: Helm
Additional Context
Nothing specific; sometimes I see errors:
rpc error: code = FailedPrecondition desc = LIVE_TRACES_EXCEEDED: max live traces exceeded for tenant single-tenant: per-user traces limit (local: 10000 global: 0 actual local: 10000) exceeded
I will need to increase the live-traces limit, but I was hitting this limit with the old version too.
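For reference, a minimal sketch of raising that limit in the Tempo configuration (tempo.yaml), assuming the standard per-tenant overrides block; 50000 is an illustrative number, not a recommendation:

overrides:
  # max_traces_per_user defaults to 10000, which matches "local: 10000"
  # in the error above; the global limit (0) is disabled.
  max_traces_per_user: 50000  # illustrative value, size to your ingest volume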
The previous version of Tempo you were running is all the way back from December 2020: https://github.com/grafana/tempo/commit/1e8583d9a108496a35c235bb6a95ede860aff5b9. Since this version is 1.5 years old, it's hard to tell which changes increased memory and CPU usage. Did any of your config change?
I compared the configs of the old and new Helm charts. I'm using default values.
helm template test grafana/tempo --version=0.15.4
helm template test grafana/tempo --version=0.7.1
The rendered tempo.yaml is almost the same on both versions. Probably some defaults changed in Tempo itself, or there is a regression.
I believe Tempo search is enabled by default in the new version. Could you please check: are you able to search for traces in the Grafana UI? The elevated memory usage might be related to that.
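If search turns out to be the culprit, a minimal sketch of turning it off, assuming Tempo 1.x where search is gated behind a top-level flag in tempo.yaml (how you pass this through depends on your chart):

# Tempo 1.x: disables the experimental search API and its ingester overhead
search_enabled: false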
I think I found the issue here: https://github.com/grafana/helm-charts/blob/main/charts/tempo/values.yaml#L106
After changing that to the block_retention field, memory usage went back to normal. compacted_block_retention is set to 1h by default in Tempo.
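For context, a minimal sketch of the two compactor settings being contrasted here (48h mirrors the retention value from the report above):

compactor:
  compaction:
    # What the chart's retention value should map to: how long written
    # blocks are kept before deletion.
    block_retention: 48h
    # How long already-compacted blocks are kept around; Tempo's own
    # default is 1h. Mapping the 48h retention here instead makes the
    # blocklist (and memory) grow.
    compacted_block_retention: 1h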
This issue has been automatically marked as stale because it has not had any activity in the past 60 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity. Please apply keepalive label to exempt this Issue.