Backend response duration is too high
Describe the bug
Hi! We use Grafana Tempo on our team. We recently ran into an issue where even a simple query takes a very long time. How can we tune Tempo's performance so that traces show up immediately?
Query:
{resource.service.name="${service_name}" && resource.env="${env}" && 400 <= .http.status_code && .http.status_code != 404 && .http.status_code < 600}
Response time:
To Reproduce
Steps to reproduce the behavior:
- Start chart tempo-operational v1.7.1
- Perform read operations
- Wait a long time for results
Expected behavior
See traces immediately.
Environment:
- Kubernetes v1.24
- helm tempo-distributed chart
Additional Context
We use the tempo-operational dashboard to monitor our Tempo instance. Here is what we see in the screenshots below:
We think the problem lies directly in the Querier component, so we gave it a lot of resources, but it still works slowly:
```yaml
querier:
  replicas: 2
  resources:
    requests:
      cpu: 2
      memory: 2Gi
    limits:
      cpu: 8
      memory: 10Gi
```
How can we boost performance?
There are lots of ways to improve TraceQL performance! Listed in the order I think you should consider them:
- An instant big win would be to add scopes to all of your attributes:
{resource.service.name="${service_name}" && resource.env="${env}" && 400 <= span.http.status_code && span.http.status_code != 404 && span.http.status_code < 600}
- Set up gRPC streaming: https://grafana.com/docs/tempo/latest/api_docs/#tempo-grpc-api. This also (currently) requires setting a Grafana feature flag.
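For reference, streaming is toggled in two places: on the Tempo side and in Grafana. A minimal sketch, assuming the `stream_over_http_enabled` option and the `traceQLStreaming` feature toggle described in the docs above — verify both names against your Tempo and Grafana versions:

```yaml
# Tempo configuration (top level): also serve streaming query
# results over the HTTP port so Grafana can consume them.
stream_over_http_enabled: true
```

```ini
# grafana.ini: enable the TraceQL streaming feature toggle
[feature_toggles]
enable = traceQLStreaming
```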
- Use the multiple caching layers added in 2.4: https://grafana.com/docs/tempo/next/configuration/#cache
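As a rough sketch of what the 2.4 cache block looks like — the role names come from the linked configuration docs, and the memcached hostnames are placeholders for your own endpoints:

```yaml
cache:
  caches:
    # Cache search results at the query frontend.
    - roles:
        - frontend-search
      memcached:
        host: memcached-frontend.example.svc:11211
    # Cache bloom filters and parquet footers read from the backend.
    - roles:
        - bloom
        - parquet-footer
      memcached:
        host: memcached-storage.example.svc:11211
```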
- Tune the search performance configurables: https://grafana.com/docs/tempo/latest/operations/backend_search/. This advice is a bit out of date and only applies once you start scaling Tempo quite large. I would ignore the serverless parts (we have had issues getting good performance from them), but the discussion of the major tunables is still correct. If you are running 50+ queriers, I would start to care about this.
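The major tunables on that page boil down to how many jobs a search is split into and how much work each querier takes on in parallel. A sketch under those assumptions — the option names come from the backend search docs, but the values here are illustrative, not recommendations:

```yaml
querier:
  # How many jobs a single querier will work on at once.
  max_concurrent_queries: 20

query_frontend:
  search:
    # How many search jobs the frontend keeps in flight per query.
    concurrent_jobs: 1000
    # How much block data (in bytes) each job should cover.
    target_bytes_per_job: 104857600  # ~100 MiB
```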
@joe-elliott thank you for your reply! I'll try it next week and report back on what really helped us!
This issue has been automatically marked as stale because it has not had any activity in the past 60 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity. Please apply the keepalive label to exempt this issue.
I apologize for the long delay; here is our feedback. We used your suggestions 1, 3, and 4, and I can say with certainty that your ordering by performance value was exactly right for us. We saw good changes immediately from optimizing our TraceQL queries. The second change helped a lot as well: we now drop the heaviest and least useful attributes in our collector, and since making these changes our storage has been filling up more slowly and Tempo search has been noticeably faster. The fourth suggestion brought the most minor improvement, but it's still better than nothing :) Thank you again, @joe-elliott!