
Tempo error failed to get trace with id *** Status: 404 Not Found Body: trace not found

Amikh opened this issue 1 year ago • 17 comments

Error: Failed to retrieve trace with ID ****. Status: 404 Not Found. Body: Trace not found. We use Loki, Tempo, and OpenTelemetry; all applications run in Kubernetes (1.27). We deploy with a Helm chart and ArgoCD, and use Azure Storage for the backend. Grafana view: image. Log from pod: image. Block created in blob storage: image.

How can I fix this?

Amikh avatar Nov 29 '23 16:11 Amikh

Please review this troubleshooting doc: https://grafana.com/docs/tempo/latest/troubleshooting/unable-to-see-trace/

joe-elliott avatar Nov 29 '23 16:11 joe-elliott

Hi Joe, I didn't find any errors, neither in the metrics nor in the pod. image Log from pod: image

Can you help with this?

Amikh avatar Dec 03 '23 08:12 Amikh

Can you please follow all the steps in the troubleshooting guide?

joe-elliott avatar Dec 04 '23 15:12 joe-elliott

I checked the following metrics (in the attached file) at http://:3100/metrics:

tempo_ingester_traces_created_total{tenant="single-tenant"} 147
tempo_distributor_spans_received_total{tenant="single-tenant"} 294

I also checked index.json.gz in my Azure storage and can see a new blockID being created, so I don't think it's case 1 or case 2 from the guide, and the query-frontend log has no 500 errors. Loki is working: the trace ID appears in the logs, and I can see it inside the pod in a file, but it is not found by the Tempo search:

level=info ts=2023-12-04T15:40:18.144120877Z caller=tempodb.go:343 org_id=single-tenant msg="searching for trace in block" findTraceID=28624e65e9781d2744f6f743e3 block=fb740a87-49f2d-376baa2377c2 found=false

How can I check if a trace ID has been written to a block, and what additional steps should I take to ensure thorough verification?

Amikh avatar Dec 04 '23 18:12 Amikh

You can enable logging of received spans to see whether Tempo is ingesting any trace IDs.
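For reference, enabling this in the distributor section of the Tempo config might look like the following. This is a sketch based on the documented log_received_spans option; exact field names can differ between Tempo versions, so verify against the configuration docs for your release:

```yaml
distributor:
  log_received_spans:
    enabled: true
    # Log only span/trace identifiers and timestamps rather than every attribute.
    include_all_attributes: false
    # Set to true to log only spans whose status is an error.
    filter_by_status_error: false
```

With this enabled, the distributor logs each received span, so you can grep its output for a trace ID to confirm the span reached Tempo at all.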

You can also run the tempo-cli query blocks command to dig into the blocks directly.
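As a sketch, an invocation against the Azure backend for the trace ID from the earlier comment might look like this; the backend flags are assumptions, so check tempo-cli query blocks --help for your version:

```shell
tempo-cli query blocks 28624e65e9781d2744f6f743e3 single-tenant \
  --backend=azure \
  --bucket=<your-container-name>
```

This reads the blocks straight from object storage, bypassing the queriers, which tells you whether the trace was ever durably written.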

joe-elliott avatar Dec 04 '23 18:12 joe-elliott

@Amikh Hello, did you find a solution?

nurmukhamed1112 avatar Jan 18 '24 11:01 nurmukhamed1112

I have a similar issue: I can search traces fine as long as they're less than a couple of days old. I'm on GKE with Cloud Storage, also Kubernetes 1.27, but using Splunk instead of Loki. Developers are often confronted with a 404 error when searching for trace IDs.

I just ran through your troubleshooting guide (https://grafana.com/docs/tempo/latest/troubleshooting/unable-to-see-trace/) and all the traces are being consumed; I have some nice pretty graphs. They're just not being returned after a period of time. I've also checked the logs for the errors outlined in the guide and everything shows successful.

Our retention on the bucket is 90 days, yet developers can't search 2 days back for a trace that they viewed 2 days ago, so they know it existed. The query returns a 404 immediately, so I don't believe it to be a timeout. I'm using the tempo-distributed Helm chart to deploy, minimally overridden for resources, replicas, and an endpoint for the metrics-generator.

image image

jlcrow avatar Jan 20 '24 01:01 jlcrow


I just pulled the Tempo config from the cluster and I see the compactor defaults to a 48h block retention in your Helm chart, while all your documentation says the default is 14 days. image
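If the 48h chart default is the cause, the retention can be raised in the tempo-distributed values. A sketch; the exact key path is an assumption, so verify it against your chart version before applying:

```yaml
compactor:
  config:
    compaction:
      # Keep blocks for 14 days instead of the chart's 48h default.
      block_retention: 336h
```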

jlcrow avatar Jan 20 '24 01:01 jlcrow

@nurmukhamed1112 No, I haven't found a solution, and changing the block retention didn't work for me; I'm still getting the same error. Let me know if you find another way to fix it.

Amikh avatar Jan 21 '24 20:01 Amikh

I have more or less the same issue, but what I encounter is the following:

  1. My Tempo runs in a cluster containing 2 nodes.
  2. Node 1 receives a trace from the Grafana Agent.
  3. Via the Grafana UI I want to display that trace, but Grafana asks Node 2. Since the trace is still in memory on Node 1, Node 2 is not able to find it on S3.

I have this assumption because I only see this problem on very new traces (less than 30 minutes old). Displaying any older trace is no problem.

(My knowledge of how Tempo itself works and handles data within a memberlist is very limited.)

JustNZ avatar Jan 31 '24 18:01 JustNZ

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity. Please apply keepalive label to exempt this Issue.

github-actions[bot] avatar Apr 01 '24 00:04 github-actions[bot]

We have the same issue. We use the distributed version with GCS as storage. The Explore page returns 404, but the GCS bucket still contains the block.

mglotov avatar Apr 01 '24 07:04 mglotov

@joe-elliott I have the same issue, in my case it seems related to the querier.

For example, I tried to look up traceId 00819cd8844bfd02e96630db1d5336e0 earlier today (at 16:09 UTC). As you can see from the logs, it searched several blocks until it found the trace in block 02c5d1c6-6c01-461e-8a70-31c4fb2af8c6.

2024-05-06T16:09:23.348781001Z level=info ts=2024-05-06T16:09:23.348678215Z caller=poller.go:131 msg="blocklist poll complete" seconds=0.095456498
2024-05-06T16:09:23.615993195Z level=info ts=2024-05-06T16:09:23.615854427Z caller=tempodb.go:330 org_id=single-tenant msg="searching for trace in block" findTraceID=00819cd8844bfd02e96630db1d5336e0 block=912f6da7-7e06-405d-b953-dc868678b01d found=false
2024-05-06T16:09:23.617909082Z level=info ts=2024-05-06T16:09:23.617814124Z caller=tempodb.go:330 org_id=single-tenant msg="searching for trace in block" findTraceID=00819cd8844bfd02e96630db1d5336e0 block=7cc9add7-95e2-4b0b-828b-69468e13c9cf found=false
2024-05-06T16:09:23.619635599Z level=info ts=2024-05-06T16:09:23.619547753Z caller=tempodb.go:330 org_id=single-tenant msg="searching for trace in block" findTraceID=00819cd8844bfd02e96630db1d5336e0 block=1df50097-4419-414e-9bcd-7c1c26c0293b found=false
2024-05-06T16:09:23.620036741Z level=info ts=2024-05-06T16:09:23.619974307Z caller=tempodb.go:330 org_id=single-tenant msg="searching for trace in block" findTraceID=00819cd8844bfd02e96630db1d5336e0 block=f6855ad8-d01d-4c2f-996e-6a7f31cefeac found=false
2024-05-06T16:09:23.620709883Z level=info ts=2024-05-06T16:09:23.620640416Z caller=tempodb.go:330 org_id=single-tenant msg="searching for trace in block" findTraceID=00819cd8844bfd02e96630db1d5336e0 block=35867aae-8157-440f-877b-173f55240ff9 found=false
2024-05-06T16:09:23.621599900Z level=info ts=2024-05-06T16:09:23.621415883Z caller=tempodb.go:330 org_id=single-tenant msg="searching for trace in block" findTraceID=00819cd8844bfd02e96630db1d5336e0 block=55196cac-b6ae-45d4-a076-29f04d80ae3f found=false
2024-05-06T16:09:23.622123052Z level=info ts=2024-05-06T16:09:23.622053024Z caller=tempodb.go:330 org_id=single-tenant msg="searching for trace in block" findTraceID=00819cd8844bfd02e96630db1d5336e0 block=748974c5-ee99-46da-8ff8-7d59df62a66e found=false
2024-05-06T16:09:23.622682830Z level=info ts=2024-05-06T16:09:23.622606823Z caller=tempodb.go:330 org_id=single-tenant msg="searching for trace in block" findTraceID=00819cd8844bfd02e96630db1d5336e0 block=8bc70bc9-059c-4e29-a782-ee7ef1274b56 found=false
2024-05-06T16:09:23.623373815Z level=info ts=2024-05-06T16:09:23.62329715Z caller=tempodb.go:330 org_id=single-tenant msg="searching for trace in block" findTraceID=00819cd8844bfd02e96630db1d5336e0 block=b574fdd4-4f91-401b-b5af-b2d2d73d21b6 found=false
2024-05-06T16:09:23.625273373Z level=info ts=2024-05-06T16:09:23.625190876Z caller=tempodb.go:330 org_id=single-tenant msg="searching for trace in block" findTraceID=00819cd8844bfd02e96630db1d5336e0 block=9c7564fa-9553-4b1f-b8b6-1af5573fed16 found=false
2024-05-06T16:09:23.628648535Z level=info ts=2024-05-06T16:09:23.628569828Z caller=tempodb.go:330 org_id=single-tenant msg="searching for trace in block" findTraceID=00819cd8844bfd02e96630db1d5336e0 block=bb52d0f0-6c0d-4a3c-b186-ac519c729625 found=false
2024-05-06T16:09:23.629079336Z level=info ts=2024-05-06T16:09:23.628676106Z caller=tempodb.go:330 org_id=single-tenant msg="searching for trace in block" findTraceID=00819cd8844bfd02e96630db1d5336e0 block=babea682-b9e8-4d46-92c0-9d9853c188bd found=false
2024-05-06T16:09:23.629207866Z level=info ts=2024-05-06T16:09:23.629142749Z caller=tempodb.go:330 org_id=single-tenant msg="searching for trace in block" findTraceID=00819cd8844bfd02e96630db1d5336e0 block=da12b09a-02cd-422c-997f-ad316c037447 found=false
2024-05-06T16:09:23.629750401Z level=info ts=2024-05-06T16:09:23.629692108Z caller=tempodb.go:330 org_id=single-tenant msg="searching for trace in block" findTraceID=00819cd8844bfd02e96630db1d5336e0 block=b641c420-57d2-420b-b01f-f04c991c34f8 found=false
2024-05-06T16:09:23.630105911Z level=info ts=2024-05-06T16:09:23.629972355Z caller=tempodb.go:330 org_id=single-tenant msg="searching for trace in block" findTraceID=00819cd8844bfd02e96630db1d5336e0 block=d07fd6a3-c776-4509-888d-e4cea13644ca found=false
2024-05-06T16:09:24.236458056Z level=info ts=2024-05-06T16:09:24.236240973Z caller=tempodb.go:330 org_id=single-tenant msg="searching for trace in block" findTraceID=00819cd8844bfd02e96630db1d5336e0 block=02c5d1c6-6c01-461e-8a70-31c4fb2af8c6 found=true

Then 7 minutes later (at 16:16 UTC), without leaving the Grafana Explore UI, I searched for the exact same traceId, and it couldn't find it:

2024-05-06T16:16:13.135455503Z level=info ts=2024-05-06T16:16:13.13505275Z caller=tempodb.go:330 org_id=single-tenant msg="searching for trace in block" findTraceID=00819cd8844bfd02e96630db1d5336e0 block=a5c40b0b-0908-4cb7-8562-1545d45525a1 found=false
2024-05-06T16:16:13.137428067Z level=info ts=2024-05-06T16:16:13.137144208Z caller=tempodb.go:330 org_id=single-tenant msg="searching for trace in block" findTraceID=00819cd8844bfd02e96630db1d5336e0 block=0a3b9c07-9bb0-48ba-a536-5085560e17b5 found=false
2024-05-06T16:16:13.138458087Z level=info ts=2024-05-06T16:16:13.138196051Z caller=tempodb.go:330 org_id=single-tenant msg="searching for trace in block" findTraceID=00819cd8844bfd02e96630db1d5336e0 block=e3a1b1c1-fde6-4a00-9e05-2e48635cece0 found=false
2024-05-06T16:16:13.139099170Z level=info ts=2024-05-06T16:16:13.138822337Z caller=tempodb.go:330 org_id=single-tenant msg="searching for trace in block" findTraceID=00819cd8844bfd02e96630db1d5336e0 block=be9c063a-a54d-410b-821a-31f47edc5d24 found=false
2024-05-06T16:16:13.142028669Z level=info ts=2024-05-06T16:16:13.141894338Z caller=tempodb.go:330 org_id=single-tenant msg="searching for trace in block" findTraceID=00819cd8844bfd02e96630db1d5336e0 block=6143fde7-8495-4e2a-bac5-c16bbbabea9f found=false
2024-05-06T16:16:13.142393633Z level=info ts=2024-05-06T16:16:13.142297115Z caller=tempodb.go:330 org_id=single-tenant msg="searching for trace in block" findTraceID=00819cd8844bfd02e96630db1d5336e0 block=aa9bd845-725a-475a-95ee-dfe39a667355 found=false
2024-05-06T16:16:13.142501179Z level=info ts=2024-05-06T16:16:13.142429661Z caller=tempodb.go:330 org_id=single-tenant msg="searching for trace in block" findTraceID=00819cd8844bfd02e96630db1d5336e0 block=6355d4e4-cf75-4fe2-9b6f-312ca60fd32e found=false
2024-05-06T16:16:13.143082695Z level=info ts=2024-05-06T16:16:13.142935509Z caller=tempodb.go:330 org_id=single-tenant msg="searching for trace in block" findTraceID=00819cd8844bfd02e96630db1d5336e0 block=8391a4dd-f3d9-45bc-bdb2-b2afa332a478 found=false
2024-05-06T16:16:13.143313458Z level=info ts=2024-05-06T16:16:13.139413378Z caller=tempodb.go:330 org_id=single-tenant msg="searching for trace in block" findTraceID=00819cd8844bfd02e96630db1d5336e0 block=4c3138fa-8888-43ec-a654-19bdb4c6f2b9 found=false
2024-05-06T16:16:13.143662965Z level=info ts=2024-05-06T16:16:13.139813672Z caller=tempodb.go:330 org_id=single-tenant msg="searching for trace in block" findTraceID=00819cd8844bfd02e96630db1d5336e0 block=2bcc8c8c-9931-4ce7-b3e4-bc59124ddb5c found=false
2024-05-06T16:16:13.143945348Z level=info ts=2024-05-06T16:16:13.140249615Z caller=tempodb.go:330 org_id=single-tenant msg="searching for trace in block" findTraceID=00819cd8844bfd02e96630db1d5336e0 block=e13a0317-7daa-42b0-93d6-6fef4ea7b0d5 found=false
2024-05-06T16:16:13.144273873Z level=info ts=2024-05-06T16:16:13.140544814Z caller=tempodb.go:330 org_id=single-tenant msg="searching for trace in block" findTraceID=00819cd8844bfd02e96630db1d5336e0 block=c9995c70-5ac8-488d-bf75-839134f86f54 found=false
2024-05-06T16:16:13.144531871Z level=info ts=2024-05-06T16:16:13.140911649Z caller=tempodb.go:330 org_id=single-tenant msg="searching for trace in block" findTraceID=00819cd8844bfd02e96630db1d5336e0 block=fee97ff1-e8ba-41a0-b558-9f5fe5d625b0 found=false
2024-05-06T16:16:13.144731770Z level=info ts=2024-05-06T16:16:13.143958815Z caller=tempodb.go:330 org_id=single-tenant msg="searching for trace in block" findTraceID=00819cd8844bfd02e96630db1d5336e0 block=5dbbeee0-5372-4e63-af99-a27f4ffef13c found=false
2024-05-06T16:16:13.144939505Z level=info ts=2024-05-06T16:16:13.14471401Z caller=tempodb.go:330 org_id=single-tenant msg="searching for trace in block" findTraceID=00819cd8844bfd02e96630db1d5336e0 block=6693ab40-d1f0-4845-9e54-2100812b8f7b found=false
2024-05-06T16:16:13.145502617Z level=info ts=2024-05-06T16:16:13.145423265Z caller=tempodb.go:330 org_id=single-tenant msg="searching for trace in block" findTraceID=00819cd8844bfd02e96630db1d5336e0 block=08c58d2c-fed8-4c5b-9de7-c45b267ae5cf found=false
2024-05-06T16:16:15.617527780Z level=info ts=2024-05-06T16:16:15.615539994Z caller=tempodb.go:330 org_id=single-tenant msg="searching for trace in block" findTraceID=00e24ea8015aaaa8d77ed3ad934eb28a block=aa9bd845-725a-475a-95ee-dfe39a667355 found=false

In this second attempt, for some reason it didn't search block 02c5d1c6-6c01-461e-8a70-31c4fb2af8c6 (where the trace was previously found).

Any clues what could be wrong?

AlissonRS avatar May 06 '24 16:05 AlissonRS

As you can see from logs, it searched on several blocks, until it found in block 02c5d1c6-6c01-461e-8a70-31c4fb2af8c6

I would check your compactor logs for that block GUID. We can use the logs to determine whether the compactor dropped the block because it thought it was out of retention, or whether the block was compacted away.

joe-elliott avatar May 08 '24 20:05 joe-elliott

@joe-elliott I think my issue is caused by a misconfiguration on my Grafana instance.

I searched for traces within a specific time range: 2024-05-07 07:40:00 to 2024-05-07 07:45:59 (as you can see in the recent queries on screenshot).

It listed several traces. When I clicked on one of them, both the left and right panels for some reason automatically updated the time to 06:40–06:45 (it decreased by one hour), so the right panel was not able to find the trace.

I deploy Grafana using Helm. This is my values.yaml; any clues what could be wrong or where I should look?

    additionalDataSources:
      - name: Loki
        uid: loki
        access: proxy
        editable: false
        jsonData:
          maxLines: 1000
        orgId: 1
        type: loki
        timeout: 180
        url: http://loki-gateway.monitoring
      - name: Tempo
        type: tempo
        uid: tempo
        url: http://monitoring-tempo-query-frontend.monitoring:3100
        access: proxy
        basicAuth: false
        jsonData:
          tracesToLogsV2:
            # Field with an internal link pointing to a logs data source in Grafana.
            # datasourceUid value must match the uid value of the logs data source.
            datasourceUid: "loki"
            spanStartTimeShift: "1h"
            spanEndTimeShift: "-1h"
            tags: ["job", "instance", "pod", "namespace"]
            filterByTraceID: false
            filterBySpanID: false
            customQuery: true
            query: 'method="${__span.tags.method}"'
          tracesToMetrics:
            datasourceUid: "prom"
            spanStartTimeShift: "1h"
            spanEndTimeShift: "-1h"
            tags: [{ key: "service.name", value: "service" }, { key: "job" }]
            queries:
              - name: "Sample query"
                query: "sum(rate(traces_spanmetrics_latency_bucket{$$__tags}[5m]))"
          serviceMap:
            datasourceUid: "prometheus"
          nodeGraph:
            enabled: true
          search:
            hide: false
          lokiSearch:
            datasourceUid: "loki"
          traceQuery:
            timeShiftEnabled: true
            spanStartTimeShift: "1h"
            spanEndTimeShift: "-1h"
          spanBar:
            type: "Tag"
            tag: "http.path"

image

AlissonRS avatar May 08 '24 20:05 AlissonRS

When I clicked on one of the traces, both the left and right panels for some reason automatically updated the "time" to 06:40-06:45 (it decreased by one hour), thus making the right panel not being able to find the trace.

Possibly a timezone issue. Grafana only passes the time to Tempo in certain circumstances, though. Can you check the network tab to see whether the time is even being passed on the trace-by-ID query?

joe-elliott avatar May 09 '24 11:05 joe-elliott

@joe-elliott I confirmed this to be a bug, not related to the timezone. Reported here: grafana/grafana#87608.

It's worth noting that if you set up the time shift like this:

          traceQuery:
            timeShiftEnabled: true
            spanStartTimeShift: "1h"
            spanEndTimeShift: "1h"

You would not get a 404 error, because the time range would just keep expanding (1h subtracted from the start, 1h added to the end). But if anyone follows the Grafana docs, which look like this:

          traceQuery:
            timeShiftEnabled: true
            spanStartTimeShift: "1h"
            spanEndTimeShift: "-1h"

Then you'll eventually get a 404, because the "end" also gets shifted back by 1h every time you click a new trace. The shorter the time range, the faster you hit the 404, because the range eventually moves outside the trace's actual time.

AlissonRS avatar May 10 '24 04:05 AlissonRS

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity. Please apply keepalive label to exempt this Issue.

github-actions[bot] avatar Jul 10 '24 00:07 github-actions[bot]