
Grafana Agent incorrectly tags traces received from Kubernetes pods

Open chenfeilee opened this issue 3 years ago • 4 comments

Describe the bug We have deployed grafana-agent to AWS EKS to collect traces. We were seeing different tag values (added via Kubernetes service discovery) for different spans of the same trace.

It seems this happened because the running pod (which the traces were being sent from) had the same private IP address as another, already completed pod.

To Reproduce Steps to reproduce the behavior:

  1. Deploy grafana-agent on AWS EKS
  2. Start sending traces to grafana-agent
  3. Look for trace in Grafana

Expected behavior Spans are tagged correctly (in this case, only with metadata from the pod that was running at the time the traces were received).

Environment:

  • Infrastructure: AWS EKS

Additional Context


chenfeilee avatar Feb 14 '22 15:02 chenfeilee

@joe-elliott any thoughts on this? thanks.

chenfeilee avatar Feb 14 '22 15:02 chenfeilee

Hi @chenfeilee! The Grafana Agent uses Prometheus service discovery to get metadata from Kubernetes, so it should be possible to drop Succeeded pods with PromSD configuration.

When using role: pod, there is a metadata label __meta_kubernetes_pod_phase that contains the pod's phase. You can add a relabel config that uses that label to drop targets whose phase is Succeeded. This way, completed pods are dropped and traces are tagged only with Running pods' metadata.

Example config:

traces:
    configs:
        - name: <name>
          ...
          scrape_configs:
            - job_name: <job-name>
              kubernetes_sd_configs:
                - role: pod
              relabel_configs:
                - action: drop
                  source_labels:
                    - __meta_kubernetes_pod_phase
                  regex: Succeeded
                  ...

mapno avatar Feb 15 '22 11:02 mapno

@mapno I have given it a try and it works! thank you so much!

One question though: do I have to drop the Failed pods too to avoid the same issue, or does the issue only occur with Succeeded pods?

chenfeilee avatar Feb 16 '22 14:02 chenfeilee

That's great to hear!

One question though: do I have to drop the Failed pods too to avoid the same issue, or does the issue only occur with Succeeded pods?

I think so, yes. Another alternative is using action: keep to keep only targets whose phase matches Running. That should be foolproof, since pods can also be in the Pending and Unknown states, and I don't think you'll want to keep any of those either.
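For reference, a minimal sketch of that keep-based variant. It reuses the same placeholder names (<name>, <job-name>) as the example above; only the relabel_configs block changes:

traces:
    configs:
        - name: <name>
          ...
          scrape_configs:
            - job_name: <job-name>
              kubernetes_sd_configs:
                - role: pod
              relabel_configs:
                # Keep only targets whose pod is in the Running phase;
                # Succeeded, Failed, Pending and Unknown pods are all dropped.
                - action: keep
                  source_labels:
                    - __meta_kubernetes_pod_phase
                  regex: Running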

mapno avatar Feb 16 '22 17:02 mapno

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity. Please apply keepalive label to exempt this Issue.

github-actions[bot] avatar Nov 16 '22 00:11 github-actions[bot]