
Service graph not created from Zipkin trace

Open mecseid opened this issue 2 years ago • 11 comments

Describe the bug I made some requests like the attached trace JSON, but the metrics_generator did not create any service graph metrics.

To Reproduce I attached the JSON export of the trace. trace.json.txt

Expected behavior A service graph is generated with 4 services (api-gateway, sales-service, crm-service, document-generator-service).

Environment:

  • Infrastructure: Kubernetes
  • Deployment tool: helm
  • Version: 1.4.1

Configuration Attached tempo.yaml file:

    metrics_generator_enabled: true
    metrics_generator:
      storage:
        path: /var/tempo/generator/wal
        remote_write:
          - url: http://rancher-monitoring-prometheus.cattle-monitoring-system.svc:9090/api/v1/write
            send_exemplars: true
    multitenancy_enabled: false
    search_enabled: true
    compactor:
      compaction:
        compacted_block_retention: 744h
    distributor:
      receivers:
        opencensus: null
        zipkin:
          endpoint: 0.0.0.0:9411
    ingester: {}
    server:
      http_listen_port: 3100
    storage:
      trace:
        backend: local
        local:
          path: /var/tempo/traces
        wal:
          path: /var/tempo/wal
    overrides:
      metrics_generator_processors:
        - service-graphs
        - span-metrics
      per_tenant_override_config: /conf/overrides.yaml

mecseid avatar Jun 16 '22 12:06 mecseid

Zipkin has a "shared span" approach to RPC calls. In this case the client and server spans share the same span ID. We account for that on the query path, but I'm wondering if the service graph code handles it as well.

https://zipkin.io/zipkin/2.11.3/zipkin/zipkin2/Span.html#shared-- "When an RPC trace is client-originated, it will be sampled and the same span ID is used for the server side."

joe-elliott avatar Jun 16 '22 13:06 joe-elliott

From a glance, I can't really see why this would trip up the service graphs processor. I will have to write a test and dig a bit deeper into how Zipkin traces are processed internally.

kvrhdn avatar Jun 16 '22 13:06 kvrhdn

@joe-elliott @kvrhdn Thank you so much! We generate the traces with Spring Cloud Sleuth, which has a stable integration only with Zipkin and an experimental one with OTel (as far as I know) 😞.

If I can help anyhow, please tell me.

mecseid avatar Jun 16 '22 13:06 mecseid

Hey peeps! Is there anything I can help with on this?

nicolastakashi avatar Aug 01 '22 11:08 nicolastakashi

@nicolastakashi we don't have any zipkin-instrumented applications internally, which makes it quite difficult for us to diagnose and fix this. If you are producing zipkin traces (and feel up to the work), we'd love to see a PR to the metrics generator that makes service graphs work for zipkin.

The basic problem is that the client and server spans share the same span id and the same parent id in zipkin. The current code expects the server span to have a parent id equal to the client span's id. It uses this relationship to match up two spans and create an edge:

https://github.com/grafana/tempo/blob/main/modules/generator/processor/servicegraphs/servicegraphs.go#L138

It would require some thought, but I'm hopeful we can create logic that would correctly process both types of spans.

Jaeger/OTel:

server span (id: 1, parent id: 0)
  client span (id: 2, parent id: 1)

Zipkin:

server span (id: 1, parent id: 0)
  client span (id: 1, parent id: 0)

We have code that does this on the query path: https://github.com/grafana/tempo/blob/main/modules/frontend/deduper.go#L125

but I'm not sure how applicable it is here.
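
For illustration, a minimal Go sketch of the two matching rules just described; the span struct and helper names here are hypothetical, not Tempo's actual types:

    package main

    import "fmt"

    // span is a simplified, hypothetical representation; Tempo's real
    // processor works on OTLP spans, not this struct.
    type span struct {
        spanID   string
        parentID string
    }

    // parentChildMatch is the Jaeger/OTel rule: the downstream server span
    // is a child of the upstream client span.
    func parentChildMatch(client, server span) bool {
        return server.parentID == client.spanID
    }

    // sharedSpanMatch is the Zipkin shared-span rule: client and server
    // reuse the same span ID (and therefore the same parent ID).
    func sharedSpanMatch(client, server span) bool {
        return client.spanID == server.spanID && client.parentID == server.parentID
    }

    func main() {
        // Jaeger/OTel: client span id 2 is the parent of the downstream server span.
        fmt.Println(parentChildMatch(span{"2", "1"}, span{"3", "2"})) // true

        // Zipkin: client and server share span id 1 and parent id 0.
        fmt.Println(sharedSpanMatch(span{"1", "0"}, span{"1", "0"})) // true
    }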

joe-elliott avatar Aug 02 '22 15:08 joe-elliott

@joe-elliott I'll try my best to set it up locally; I have services sending Zipkin spans, and I guess I can debug it quite easily.

nicolastakashi avatar Aug 03 '22 07:08 nicolastakashi

Hey @joe-elliott, I was able to run it locally and debug it, and with some effort I think I understood the issue you've described.

Let me share a few tools that can help you test Zipkin scenarios in the future:

1 - There is a repository where you can find JSON examples of Zipkin traces and POST them to the Tempo distributor, as you can see here (a minimal sketch of such a POST follows this list).

2 - If I'm not wrong, the synthetic-load-generator has an option, a flag called --zipkinV2JsonUrl, that we can also point at the distributor.
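
To make item 1 concrete, a minimal Go sketch that reads a Zipkin v2 JSON file and POSTs it to the distributor; the localhost:9411 address matches the config above, while the trace.json file name and the /api/v2/spans path are assumptions to adjust for your setup:

    package main

    import (
        "bytes"
        "fmt"
        "log"
        "net/http"
        "os"
    )

    func main() {
        // Assumes the distributor's Zipkin receiver listens on localhost:9411
        // (as in the config above) and accepts the standard Zipkin v2 JSON
        // endpoint; adjust the address and file name for your setup.
        const url = "http://localhost:9411/api/v2/spans"

        body, err := os.ReadFile("trace.json") // a Zipkin v2 JSON span array
        if err != nil {
            log.Fatalf("reading trace file: %v", err)
        }

        resp, err := http.Post(url, "application/json", bytes.NewReader(body))
        if err != nil {
            log.Fatalf("posting spans: %v", err)
        }
        defer resp.Body.Close()
        fmt.Println("distributor responded with", resp.Status)
    }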

Regarding the main issue, with a little bit of effort I think I understood it, and I would like to check with you before moving forward.

I was thinking of using the same dedupe strategy mentioned here, before executing the consume function in servicegraphs.go.

I've checked, and I think the contract is different, so we would either have to duplicate the code or try to abstract the deduper module.

WDYT?

nicolastakashi avatar Aug 07 '22 15:08 nicolastakashi

The issue with using the frontend deduper is that it expects the entire trace to be present. Currently the service graph processor holds parent or child spans in memory and watches for the relevant match. The way a "match" is lined up is what will need to change.

Right now I believe it's looking for two spans with kind server and client, where the server span's parent id has to match the client span's id. Support for producer/consumer spans was also added recently, with the same span-id requirements.
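
As a rough illustration of how that matching could be extended, a hypothetical key-building sketch (the names and key format are made up, and the real logic lives in servicegraphs.go; detecting whether a span is a shared span remains the open question):

    package main

    import "fmt"

    // pairingKey sketches how the processor's match key could account for
    // both span models. This is illustrative only, not the actual
    // servicegraphs.go logic; one option for shared-span detection would be
    // to try both keys.
    func pairingKey(traceID, spanID, parentID, kind string, sharedSpan bool) string {
        switch kind {
        case "client":
            // A client span is stored under its own span ID so the matching
            // server span can find it.
            return traceID + ":" + spanID
        case "server":
            if sharedSpan {
                // Zipkin shared span: the server reuses the client's span ID,
                // so key it by its own span ID.
                return traceID + ":" + spanID
            }
            // Jaeger/OTel: the server span is a child of the client span,
            // so key it by its parent span ID.
            return traceID + ":" + parentID
        }
        return ""
    }

    func main() {
        // Jaeger/OTel pair: client (span id 2) and server (parent id 2) collide.
        fmt.Println(pairingKey("t1", "2", "1", "client", false) ==
            pairingKey("t1", "3", "2", "server", false)) // true

        // Zipkin pair: client and server both carry span id 1 and collide.
        fmt.Println(pairingKey("t1", "1", "0", "client", true) ==
            pairingKey("t1", "1", "0", "server", true)) // true
    }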

joe-elliott avatar Aug 08 '22 12:08 joe-elliott

@joe-elliott If I disable this shared span functionality, will it work as a workaround?

To make the question specific to Spring Sleuth, here is the corresponding property: https://docs.spring.io/spring-cloud-sleuth/docs/current/reference/html/appendix.html

spring.sleuth.supports-join (default true)
True means the tracing system supports sharing a span ID between a client and server.

mecseid avatar Aug 14 '22 17:08 mecseid

I do know that "shared span" is what zipkin calls the model where the same span id is shared between client and server, but I have no idea what that setting does :).

Give it a try and let us know!

joe-elliott avatar Aug 15 '22 18:08 joe-elliott

@joe-elliott Update for https://github.com/grafana/tempo/issues/1495#issuecomment-1214421927: Looks like if I disable this function, it works as expected, so whoever uses Spring Sleuth can disable it and use Tempo as expected. :)
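
For anyone who lands here later, a minimal sketch of that setting in a Spring Boot application.yml (the file name and placement are assumptions; the property itself and its meaning come from the Sleuth docs linked above):

    # application.yml -- disables Zipkin-style span-ID sharing so the client
    # and the server get distinct span IDs, which the service graph processor
    # can match with its existing parent/child logic.
    spring:
      sleuth:
        supports-join: false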

mecseid avatar Aug 19 '22 09:08 mecseid

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity. Please apply keepalive label to exempt this Issue.

github-actions[bot] avatar Nov 14 '22 00:11 github-actions[bot]

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity. Please apply keepalive label to exempt this Issue.

github-actions[bot] avatar Feb 04 '23 00:02 github-actions[bot]