
Error ResourceExhausted desc = grpc: received message larger than max

irizzant opened this issue 3 years ago · 4 comments

Describe the bug

I am receiving the following error while querying Tempo for a specific trace id through Grafana Explore:

failed to get trace with id: e7b55c0454a261c49c691c37e4964f88 Status: 500 Internal Server Error Body: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (6735537 vs. 4194304) 

I have a Java application producing traces, instrumented with https://github.com/open-telemetry/opentelemetry-java . The application logs are clean and traces are exported successfully to my Tempo backend through the OTel exporter via gRPC. I can successfully display other traces produced by the same app in Tempo.

I read https://github.com/grafana/tempo/issues/860 and so I set max_recv_msg_size_mib; here is the distributor's tempo.yaml:

tempo.yaml:
----
compactor: {}
distributor:
    receivers:
        jaeger:
            protocols:
                grpc: null
                thrift_http: null
        otlp:
            protocols:
                grpc:
                    max_recv_msg_size_mib: 134
                http: null
http_api_prefix: ""
ingester:
    lifecycler:
        ring:
            replication_factor: 3
memberlist:
    abort_if_cluster_join_fails: false
    bind_port: 7946
    join_members:
      - gossip-ring.grafana-test.svc.cluster.local:7946
overrides:
    per_tenant_override_config: /overrides/overrides.yaml
search_enabled: false
server:
    http_listen_port: 3200
storage:
    trace:
        backend: s3
        blocklist_poll: "0"
        cache: memcached
        gcs:
            bucket_name: tempo
            chunk_buffer_size: 1.048576e+07
        memcached:
            consistent_hash: true
            host: memcached
            service: memcached-client
            timeout: 200ms
        pool:
            queue_depth: 2000
        s3:
            access_key: tempo
            bucket: tempo
            endpoint: minio:9000
            insecure: true
            secret_key: supersecret
        wal:
            path: /var/tempo/wal

And my overrides.yaml:

overrides.yaml:
----
overrides:
    '*':
        ingestion_burst_size_bytes: 2e+07
        ingestion_rate_limit_bytes: 2e+07
        max_bytes_per_trace: 3e+07
        max_traces_per_user: 100000

And my querier tempo.yaml:

tempo.yaml:
----
compactor: {}
distributor: {}
http_api_prefix: ""
ingester:
    lifecycler:
        ring:
            replication_factor: 3
memberlist:
    abort_if_cluster_join_fails: false
    bind_port: 7946
    join_members:
      - gossip-ring.grafana-test.svc.cluster.local:7946
overrides:
    per_tenant_override_config: /overrides/overrides.yaml
querier:
    frontend_worker:
        frontend_address: query-frontend-discovery.grafana-test.svc.cluster.local:9095
        grpc_client_config:
            max_recv_msg_size: 1.34217728e+08
            max_send_msg_size: 1.34217728e+08
search_enabled: false
server:
    grpc_server_max_recv_msg_size: 1.34217728e+08
    grpc_server_max_send_msg_size: 1.34217728e+08
    http_listen_port: 3200
    log_level: debug
storage:
    trace:
        backend: s3
        blocklist_poll: 5m
        cache: memcached
        gcs:
            bucket_name: tempo
            chunk_buffer_size: 1.048576e+07
        memcached:
            consistent_hash: true
            host: memcached
            service: memcached-client
            timeout: 200ms
        pool:
            max_workers: 200
            queue_depth: 2000
        s3:
            access_key: tempo
            bucket: tempo
            endpoint: minio:9000
            insecure: true
            secret_key: supersecret
        wal:
            path: /var/tempo/wal

Still, I get the error above.

To Reproduce

Steps to reproduce the behavior:

  1. Start Tempo with the Tanka microservices configuration, enable the OTLP receiver, and set max_recv_msg_size_mib
  2. Export traces larger than 4194304 bytes (the 4 MiB gRPC default) to Tempo

Expected behavior

Tempo should be able to retrieve the trace.

Environment:

  • Infrastructure: Kubernetes (k3d)
  • Deployment tool: jsonnet (tanka) in microservices mode https://github.com/grafana/tempo/tree/main/example/tk/tempo-microservices

Additional Context

irizzant commented Nov 03 '21

#860 is about ingesting really large traces; you are encountering this error while querying. It's the same issue, but it's happening in a different location.

We also use gRPC between the Tempo querier and query-frontend, so most likely you are fetching a trace that, once combined, is too big to be received by the query-frontend. To solve this, increase grpc_server_max_recv_msg_size and grpc_server_max_send_msg_size in the server block: https://grafana.com/docs/tempo/latest/configuration/#server

You already did this for the querier, but I think you also have to do the same for the query-frontend.
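
For example, a minimal sketch of the server block in the query-frontend's tempo.yaml, assuming you reuse the same 128 MiB (1.34217728e+08 bytes) limits already set for the querier; tune the values to the largest trace you expect to serve:

tempo.yaml (query-frontend, sketch):
----
server:
    # 128 MiB; assumed to match the querier limits, adjust as needed
    grpc_server_max_recv_msg_size: 1.34217728e+08
    grpc_server_max_send_msg_size: 1.34217728e+08
    http_listen_port: 3200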

kvrhdn commented Nov 03 '21

@kvrhdn thanks for the hint.

Using the following tempo.yaml in the query-frontend fixed the issue:

tempo.yaml:
----
compactor: {}
distributor: {}
http_api_prefix: ""
ingester:
    lifecycler:
        ring:
            replication_factor: 3
memberlist:
    abort_if_cluster_join_fails: false
    bind_port: 7946
    join_members:
      - gossip-ring.grafana-test.svc.cluster.local:7946
overrides:
    per_tenant_override_config: /overrides/overrides.yaml
querier:
    frontend_worker:
        grpc_client_config:
            max_recv_msg_size: 1.34217728e+08
            max_send_msg_size: 1.34217728e+08
search_enabled: false
server:
    grpc_server_max_recv_msg_size: 1.34217728e+08
    grpc_server_max_send_msg_size: 1.34217728e+08
    http_listen_port: 3200
storage:
    trace:
        backend: s3
        blocklist_poll: "0"
        cache: memcached
        gcs:
            bucket_name: tempo
            chunk_buffer_size: 1.048576e+07
        memcached:
            consistent_hash: true
            host: memcached
            service: memcached-client
            timeout: 200ms
        pool:
            queue_depth: 2000
        s3:
            access_key: tempo
            bucket: tempo
            endpoint: minio:9000
            insecure: true
            secret_key: supersecret
        wal:
            path: /var/tempo/wal

There is something I'd like to clarify if possible. It's my understanding that trace queries are sent to the query-frontend component, which in turn forwards them to the queriers. Should the query-frontend configuration mirror the querier's then?

For example, if I set grpc_server_max_recv_msg_size and grpc_server_max_send_msg_size on the gRPC server, should they be the same in both the query-frontend and the querier?

Also, should the gRPC client config mirror the server one? For example, is it necessary to add the following to both the query-frontend and the querier?

frontend_worker:
    grpc_client_config:
        max_recv_msg_size: 1.34217728e+08
        max_send_msg_size: 1.34217728e+08

irizzant commented Nov 04 '21

I had a similar issue today trying to pull up some traces in Grafana.

failed to get trace with id: e24f83e9e4fc4a24 Status: 500 Internal Server Error Body: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (5375787 vs. 4194304)

Slack thread: https://grafana.slack.com/archives/C01D981PEE5/p1641881072031600

james-callahan commented Jan 11 '22

In the same thread @james-callahan mentioned that it could be a good idea to emit a warning if the gRPC message size limits are lower than max_bytes_per_trace. +1 from me on that.

annanay25 commented Jan 11 '22

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity. Please apply keepalive label to exempt this Issue.

github-actions[bot] commented Nov 21 '22