Error ResourceExhausted desc = grpc: received message larger than max
Describe the bug
I am receiving the following error while querying Tempo for a specific trace id through Grafana Explore:
failed to get trace with id: e7b55c0454a261c49c691c37e4964f88 Status: 500 Internal Server Error Body: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (6735537 vs. 4194304)
I have a Java application producing traces, instrumented with https://github.com/open-telemetry/opentelemetry-java . The application logs are clean, and traces are exported successfully to my Tempo backend through the OTel exporter over gRPC. I'm able to display other traces produced by the same app in Tempo.
I read https://github.com/grafana/tempo/issues/860 and so I set max_recv_msg_size_mib. Here is the distributor tempo.yaml:
tempo.yaml:
----
compactor: {}
distributor:
  receivers:
    jaeger:
      protocols:
        grpc: null
        thrift_http: null
    otlp:
      protocols:
        grpc:
          max_recv_msg_size_mib: 134
        http: null
http_api_prefix: ""
ingester:
  lifecycler:
    ring:
      replication_factor: 3
memberlist:
  abort_if_cluster_join_fails: false
  bind_port: 7946
  join_members:
  - gossip-ring.grafana-test.svc.cluster.local:7946
overrides:
  per_tenant_override_config: /overrides/overrides.yaml
search_enabled: false
server:
  http_listen_port: 3200
storage:
  trace:
    backend: s3
    blocklist_poll: "0"
    cache: memcached
    gcs:
      bucket_name: tempo
      chunk_buffer_size: 1.048576e+07
    memcached:
      consistent_hash: true
      host: memcached
      service: memcached-client
      timeout: 200ms
    pool:
      queue_depth: 2000
    s3:
      access_key: tempo
      bucket: tempo
      endpoint: minio:9000
      insecure: true
      secret_key: supersecret
    wal:
      path: /var/tempo/wal
And my overrides.yaml:
overrides.yaml:
----
overrides:
  '*':
    ingestion_burst_size_bytes: 2e+07
    ingestion_rate_limit_bytes: 2e+07
    max_bytes_per_trace: 3e+07
    max_traces_per_user: 100000
And my querier tempo.yaml:
tempo.yaml:
----
compactor: {}
distributor: {}
http_api_prefix: ""
ingester:
  lifecycler:
    ring:
      replication_factor: 3
memberlist:
  abort_if_cluster_join_fails: false
  bind_port: 7946
  join_members:
  - gossip-ring.grafana-test.svc.cluster.local:7946
overrides:
  per_tenant_override_config: /overrides/overrides.yaml
querier:
  frontend_worker:
    frontend_address: query-frontend-discovery.grafana-test.svc.cluster.local:9095
    grpc_client_config:
      max_recv_msg_size: 1.34217728e+08
      max_send_msg_size: 1.34217728e+08
search_enabled: false
server:
  grpc_server_max_recv_msg_size: 1.34217728e+08
  grpc_server_max_send_msg_size: 1.34217728e+08
  http_listen_port: 3200
  log_level: debug
storage:
  trace:
    backend: s3
    blocklist_poll: 5m
    cache: memcached
    gcs:
      bucket_name: tempo
      chunk_buffer_size: 1.048576e+07
    memcached:
      consistent_hash: true
      host: memcached
      service: memcached-client
      timeout: 200ms
    pool:
      max_workers: 200
      queue_depth: 2000
    s3:
      access_key: tempo
      bucket: tempo
      endpoint: minio:9000
      insecure: true
      secret_key: supersecret
    wal:
      path: /var/tempo/wal
Still, I get the error above.
To Reproduce
Steps to reproduce the behavior:
- Start Tempo with the Tanka microservices configuration, enable the otel receiver, and specify max_recv_msg_size_mib
- Export traces larger than 4194304 bytes to Tempo
Expected behavior
Tempo should be able to retrieve the trace.
Environment:
- Infrastructure: Kubernetes (k3d)
- Deployment tool: jsonnet (tanka) in microservices mode https://github.com/grafana/tempo/tree/main/example/tk/tempo-microservices
Additional Context
#860 is about ingesting really large traces; you are encountering this error while querying. It's the same issue, but it's happening in a different location.
We also use gRPC between the Tempo querier and query-frontend, so most likely you are fetching a trace that, once combined, is too big to be received by the query-frontend.
To solve this, increase grpc_server_max_recv_msg_size and grpc_server_max_send_msg_size in the server block: https://grafana.com/docs/tempo/latest/configuration/#server
You already did this for the querier, but I think you also have to do the same for the query-frontend.
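As a minimal sketch, the suggested change to the query-frontend config would look like this (the 128 MiB value simply mirrors the querier config above; the right limit depends on how large your traces get):
server:
  grpc_server_max_recv_msg_size: 1.34217728e+08   # 128 MiB, up from the ~4 MiB gRPC default seen in the error (4194304)
  grpc_server_max_send_msg_size: 1.34217728e+08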
@kvrhdn thanks for the hint.
Using the following tempo.yaml in the query-frontend fixed the issue:
tempo.yaml:
----
compactor: {}
distributor: {}
http_api_prefix: ""
ingester:
  lifecycler:
    ring:
      replication_factor: 3
memberlist:
  abort_if_cluster_join_fails: false
  bind_port: 7946
  join_members:
  - gossip-ring.grafana-test.svc.cluster.local:7946
overrides:
  per_tenant_override_config: /overrides/overrides.yaml
querier:
  frontend_worker:
    grpc_client_config:
      max_recv_msg_size: 1.34217728e+08
      max_send_msg_size: 1.34217728e+08
search_enabled: false
server:
  grpc_server_max_recv_msg_size: 1.34217728e+08
  grpc_server_max_send_msg_size: 1.34217728e+08
  http_listen_port: 3200
storage:
  trace:
    backend: s3
    blocklist_poll: "0"
    cache: memcached
    gcs:
      bucket_name: tempo
      chunk_buffer_size: 1.048576e+07
    memcached:
      consistent_hash: true
      host: memcached
      service: memcached-client
      timeout: 200ms
    pool:
      queue_depth: 2000
    s3:
      access_key: tempo
      bucket: tempo
      endpoint: minio:9000
      insecure: true
      secret_key: supersecret
    wal:
      path: /var/tempo/wal
There is something I'd like to clarify if possible. It's my understanding that trace queries are sent to the query-frontend component, which in turn forwards them to the querier. Should the query-frontend configuration mirror the querier's then?
For example, if I set grpc_server_max_recv_msg_size and grpc_server_max_send_msg_size on the gRPC server, should they be the same in both query-frontend and querier?
Also, should the gRPC client config mirror the server one? For example, is it necessary to add the following to both query-frontend and querier?
frontend_worker:
  grpc_client_config:
    max_recv_msg_size: 1.34217728e+08
    max_send_msg_size: 1.34217728e+08
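In other words, my (possibly wrong) reading of the configs above is that these limits sit on opposite sides of the same querier-to-query-frontend connection:
# querier tempo.yaml: client side of the querier <-> query-frontend connection
querier:
  frontend_worker:
    frontend_address: query-frontend-discovery.grafana-test.svc.cluster.local:9095
    grpc_client_config:
      max_recv_msg_size: 1.34217728e+08
      max_send_msg_size: 1.34217728e+08

# query-frontend tempo.yaml: server side of that same connection
server:
  grpc_server_max_recv_msg_size: 1.34217728e+08
  grpc_server_max_send_msg_size: 1.34217728e+08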
I had a similar issue today trying to pull up some traces in Grafana.
failed to get trace with id: e24f83e9e4fc4a24 Status: 500 Internal Server Error Body: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (5375787 vs. 4194304)
Slack thread: https://grafana.slack.com/archives/C01D981PEE5/p1641881072031600
In the same thread @james-callahan mentioned it could be a good idea to emit a warning if gRPC settings are lower than max_bytes_per_trace. +1 from me on that.
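For context, these are the two settings such a warning would compare, using the values from the configs in this thread (the comparison itself is just the point of the suggestion, not a documented rule):
# overrides.yaml: per-tenant cap on total trace size
overrides:
  '*':
    max_bytes_per_trace: 3e+07                     # 30 MB

# tempo.yaml (querier and query-frontend): gRPC message limits should be at
# least as large as max_bytes_per_trace, otherwise a trace that is accepted
# at ingest can still fail at query time with ResourceExhausted.
server:
  grpc_server_max_recv_msg_size: 1.34217728e+08   # 128 MiB
  grpc_server_max_send_msg_size: 1.34217728e+08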
This issue has been automatically marked as stale because it has not had any activity in the past 60 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity. Please apply keepalive label to exempt this Issue.