Stop signal is not handled gracefully by query-frontend
Describe the bug
When Tempo is deployed on Kubernetes via the Helm chart and Kubernetes sends a SIGTERM signal to the query-frontend pod (e.g. due to preemption in order to admit a critical pod), the query-frontend pod fails to stop gracefully within the default 30s window. Kubernetes then has to kill it with SIGKILL, and the pod is left hanging in an error state. As seen in the logs, multiple modules are stopped, but the process keeps logging the message below until it is killed:
level=warn ts=2025-10-11T00:29:52.146919925Z caller=server.go:1539 method=/frontend.Frontend/Process duration=374.843µs msg=gRPC err="queue is stopped"
Full log
2025-10-11 02:30:22.086 level=warn ts=2025-10-11T00:30:22.086496213Z caller=server.go:1539 method=/frontend.Frontend/Process duration=332.034µs msg=gRPC err="queue is stopped"
2025-10-11 02:30:22.025 level=warn ts=2025-10-11T00:30:22.025431275Z caller=server.go:1539 method=/frontend.Frontend/Process duration=343.21µs msg=gRPC err="queue is stopped"
2025-10-11 02:30:21.884 ..... same log multiple times .....
level=warn ts=2025-10-11T00:29:52.165841052Z caller=server.go:1539 method=/frontend.Frontend/Process duration=282.481µs msg=gRPC err="queue is stopped"
2025-10-11 02:29:52.153 level=warn ts=2025-10-11T00:29:52.153100985Z caller=server.go:1539 method=/frontend.Frontend/Process duration=261.76µs msg=gRPC err="queue is stopped"
2025-10-11 02:29:52.149 level=warn ts=2025-10-11T00:29:52.149313234Z caller=server.go:1539 method=/frontend.Frontend/Process duration=260.199µs msg=gRPC err="queue is stopped"
2025-10-11 02:29:52.149 level=warn ts=2025-10-11T00:29:52.149057059Z caller=server.go:1539 method=/frontend.Frontend/Process duration=355.81µs msg=gRPC err="queue is stopped"
2025-10-11 02:29:52.146 level=warn ts=2025-10-11T00:29:52.14696915Z caller=server.go:1539 method=/frontend.Frontend/Process duration=414.484µs msg=gRPC err="queue is stopped"
2025-10-11 02:29:52.146 level=warn ts=2025-10-11T00:29:52.146919925Z caller=server.go:1539 method=/frontend.Frontend/Process duration=374.843µs msg=gRPC err="queue is stopped"
2025-10-11 02:29:52.093 level=info ts=2025-10-11T00:29:52.093738737Z caller=module_service.go:120 msg="module stopped" module=overrides
2025-10-11 02:29:52.093 level=info ts=2025-10-11T00:29:52.093691152Z caller=module_service.go:120 msg="module stopped" module=memberlist-kv
2025-10-11 02:29:52.093 level=info ts=2025-10-11T00:29:52.093679631Z caller=module_service.go:120 msg="module stopped" module=overrides-api
2025-10-11 02:29:52.093 level=info ts=2025-10-11T00:29:52.093659119Z caller=module_service.go:120 msg="module stopped" module=query-frontend
2025-10-11 02:29:52.093 level=warn ts=2025-10-11T00:29:52.093637758Z caller=server.go:1539 method=/frontend.Frontend/Process duration=10h31m54.031021272s msg=gRPC err="queue is stopped"
2025-10-11 02:29:52.093 level=warn ts=2025-10-11T00:29:52.093621838Z caller=server.go:1539 method=/frontend.Frontend/Process duration=10h31m54.031005288s msg=gRPC err="queue is stopped"
2025-10-11 02:29:52.093 level=warn ts=2025-10-11T00:29:52.093590661Z caller=server.go:1539 method=/frontend.Frontend/Process duration=10h32m34.040844602s msg=gRPC err="queue is stopped"
2025-10-11 02:29:52.093 level=warn ts=2025-10-11T00:29:52.093473889Z caller=server.go:1539 method=/frontend.Frontend/Process duration=10h32m34.040857282s msg=gRPC err="queue is stopped"
2025-10-11 02:29:52.093 level=warn ts=2025-10-11T00:29:52.093457553Z caller=server.go:1539 method=/frontend.Frontend/Process duration=10h32m34.040658244s msg=gRPC err="queue is stopped"
2025-10-11 02:29:52.093 level=warn ts=2025-10-11T00:29:52.093443544Z caller=server.go:1539 method=/frontend.Frontend/Process duration=10h32m34.040680388s msg=gRPC err="queue is stopped"
2025-10-11 02:29:52.093 level=warn ts=2025-10-11T00:29:52.093427272Z caller=server.go:1539 method=/frontend.Frontend/Process duration=10h32m34.040643844s msg=gRPC err="queue is stopped"
2025-10-11 02:29:52.093 level=warn ts=2025-10-11T00:29:52.093412296Z caller=server.go:1539 method=/frontend.Frontend/Process duration=10h31m54.030680343s msg=gRPC err="queue is stopped"
2025-10-11 02:29:52.093 level=warn ts=2025-10-11T00:29:52.093391703Z caller=server.go:1539 method=/frontend.Frontend/Process duration=10h31m54.030704815s msg=gRPC err="queue is stopped"
2025-10-11 02:29:52.093 level=warn ts=2025-10-11T00:29:52.093245139Z caller=server.go:1539 method=/frontend.Frontend/Process duration=10h31m54.030534339s msg=gRPC err="queue is stopped"
2025-10-11 02:29:52.092 level=info ts=2025-10-11T00:29:52.09272838Z caller=app.go:239 msg="=== received SIGINT/SIGTERM ===\n*** exiting"
To Reproduce
Steps to reproduce the behavior:
- start Tempo from the Helm chart
- send SIGTERM to the query-frontend pod
Expected behavior
query-frontend should handle termination gracefully and exit properly.
Environment:
- Infrastructure: kubernetes
- Deployment tool: helm chart 1.47.1
Additional Context
This issue happened at 2:30 am, so there were no queries being executed and no activity at the time.
Perhaps this is also related to https://github.com/grafana/tempo/issues/2847
There is a "shutdown delay" configuration option here:
https://github.com/grafana/tempo/blob/main/cmd/tempo/app/config.go#L93
Is this being set by any chance?
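For reference, setting it would look roughly like this in tempo.yaml (a sketch based on the linked config.go; the field name and placement should be verified against your Tempo version):

```yaml
# Hypothetical example: delay shutdown so in-flight frontend work can drain
# before modules are stopped.
shutdown_delay: 30s
```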
We do not override it.
Here is the full tempo.yaml for reference (from the ConfigMap):
cache:
  caches:
  - memcached:
      consistent_hash: true
      host: 'tempo-memcached'
      service: memcached-client
      timeout: 500ms
    roles:
    - parquet-footer
    - bloom
    - frontend-search
compactor:
  compaction:
    block_retention: 2160h
    compacted_block_retention: 1h
    compaction_cycle: 30s
    compaction_window: 1h
    max_block_bytes: 1073741824
    max_compaction_objects: 6000000
    max_time_per_tenant: 5m
    retention_concurrency: 1
    v2_in_buffer_bytes: 5242880
    v2_out_buffer_bytes: 20971520
    v2_prefetch_traces_count: 100
  ring:
    kvstore:
      store: memberlist
distributor:
  receivers:
    otlp:
      protocols:
        http:
          endpoint: 0.0.0.0:4318
  ring:
    kvstore:
      store: memberlist
ingester:
  flush_all_on_shutdown: true
  flush_check_period: 5s
  lifecycler:
    ring:
      kvstore:
        store: memberlist
      replication_factor: 1
    tokens_file_path: /var/tempo/tokens.json
  max_block_bytes: 52428800
  max_block_duration: 1m
  trace_idle_period: 5s
memberlist:
  abort_if_cluster_join_fails: false
  bind_addr: []
  bind_port: 7946
  cluster_label: 'tempo-deployment.tempo'
  gossip_interval: 1s
  gossip_nodes: 2
  gossip_to_dead_nodes_time: 30s
  join_members:
  - dns+tempo-gossip-ring:7946
  leave_timeout: 5s
  left_ingesters_timeout: 5m
  max_join_backoff: 1m
  max_join_retries: 10
  min_join_backoff: 1s
  node_name: ""
  packet_dial_timeout: 5s
  packet_write_timeout: 5s
  pull_push_interval: 30s
  randomize_node_name: true
  rejoin_interval: 0s
  retransmit_factor: 2
  stream_timeout: 10s
metrics_generator:
  metrics_ingestion_time_range_slack: 30s
  processor:
    service_graphs:
      dimensions: []
      histogram_buckets:
      - 0.1
      - 0.2
      - 0.4
      - 0.8
      - 1.6
      - 3.2
      - 6.4
      - 12.8
      max_items: 10000
      wait: 10s
      workers: 10
    span_metrics:
      dimensions: []
      histogram_buckets:
      - 0.002
      - 0.004
      - 0.008
      - 0.016
      - 0.032
      - 0.064
      - 0.128
      - 0.256
      - 0.512
      - 1.02
      - 2.05
      - 4.1
  registry:
    collection_interval: 15s
    external_labels: {}
    stale_duration: 15m
  ring:
    kvstore:
      store: memberlist
  storage:
    path: /var/tempo/wal
    remote_write:
    - url: http://mimir-gateway.mimir.svc.cluster.local:80/api/v1/push
    remote_write_add_org_id_header: true
    remote_write_flush_deadline: 1m
  traces_storage:
    path: /var/tempo/traces
multitenancy_enabled: true
overrides:
  defaults:
    ingestion:
      burst_size_bytes: 500000
      rate_limit_bytes: 500000
      rate_strategy: global
  per_tenant_override_config: /runtime-config/overrides.yaml
querier:
  frontend_worker:
    frontend_address: tempo-query-frontend-discovery:9095
  max_concurrent_queries: 10
  search:
    query_timeout: 30s
  trace_by_id:
    query_timeout: 10s
query_frontend:
  max_outstanding_per_tenant: 50
  max_retries: 2
  metrics:
    concurrent_jobs: 1000
    duration_slo: 0s
    interval: 5m
    max_duration: 3h
    query_backend_after: 30m
    target_bytes_per_job: 104857600
    throughput_bytes_slo: 0
  search:
    concurrent_jobs: 10
    max_result_limit: 100
    max_spans_per_span_set: 100
    query_backend_after: 0
    query_ingesters_until: 0
    target_bytes_per_job: 52428800
  trace_by_id:
    query_shards: 50
server:
  grpc_server_max_recv_msg_size: 8388608
  grpc_server_max_send_msg_size: 8388608
  http_listen_port: 3200
  http_server_read_timeout: 30s
  http_server_write_timeout: 30s
  log_format: logfmt
  log_level: info
storage:
  trace:
    azure:
      container_name: <REDACTED>
      endpoint_suffix: blob.core.windows.net
      storage_account_key: <REDACTED>
      storage_account_name: <REDACTED>
    backend: azure
    blocklist_poll: 5m
    local:
      path: /var/tempo/traces
    pool:
      max_workers: 400
      queue_depth: 20000
    search:
      prefetch_trace_count: 1000
    wal:
      path: /var/tempo/wal
stream_over_http_enabled: true
usage_report:
  reporting_enabled: false
Thanks for providing your config! We just merged this. Perhaps it helps?
https://github.com/grafana/tempo/pull/5725#event-20305782224
Thanks, I will test it once the change is released :)