
Stop signal is not handled gracefully by query-frontend

Open ivanmartos opened this issue 6 months ago • 4 comments

Describe the bug

When Tempo is deployed on Kubernetes via the Helm chart and Kubernetes sends a SIGTERM signal to the query-frontend pod (due to preemption, in order to admit a critical pod), the query-frontend pod fails to stop gracefully within the default 30-second window. Kubernetes then has to kill it with SIGKILL, and the pod is left hanging in an error state. As seen in the logs, multiple modules are stopped, but the pod keeps logging the message below until it is killed:

    level=warn ts=2025-10-11T00:29:52.146919925Z caller=server.go:1539 method=/frontend.Frontend/Process duration=374.843µs msg=gRPC err="queue is stopped"

Full log (most recent entry first; duplicate log-viewer timestamps removed):

    level=warn ts=2025-10-11T00:30:22.086496213Z caller=server.go:1539 method=/frontend.Frontend/Process duration=332.034µs msg=gRPC err="queue is stopped"
    level=warn ts=2025-10-11T00:30:22.025431275Z caller=server.go:1539 method=/frontend.Frontend/Process duration=343.21µs msg=gRPC err="queue is stopped"

    ..... same log multiple times .....

    level=warn ts=2025-10-11T00:29:52.165841052Z caller=server.go:1539 method=/frontend.Frontend/Process duration=282.481µs msg=gRPC err="queue is stopped"
    level=warn ts=2025-10-11T00:29:52.153100985Z caller=server.go:1539 method=/frontend.Frontend/Process duration=261.76µs msg=gRPC err="queue is stopped"
    level=warn ts=2025-10-11T00:29:52.149313234Z caller=server.go:1539 method=/frontend.Frontend/Process duration=260.199µs msg=gRPC err="queue is stopped"
    level=warn ts=2025-10-11T00:29:52.149057059Z caller=server.go:1539 method=/frontend.Frontend/Process duration=355.81µs msg=gRPC err="queue is stopped"
    level=warn ts=2025-10-11T00:29:52.14696915Z caller=server.go:1539 method=/frontend.Frontend/Process duration=414.484µs msg=gRPC err="queue is stopped"
    level=warn ts=2025-10-11T00:29:52.146919925Z caller=server.go:1539 method=/frontend.Frontend/Process duration=374.843µs msg=gRPC err="queue is stopped"
    level=info ts=2025-10-11T00:29:52.093738737Z caller=module_service.go:120 msg="module stopped" module=overrides
    level=info ts=2025-10-11T00:29:52.093691152Z caller=module_service.go:120 msg="module stopped" module=memberlist-kv
    level=info ts=2025-10-11T00:29:52.093679631Z caller=module_service.go:120 msg="module stopped" module=overrides-api
    level=info ts=2025-10-11T00:29:52.093659119Z caller=module_service.go:120 msg="module stopped" module=query-frontend
    level=warn ts=2025-10-11T00:29:52.093637758Z caller=server.go:1539 method=/frontend.Frontend/Process duration=10h31m54.031021272s msg=gRPC err="queue is stopped"
    level=warn ts=2025-10-11T00:29:52.093621838Z caller=server.go:1539 method=/frontend.Frontend/Process duration=10h31m54.031005288s msg=gRPC err="queue is stopped"
    level=warn ts=2025-10-11T00:29:52.093590661Z caller=server.go:1539 method=/frontend.Frontend/Process duration=10h32m34.040844602s msg=gRPC err="queue is stopped"
    level=warn ts=2025-10-11T00:29:52.093473889Z caller=server.go:1539 method=/frontend.Frontend/Process duration=10h32m34.040857282s msg=gRPC err="queue is stopped"
    level=warn ts=2025-10-11T00:29:52.093457553Z caller=server.go:1539 method=/frontend.Frontend/Process duration=10h32m34.040658244s msg=gRPC err="queue is stopped"
    level=warn ts=2025-10-11T00:29:52.093443544Z caller=server.go:1539 method=/frontend.Frontend/Process duration=10h32m34.040680388s msg=gRPC err="queue is stopped"
    level=warn ts=2025-10-11T00:29:52.093427272Z caller=server.go:1539 method=/frontend.Frontend/Process duration=10h32m34.040643844s msg=gRPC err="queue is stopped"
    level=warn ts=2025-10-11T00:29:52.093412296Z caller=server.go:1539 method=/frontend.Frontend/Process duration=10h31m54.030680343s msg=gRPC err="queue is stopped"
    level=warn ts=2025-10-11T00:29:52.093391703Z caller=server.go:1539 method=/frontend.Frontend/Process duration=10h31m54.030704815s msg=gRPC err="queue is stopped"
    level=warn ts=2025-10-11T00:29:52.093245139Z caller=server.go:1539 method=/frontend.Frontend/Process duration=10h31m54.030534339s msg=gRPC err="queue is stopped"
    level=info ts=2025-10-11T00:29:52.09272838Z caller=app.go:239 msg="=== received SIGINT/SIGTERM ===\n*** exiting"

To Reproduce

Steps to reproduce the behavior:

  1. start Tempo from the Helm chart
  2. send SIGTERM to the query-frontend pod

Expected behavior

query-frontend should handle termination gracefully and exit properly.

Environment:

  • Infrastructure: kubernetes
  • Deployment tool: helm chart 1.47.1
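As a stopgap while the hang exists, one mitigation is to give the pod more time before Kubernetes escalates to SIGKILL by raising `terminationGracePeriodSeconds` (which defaults to the 30 s window mentioned above) on the query-frontend pod spec. A minimal sketch, assuming the deployment name and the 120 s value are illustrative (how to set this through the Helm chart depends on the chart's values):

```yaml
# Sketch: extend the window Kubernetes waits between SIGTERM and SIGKILL.
# The deployment name and the 120s value are illustrative, not taken from the chart.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tempo-query-frontend
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 120  # default is 30
```

Note this only buys time for shutdown; it does not fix the hang itself.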

Additional Context

This issue happened at 2:30 AM, so there were no queries being executed and no other activity at the time.

Perhaps this is also related to https://github.com/grafana/tempo/issues/2847

ivanmartos avatar Oct 13 '25 10:10 ivanmartos

There is a "shutdown delay" configuration option here:

https://github.com/grafana/tempo/blob/main/cmd/tempo/app/config.go#L93

Is this being set by any chance?
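For reference, assuming the field in the linked struct maps to a top-level `shutdown_delay` key (an assumption from the linked source, not a verified YAML path), setting it would look roughly like:

```yaml
# Assumed top-level key based on the linked config struct; verify the exact
# YAML path and default against your Tempo version before relying on this.
shutdown_delay: 30s
```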

joe-elliott avatar Oct 13 '25 21:10 joe-elliott

We do not override it.

Here is the full tempo.yaml for reference (from the ConfigMap):

    cache:
      caches:
      - memcached:
          consistent_hash: true
          host: 'tempo-memcached'
          service: memcached-client
          timeout: 500ms
        roles:
        - parquet-footer
        - bloom
        - frontend-search
    compactor:
      compaction:
        block_retention: 2160h
        compacted_block_retention: 1h
        compaction_cycle: 30s
        compaction_window: 1h
        max_block_bytes: 1073741824
        max_compaction_objects: 6000000
        max_time_per_tenant: 5m
        retention_concurrency: 1
        v2_in_buffer_bytes: 5242880
        v2_out_buffer_bytes: 20971520
        v2_prefetch_traces_count: 100
      ring:
        kvstore:
          store: memberlist
    distributor:
      receivers:
        otlp:
          protocols:
            http:
              endpoint: 0.0.0.0:4318
      ring:
        kvstore:
          store: memberlist
    ingester:
      flush_all_on_shutdown: true
      flush_check_period: 5s
      lifecycler:
        ring:
          kvstore:
            store: memberlist
          replication_factor: 1
        tokens_file_path: /var/tempo/tokens.json
      max_block_bytes: 52428800
      max_block_duration: 1m
      trace_idle_period: 5s
    memberlist:
      abort_if_cluster_join_fails: false
      bind_addr: []
      bind_port: 7946
      cluster_label: 'tempo-deployment.tempo'
      gossip_interval: 1s
      gossip_nodes: 2
      gossip_to_dead_nodes_time: 30s
      join_members:
      - dns+tempo-gossip-ring:7946
      leave_timeout: 5s
      left_ingesters_timeout: 5m
      max_join_backoff: 1m
      max_join_retries: 10
      min_join_backoff: 1s
      node_name: ""
      packet_dial_timeout: 5s
      packet_write_timeout: 5s
      pull_push_interval: 30s
      randomize_node_name: true
      rejoin_interval: 0s
      retransmit_factor: 2
      stream_timeout: 10s
    metrics_generator:
      metrics_ingestion_time_range_slack: 30s
      processor:
        service_graphs:
          dimensions: []
          histogram_buckets:
          - 0.1
          - 0.2
          - 0.4
          - 0.8
          - 1.6
          - 3.2
          - 6.4
          - 12.8
          max_items: 10000
          wait: 10s
          workers: 10
        span_metrics:
          dimensions: []
          histogram_buckets:
          - 0.002
          - 0.004
          - 0.008
          - 0.016
          - 0.032
          - 0.064
          - 0.128
          - 0.256
          - 0.512
          - 1.02
          - 2.05
          - 4.1
      registry:
        collection_interval: 15s
        external_labels: {}
        stale_duration: 15m
      ring:
        kvstore:
          store: memberlist
      storage:
        path: /var/tempo/wal
        remote_write:
        - url: http://mimir-gateway.mimir.svc.cluster.local:80/api/v1/push
        remote_write_add_org_id_header: true
        remote_write_flush_deadline: 1m
      traces_storage:
        path: /var/tempo/traces
    multitenancy_enabled: true
    overrides:
      defaults:
        ingestion:
          burst_size_bytes: 500000
          rate_limit_bytes: 500000
          rate_strategy: global
      per_tenant_override_config: /runtime-config/overrides.yaml
    querier:
      frontend_worker:
        frontend_address: tempo-query-frontend-discovery:9095
      max_concurrent_queries: 10
      search:
        query_timeout: 30s
      trace_by_id:
        query_timeout: 10s
    query_frontend:
      max_outstanding_per_tenant: 50
      max_retries: 2
      metrics:
        concurrent_jobs: 1000
        duration_slo: 0s
        interval: 5m
        max_duration: 3h
        query_backend_after: 30m
        target_bytes_per_job: 104857600
        throughput_bytes_slo: 0
      search:
        concurrent_jobs: 10
        max_result_limit: 100
        max_spans_per_span_set: 100
        query_backend_after: 0
        query_ingesters_until: 0
        target_bytes_per_job: 52428800
      trace_by_id:
        query_shards: 50
    server:
      grpc_server_max_recv_msg_size: 8388608
      grpc_server_max_send_msg_size: 8388608
      http_listen_port: 3200
      http_server_read_timeout: 30s
      http_server_write_timeout: 30s
      log_format: logfmt
      log_level: info
    storage:
      trace:
        azure:
          container_name: <REDACTED>
          endpoint_suffix: blob.core.windows.net
          storage_account_key: <REDACTED>
          storage_account_name: <REDACTED>
        backend: azure
        blocklist_poll: 5m
        local:
          path: /var/tempo/traces
        pool:
          max_workers: 400
          queue_depth: 20000
        search:
          prefetch_trace_count: 1000
        wal:
          path: /var/tempo/wal
    stream_over_http_enabled: true
    usage_report:
      reporting_enabled: false

ivanmartos avatar Oct 14 '25 06:10 ivanmartos

Thanks for providing your config! We just merged this. Perhaps it helps?

https://github.com/grafana/tempo/pull/5725#event-20305782224

joe-elliott avatar Oct 16 '25 14:10 joe-elliott

thx, I will test it once the change is released :)

ivanmartos avatar Oct 17 '25 10:10 ivanmartos

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity. Please apply keepalive label to exempt this Issue.

github-actions[bot] avatar Dec 17 '25 00:12 github-actions[bot]