Kafka source issues after upgrade to edge: PollExceeded errors and consumption problems
**Describe the bug**
After upgrading from commit qw-airmail-20250522-hotfix (488375a9) to edge (660388a42756a739d0ef0aecd234ca953b85caf5), we are experiencing Kafka source failures.
```
2025-12-09T07:45:58.757Z ERROR quickwit_actors::actor_context: exit activating-kill-switch actor=SourceActor-proud-RYA5 exit_status=Failure(Message consumption error: PollExceeded (Local: Maximum application poll interval (max.poll.interval.ms) exceeded))
2025-12-09T07:45:58.757Z ERROR quickwit_actors::spawn_builder: actor-exit actor_id="SourceActor-proud-RYA5" phase=handling(quickwit_indexing::source::Loop) exit_status=Failure(Message consumption error: PollExceeded (Local: Maximum application poll interval (max.poll.interval.ms) exceeded))
2025-12-09T07:45:58.476Z ERROR rdkafka::client: librdkafka: Global error: PollExceeded (Local: Maximum application poll interval (max.poll.interval.ms) exceeded): Application maximum poll interval (600000ms) exceeded by 16ms
2025-12-09T07:45:50.710Z ERROR quickwit_actors::actor_context: exit activating-kill-switch actor=SourceActor-polished-D8qx exit_status=Failure(Message consumption error: PollExceeded (Local: Maximum application poll interval (max.poll.interval.ms) exceeded))
2025-12-09T07:45:50.710Z ERROR quickwit_actors::spawn_builder: actor-exit actor_id="SourceActor-polished-D8qx" phase=handling(quickwit_indexing::source::Loop) exit_status=Failure(Message consumption error: PollExceeded (Local: Maximum application poll interval (max.poll.interval.ms) exceeded))
2025-12-09T07:45:50.710Z ERROR rdkafka::client: librdkafka: Global error: PollExceeded (Local: Maximum application poll interval (max.poll.interval.ms) exceeded): Application maximum poll interval (600000ms) exceeded by 185ms
```
**Steps to reproduce (if applicable)**
1. Run indexing under sustained high load where backpressure occurs.
2. Observe `PollExceeded` errors or consumption stalls.
**Expected behavior**
The Kafka source should continue consuming messages under high load without `max.poll.interval.ms` exceeded errors, as it did in the previous version (qw-airmail-20250522-hotfix).
**Configuration**
- Quickwit version: edge (660388a42756a739d0ef0aecd234ca953b85caf5)
- index_config.yaml:
```yaml
version: 0.8
index_id: log.common.access_log_v2_quickwit
doc_mapping:
  field_mappings:
    - name: id
      type: text
      tokenizer: raw
      description: "unique identifier for the event"
    - name: specversion
      type: text
      stored: false
      indexed: false
      description: "version information about the CloudEvents specification"
    - name: source
      type: text
      tokenizer: raw
      description: "information about where the event occurred"
    - name: subject
      type: text
      tokenizer: raw
      description: "detailed information about the source where the event occurred"
    - name: time
      type: datetime
      input_formats:
        - unix_timestamp
        - iso8601
      output_format: unix_timestamp_nanos
      fast: true
      description: "timestamp of the event"
    - name: datacontenttype
      type: text
      tokenizer: raw
      description: "content type of the data"
    - name: requestId
      type: text
      tokenizer: raw
      description: "request id"
    - name: ip
      type: ip
      fast: true
      description: "ip address"
    - name: userAgent
      type: text
      tokenizer: default
    - name: xUserAgent
      type: text
      tokenizer: default
    - name: userId
      type: text
      tokenizer: raw
    - name: deviceId
      type: text
      tokenizer: raw
      description: "device's id, e.g. 6ae1f6a6-107d-3183-a1df-adf7368f9d10"
    - name: latency
      type: text
      indexed: false
      stored: false
      description: "latency of the request; already available as latencyNs, so we don't index it"
    - name: latencyNs
      type: i64
      fast: true
      description: "latency of the request in nanoseconds"
    - name: occurredAt
      type: datetime
      input_formats:
        - unix_timestamp
        - iso8601
      output_format: unix_timestamp_nanos
      description: "timestamp of the log entry"
    - name: httpStatusCode
      type: i64
      fast: true
      description: "HTTP status code"
    - name: httpMethod
      type: text
      fast: true
      description: "HTTP method"
    - name: httpPath
      type: text
      fast: true
      description: "HTTP path"
    - name: grpcStatusCode
      type: i64
      fast: true
      description: "gRPC status code"
    - name: grpcMethod
      type: text
      fast: true
      description: "gRPC method"
    - name: extra
      type: json
      expand_dots: true
      description: "extra fields"
    - name: kafkaConsumerGroupId
      type: text
      fast: false
      description: "Kafka Consumer Group Id"
    - name: kafkaConsumerClientId
      type: text
      fast: false
      description: "Kafka Consumer Client Id"
    - name: kafkaConsumerHostName
      type: text
      fast: false
      description: "Kafka Consumer Host Name"
    - name: kafkaTopic
      type: text
      fast: false
      description: "Kafka Topic"
    - name: kafkaPartition
      type: i64
      description: "Kafka Partition"
    - name: kafkaOffset
      type: i64
      description: "Kafka Offset"
    - name: kafkaMessageKey
      type: text
      tokenizer: raw
      description: "Kafka Message Key"
    - name: kafkaConsumingResult
      type: text
      fast: true
      description: "Kafka Consuming Result"
    - name: env
      type: text
      tokenizer: raw
      indexed: true
      description: "represents environmental information as an extension field in the CloudEvents specification"
    - name: region
      type: text
      tokenizer: raw
      fast: true
      indexed: true
      description: "represents region information as an extension field in the CloudEvents specification"
    - name: namespace
      type: text
      tokenizer: raw
      fast: true
      indexed: true
      description: "represents namespace information as an extension field in the CloudEvents specification"
  timestamp_field: time
  tag_fields: [region]
  partition_key: namespace
search_settings:
  default_search_fields: [id, requestId, userId, extra.grpc.headers.x-request-id]
indexing_settings:
  merge_policy:
    type: "stable_log"
    merge_factor: 10
    max_merge_factor: 12
    maturation_period: 48h
  commit_timeout_secs: 30
retention:
  period: 30 days
  schedule: daily
```
As a temporary workaround, I increased `max.poll.interval.ms` from 5 minutes to 10 minutes. This reduced the frequency of the errors, but they still occur.
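For reference, a minimal sketch of where that override lives in a Quickwit Kafka source config; entries under `client_params` are passed through to librdkafka, whose default for `max.poll.interval.ms` is 300000 ms (5 minutes). The `source_id`, `topic`, and `bootstrap.servers` values below are placeholders, not our actual ones:

```yaml
version: 0.8
source_id: access-log-kafka-source   # placeholder
source_type: kafka
params:
  topic: access-log-topic            # placeholder
  client_params:
    bootstrap.servers: broker-1:9092 # placeholder
    # librdkafka setting, raised from the 300000 ms (5 min) default to 10 min
    max.poll.interval.ms: 600000
```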
We are running even more recent versions and have never seen such an issue. This means that either Kafka is struggling to process requests or, more likely, the indexing pipeline is saturating. Basically, the source stops polling when it cannot send anything to the doc processor, which is in turn limited by the rest of the indexing pipeline. You can see the actor backpressure with the `rate(quickwit_indexing_backpressure_micros[1m])` metric.
@rdettai-sk Thank you. As you said, backpressure increased a lot after the version upgrade. Are there any likely causes we can infer?
Backpressure increased on the doc_processor, indexer, and merge_executor actors. I noticed that #5898 changed the default for `QW_DISABLE_TOKIO_LIFO_SLOT` from false to true. Could this be contributing to the increased backpressure?
It seems fairly unlikely. If you want to be sure, you would need to revert to just before that change. How many indexes/sources do you have?
I have about 10 indexes and sources. The problem is that one index is ingesting at about 800 megabytes per second, and the current number of partitions and pipelines is 54. The metastore schema changed in the migration, making it difficult to roll back to a previous version.
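Back-of-the-envelope, from those numbers: 800 MB/s across 54 pipelines is roughly 15 MB/s per pipeline, so even a modest per-pipeline throughput regression after the upgrade could be enough to stall polling past the interval.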
I'm surprised this ever worked for you without https://github.com/quickwit-oss/quickwit/pull/5808
@rdettai-sk We waited for it to be merged into main, and we're planning to try it ourselves soon. For reference, the issue above has not recurred since we deployed with the `QW_DISABLE_TOKIO_LIFO_SLOT` environment variable set to `false`.
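For anyone else hitting this, a sketch of how that can be set, assuming a Kubernetes deployment (the manifest fragment below is illustrative, not our actual manifest):

```yaml
# Illustrative fragment of an indexer Deployment pod spec
spec:
  containers:
    - name: quickwit-indexer
      image: quickwit/quickwit:edge   # placeholder tag
      env:
        # Restores the pre-#5898 default; resolved the PollExceeded errors for us
        - name: QW_DISABLE_TOKIO_LIFO_SLOT
          value: "false"
```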