Kafka source issues after upgrade to edge: PollExceeded errors and consumption problems
**Describe the bug**
After upgrading from commit qw-airmail-20250522-hotfix (488375a9) to edge (660388a42756a739d0ef0aecd234ca953b85caf5), we are experiencing Kafka source failures.
```
2025-12-09T07:45:58.757Z ERROR quickwit_actors::actor_context: exit activating-kill-switch actor=SourceActor-proud-RYA5 exit_status=Failure(Message consumption error: PollExceeded (Local: Maximum application poll interval (max.poll.interval.ms) exceeded))
2025-12-09T07:45:58.757Z ERROR quickwit_actors::spawn_builder: actor-exit actor_id="SourceActor-proud-RYA5" phase=handling(quickwit_indexing::source::Loop) exit_status=Failure(Message consumption error: PollExceeded (Local: Maximum application poll interval (max.poll.interval.ms) exceeded))
2025-12-09T07:45:58.476Z ERROR rdkafka::client: librdkafka: Global error: PollExceeded (Local: Maximum application poll interval (max.poll.interval.ms) exceeded): Application maximum poll interval (600000ms) exceeded by 16ms
2025-12-09T07:45:50.710Z ERROR quickwit_actors::actor_context: exit activating-kill-switch actor=SourceActor-polished-D8qx exit_status=Failure(Message consumption error: PollExceeded (Local: Maximum application poll interval (max.poll.interval.ms) exceeded))
2025-12-09T07:45:50.710Z ERROR quickwit_actors::spawn_builder: actor-exit actor_id="SourceActor-polished-D8qx" phase=handling(quickwit_indexing::source::Loop) exit_status=Failure(Message consumption error: PollExceeded (Local: Maximum application poll interval (max.poll.interval.ms) exceeded))
2025-12-09T07:45:50.710Z ERROR rdkafka::client: librdkafka: Global error: PollExceeded (Local: Maximum application poll interval (max.poll.interval.ms) exceeded): Application maximum poll interval (600000ms) exceeded by 185ms
```
**Steps to reproduce (if applicable)**
1. Run indexing under sustained high load where backpressure occurs.
2. Observe `PollExceeded` errors or consumption stalls.
**Expected behavior**
The Kafka source should continue consuming messages under high load without `max.poll.interval.ms` exceeded errors, as it did in the previous version (qw-airmail-20250522-hotfix).
**Configuration**
- Quickwit version: edge (660388a42756a739d0ef0aecd234ca953b85caf5)
- index_config.yaml:
```yaml
version: 0.8
index_id: log.common.access_log_v2_quickwit
doc_mapping:
  field_mappings:
    - name: id
      type: text
      tokenizer: raw
      description: "unique identifier for the event"
    - name: specversion
      type: text
      stored: false
      indexed: false
      description: "version information about the CloudEvents specification"
    - name: source
      type: text
      tokenizer: raw
      description: "information about where the event occurred"
    - name: subject
      type: text
      tokenizer: raw
      description: "detailed information about the source where the event occurred"
    - name: time
      type: datetime
      input_formats:
        - unix_timestamp
        - iso8601
      output_format: unix_timestamp_nanos
      fast: true
      description: "timestamp of the event"
    - name: datacontenttype
      type: text
      tokenizer: raw
      description: "content type of the data"
    - name: requestId
      type: text
      tokenizer: raw
      description: "request id"
    - name: ip
      type: ip
      fast: true
      description: "ip address"
    - name: userAgent
      type: text
      tokenizer: default
    - name: xUserAgent
      type: text
      tokenizer: default
    - name: userId
      type: text
      tokenizer: raw
    - name: deviceId
      type: text
      tokenizer: raw
      description: "device's id, e.g. 6ae1f6a6-107d-3183-a1df-adf7368f9d10"
    - name: latency
      type: text
      indexed: false
      stored: false
      description: "latency of the request; already available as latencyNs, so we don't index it"
    - name: latencyNs
      type: i64
      fast: true
      description: "latency of the request in nanoseconds"
    - name: occurredAt
      type: datetime
      input_formats:
        - unix_timestamp
        - iso8601
      output_format: unix_timestamp_nanos
      description: "timestamp of the log entry"
    - name: httpStatusCode
      type: i64
      fast: true
      description: "HTTP status code"
    - name: httpMethod
      type: text
      fast: true
      description: "HTTP method"
    - name: httpPath
      type: text
      fast: true
      description: "HTTP path"
    - name: grpcStatusCode
      type: i64
      fast: true
      description: "gRPC status code"
    - name: grpcMethod
      type: text
      fast: true
      description: "gRPC method"
    - name: extra
      type: json
      expand_dots: true
      description: "extra fields"
    - name: kafkaConsumerGroupId
      type: text
      fast: false
      description: "Kafka Consumer Group Id"
    - name: kafkaConsumerClientId
      type: text
      fast: false
      description: "Kafka Consumer Client Id"
    - name: kafkaConsumerHostName
      type: text
      fast: false
      description: "Kafka Consumer Host Name"
    - name: kafkaTopic
      type: text
      fast: false
      description: "Kafka Topic"
    - name: kafkaPartition
      type: i64
      description: "Kafka Partition"
    - name: kafkaOffset
      type: i64
      description: "Kafka Offset"
    - name: kafkaMessageKey
      type: text
      tokenizer: raw
      description: "Kafka Message Key"
    - name: kafkaConsumingResult
      type: text
      fast: true
      description: "Kafka Consuming Result"
    - name: env
      type: text
      tokenizer: raw
      indexed: true
      description: "represents environmental information as an extension field in the CloudEvents specification"
    - name: region
      type: text
      tokenizer: raw
      fast: true
      indexed: true
      description: "represents region information as an extension field in the CloudEvents specification"
    - name: namespace
      type: text
      tokenizer: raw
      fast: true
      indexed: true
      description: "represents namespace information as an extension field in the CloudEvents specification"
  timestamp_field: time
  tag_fields: [region]
  partition_key: namespace
search_settings:
  default_search_fields: [id, requestId, userId, extra.grpc.headers.x-request-id]
indexing_settings:
  merge_policy:
    type: "stable_log"
    merge_factor: 10
    max_merge_factor: 12
    maturation_period: 48h
  commit_timeout_secs: 30
retention:
  period: 30 days
  schedule: daily
```
As a temporary workaround, I increased `max.poll.interval.ms` from 5 minutes to 10 minutes. This reduced the frequency of the errors, but they still occur.
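For reference, a minimal sketch of where that override lives in a Quickwit Kafka source config; entries under `client_params` are passed through to librdkafka, whose default for `max.poll.interval.ms` is 300000 ms (5 minutes). The `source_id`, `topic`, and `bootstrap.servers` values below are placeholders, not our actual ones:

```yaml
version: 0.8
source_id: access-log-kafka-source   # placeholder
source_type: kafka
params:
  topic: access-log-topic            # placeholder
  client_params:
    bootstrap.servers: broker-1:9092 # placeholder
    # librdkafka setting, raised from the 300000 ms (5 min) default to 10 min
    max.poll.interval.ms: 600000
```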
We are running even more recent versions and have never seen such an issue. This means that either Kafka is struggling to process requests or, more likely, the indexing pipeline is saturating. Basically, the source stops polling when it cannot send anything to the doc processor, which is in turn limited by the rest of the indexing pipeline. You can see the actor backpressure with the `rate(quickwit_indexing_backpressure_micros[1m])` metric.
@rdettai-sk Thank you. As you said, backpressure increased a lot after the version upgrade. Are there any likely causes we can infer?
Backpressure increased on the doc_processor, indexer, and merge_executor actors. I noticed that #5898 changed the default for `QW_DISABLE_TOKIO_LIFO_SLOT` from false to true. Could this be contributing to the increased backpressure?
It seems fairly unlikely. If you want to be sure, you would need to revert to just before that change. How many indexes/sources do you have?
I have about 10 indexes and sources. The problem is that one index is ingesting at about 800 megabytes per second, and the current number of partitions and pipelines is 54. The metastore schema changed in the migration, making it difficult to roll back to a previous version.
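Back-of-the-envelope, from those numbers: 800 MB/s across 54 pipelines is roughly 15 MB/s per pipeline, so even a modest per-pipeline throughput regression after the upgrade could be enough to stall polling past the interval.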
I'm surprised this ever worked for you without https://github.com/quickwit-oss/quickwit/pull/5808
@rdettai-sk We waited for it to be merged into main, and we're planning to try it ourselves soon. For reference, the issue above has not recurred since we deployed with the `QW_DISABLE_TOKIO_LIFO_SLOT` environment variable set to `false`.
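For anyone else hitting this, a sketch of how that can be set, assuming a Kubernetes deployment (the manifest fragment below is illustrative, not our actual manifest):

```yaml
# Illustrative fragment of an indexer Deployment pod spec
spec:
  containers:
    - name: quickwit-indexer
      image: quickwit/quickwit:edge   # placeholder tag
      env:
        # Restores the pre-#5898 default; resolved the PollExceeded errors for us
        - name: QW_DISABLE_TOKIO_LIFO_SLOT
          value: "false"
```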