Cannot consume topics after 1 week of ingestion
After a week of ingestion, our Redpanda cluster started refusing new consumers with the following errors:
$ rpk topic consume messages
Got an error consuming topic 'messages', partition 4: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.
Got an error consuming topic 'messages', partition 5: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.
Got an error consuming topic 'messages', partition 3: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.
Got an error consuming topic 'messages', partition 2: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.
Got an error consuming topic 'messages', partition 9: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.
Got an error consuming topic 'messages', partition 7: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.
Got an error consuming topic 'messages', partition 0: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.
Got an error consuming topic 'messages', partition 8: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.
Got an error consuming topic 'messages', partition 1: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.
Got an error consuming topic 'messages', partition 6: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.
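Not part of the original report, but a minimal consumer sketch for the same symptom, assuming the confluent-kafka Python client and the internal 127.0.0.1:9092 listener from the config further down; the group id is made up. With auto.offset.reset set to earliest, a fetch for an offset the broker no longer holds falls back to the oldest retained record instead of erroring out:

from confluent_kafka import Consumer

consumer = Consumer({
    'bootstrap.servers': '127.0.0.1:9092',  # internal listener from redpanda.yaml
    'group.id': 'debug-consumer',           # hypothetical group id
    'auto.offset.reset': 'earliest',        # fall back to the oldest retained offset
})
consumer.subscribe(['messages'])

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            # an offset-out-of-range fetch that cannot be reset surfaces here
            print('consume error:', msg.error())
            continue
        print('partition', msg.partition(), 'offset', msg.offset(), msg.value())
finally:
    consumer.close()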
How to reproduce
While I'm not exactly sure what caused the issue, here's how the topic was generated:
- Create the topic messages by running rpk topic create messages -p 10
- Produce 4-5 million events into the topic (what we had over a week of ingestion); see the producer sketch below
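A rough reproducer sketch (mine, not the reporter's), assuming the confluent-kafka Python client against the internal listener; the payload shape and event count are placeholders for roughly a week of ingestion:

import json
from confluent_kafka import Producer

producer = Producer({'bootstrap.servers': '127.0.0.1:9092'})
NUM_EVENTS = 5_000_000  # roughly what a week of ingestion produced

for i in range(NUM_EVENTS):
    while True:
        try:
            # payload is a made-up stand-in for the real events
            producer.produce('messages', value=json.dumps({'seq': i}).encode())
            break
        except BufferError:
            producer.poll(0.5)  # local queue full; let deliveries drain
    producer.poll(0)

producer.flush()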
Resolution
Updating from version v21.6.1 (rev 55d3c65) to v21.6.2 (rev 7ab90dc) solved the issue, but I'm not 100% sure why.
Configuration
redpanda.yaml
cluster_id: redpanda
config_file: /etc/redpanda/redpanda.yaml
node_uuid: ma7pUXWT8FMAvcC9GNNsdiW5jCYuTCaootqQYyMVVehHxcutZ
organization: redpanda-callbell
pandaproxy: {}
redpanda:
  admin:
  - address: 0.0.0.0
    port: 9644
  advertised_kafka_api:
  - address: 127.0.0.1
    name: internal
    port: 9092
  - address: 34.240.138.136
    name: external
    port: 9093
  advertised_rpc_api:
    address: 34.240.138.136
    port: 33145
  auto_create_topics_enabled: false
  data_directory: /var/lib/redpanda/data
  developer_mode: false
  kafka_api:
  - address: 127.0.0.1
    name: internal
    port: 9092
  - address: 0.0.0.0
    name: external
    port: 9093
  kafka_api_tls:
  - cert_file: /etc/redpanda/certs/vectorized.crt
    enabled: true
    key_file: /etc/redpanda/certs/vectorized.key
    name: external
    require_client_auth: true
    truststore_file: /etc/redpanda/certs/vectorized_ca.crt
  node_id: 1
  rpc_server:
    address: 127.0.0.1
    port: 33145
  seed_servers: []
rpk:
  coredump_dir: /var/lib/redpanda/coredump
  enable_memory_locking: false
  enable_usage_stats: false
  overprovisioned: false
  tune_aio_events: true
  tune_clocksource: true
  tune_coredump: false
  tune_cpu: true
  tune_disk_irq: true
  tune_disk_nomerges: true
  tune_disk_scheduler: true
  tune_disk_write_cache: true
  tune_fstrim: true
  tune_network: true
  tune_swappiness: true
  tune_transparent_hugepages: false
io-config.yaml
- mountpoint: /mnt/vectorized
  read_iops: 2999
  read_bandwidth: 131289336
  write_iops: 3048
  write_bandwidth: 131588848
cc @twmb @0x5d @senior7515
For added context, kafkacat also could not consume the topic:
kafkacat -C -b localhost:9092 -t messages
%3|1623962486.455|FAIL|rdkafka#consumer-1| 127.0.0.1:9092/1: Receive failed: Disconnected
% ERROR: Local: Broker transport failure: 127.0.0.1:9092/1: Receive failed: Disconnected
%3|1623962486.709|FAIL|rdkafka#consumer-1| 127.0.0.1:9092/1: Receive failed: Disconnected
% ERROR: Local: Broker transport failure: 127.0.0.1:9092/1: Receive failed: Disconnected
%3|1623962487.407|FAIL|rdkafka#consumer-1| 127.0.0.1:9092/1: Receive failed: Disconnected
% ERROR: Local: Broker transport failure: 127.0.0.1:9092/1: Receive failed: Disconnected
%3|1623962488.154|FAIL|rdkafka#consumer-1| 127.0.0.1:9092/1: Receive failed: Disconnected
% ERROR: Local: Broker transport failure: 127.0.0.1:9092/1: Receive failed: Disconnected
%3|1623962488.811|FAIL|rdkafka#consumer-1| 127.0.0.1:9092/1: Receive failed: Disconnected
% ERROR: Local: Broker transport failure: 127.0.0.1:9092/1: Receive failed: Disconnected
%3|1623962489.448|FAIL|rdkafka#consumer-1| 127.0.0.1:9092/1: Receive failed: Disconnected
% ERROR: Local: Broker transport failure: 127.0.0.1:9092/1: Receive failed: Disconnected
%3|1623962490.004|FAIL|rdkafka#consumer-1| 127.0.0.1:9092/1: Receive failed: Disconnected
% ERROR: Local: Broker transport failure: 127.0.0.1:9092/1: Receive failed: Disconnected
%3|1623962490.277|FAIL|rdkafka#consumer-1| 127.0.0.1:9092/1: Receive failed: Disconnected
% ERROR: Local: Broker transport failure: 127.0.0.1:9092/1: Receive failed: Disconnected
%3|1623962490.610|FAIL|rdkafka#consumer-1| 127.0.0.1:9092/1: Receive failed: Disconnected
% ERROR: Local: Broker transport failure: 127.0.0.1:9092/1: Receive failed: Disconnected
%3|1623962491.154|FAIL|rdkafka#consumer-1| 127.0.0.1:9092/1: Receive failed: Disconnected
% ERROR: Local: Broker transport failure: 127.0.0.1:9092/1: Receive failed: Disconnected
%3|1623962491.388|FAIL|rdkafka#consumer-1| 127.0.0.1:9092/1: Receive failed: Disconnected
% ERROR: Local: Broker transport failure: 127.0.0.1:9092/1: Receive failed: Disconnected
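To separate the offset error from the plain connection failures above, a quick metadata check against the same plaintext listener can help; this is a sketch assuming the confluent-kafka Python client, not something from the original report:

from confluent_kafka.admin import AdminClient

admin = AdminClient({'bootstrap.servers': '127.0.0.1:9092'})
metadata = admin.list_topics(timeout=10)

# list the brokers and the partitions the cluster reports for the topic
print('brokers:', {b.id: '%s:%d' % (b.host, b.port) for b in metadata.brokers.values()})
topic = metadata.topics.get('messages')
if topic is None:
    print('topic messages is not reported by the broker')
else:
    print('partitions:', sorted(topic.partitions.keys()))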
Added context: this was on 21.6.1, and the upgrade to 21.6.2 solved it, but we should investigate.
I haven't tried to reproduce this, but I suspect that lowering our retention from one week to, say, 30 minutes may yield a reproducer. My hunch is that it's not the version upgrade itself, but rather the restart that came along with the upgrade.
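One way to test that hunch (a sketch under the assumption that the confluent-kafka Python client is available; the group id is hypothetical) is to compare each partition's low/high watermarks with the group's committed offsets: a committed offset below the low watermark means retention has already deleted those records, and the next fetch would return exactly this out-of-range error.

from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    'bootstrap.servers': '127.0.0.1:9092',
    'group.id': 'watermark-probe',  # hypothetical group id
})

for partition in range(10):  # the topic was created with -p 10
    tp = TopicPartition('messages', partition)
    low, high = consumer.get_watermark_offsets(tp, timeout=10)
    # committed offset may be a negative sentinel if the group never committed
    committed = consumer.committed([tp], timeout=10)[0].offset
    print('partition %d: low=%d high=%d committed=%d' % (partition, low, high, committed))

consumer.close()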
This ticket is stale; closing.