
Cannot consume topics after 1 week of ingestion

proudlygeek opened this issue 4 years ago · 3 comments

After a week of ingestion, our Redpanda cluster started refusing new consumers with the following errors:

$ rpk topic consume messages
Got an error consuming topic 'messages', partition 4: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.
Got an error consuming topic 'messages', partition 5: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.
Got an error consuming topic 'messages', partition 3: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.
Got an error consuming topic 'messages', partition 2: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.
Got an error consuming topic 'messages', partition 9: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.
Got an error consuming topic 'messages', partition 7: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.
Got an error consuming topic 'messages', partition 0: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.
Got an error consuming topic 'messages', partition 8: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.
Got an error consuming topic 'messages', partition 1: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.
Got an error consuming topic 'messages', partition 6: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.
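The "offset outside the range" message is Kafka's OFFSET_OUT_OF_RANGE error: the fetch asked for an offset below the partition's log start offset (or above its high watermark), which typically happens after retention deletes old segments. As a hedged sanity check (standard kafkacat flags, not taken from the original report), one could inspect topic metadata and try consuming from the log end instead of the beginning:

# Print topic metadata (partition leaders and replicas).
$ kafkacat -L -b localhost:9092 -t messages

# Consume from the current log end; if this succeeds while consuming from
# the beginning fails, the requested offsets fall below the log start offset.
$ kafkacat -C -b localhost:9092 -t messages -o end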

How to reproduce

While I'm not exactly sure what caused the issue, here's how the topic was generated (a scripted sketch follows the list):

  1. Create the topic messages by using rpk topic create messages -p 10.
  2. Produce 4-5 million events (what we had over a week of ingestion) into the topic.
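A minimal scripted version of those steps, assuming a plaintext listener on localhost:9092 and an illustrative payload shape:

$ rpk topic create messages -p 10

# kafkacat -P produces one message per line read from stdin; ~5M small
# events stand in for a week of ingestion.
$ seq 1 5000000 | sed 's/^/event-/' | kafkacat -P -b localhost:9092 -t messages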

Resolution

Updating from version v21.6.1 (rev 55d3c65) to v21.6.2 (rev 7ab90dc) solved the issue, but I'm not 100% sure why.
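For reference, assuming the cluster was installed from Redpanda's apt repository (the installation method isn't stated in the report), the upgrade would amount to:

$ sudo apt-get update && sudo apt-get install redpanda
$ sudo systemctl restart redpanda

# Confirm the new build is running.
$ rpk version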

Configuration

redpanda.yaml

cluster_id: redpanda
config_file: /etc/redpanda/redpanda.yaml
node_uuid: ma7pUXWT8FMAvcC9GNNsdiW5jCYuTCaootqQYyMVVehHxcutZ
organization: redpanda-callbell
pandaproxy: {}
redpanda:
  admin:
  - address: 0.0.0.0
    port: 9644
  advertised_kafka_api:
  - address: 127.0.0.1
    name: internal
    port: 9092
  - address: 34.240.138.136
    name: external
    port: 9093
  advertised_rpc_api:
    address: 34.240.138.136
    port: 33145
  auto_create_topics_enabled: false
  data_directory: /var/lib/redpanda/data
  developer_mode: false
  kafka_api:
  - address: 127.0.0.1
    name: internal
    port: 9092
  - address: 0.0.0.0
    name: external
    port: 9093
  kafka_api_tls:
  - cert_file: /etc/redpanda/certs/vectorized.crt
    enabled: true
    key_file: /etc/redpanda/certs/vectorized.key
    name: external
    require_client_auth: true
    truststore_file: /etc/redpanda/certs/vectorized_ca.crt
  node_id: 1
  rpc_server:
    address: 127.0.0.1
    port: 33145
  seed_servers: []
rpk:
  coredump_dir: /var/lib/redpanda/coredump
  enable_memory_locking: false
  enable_usage_stats: false
  overprovisioned: false
  tune_aio_events: true
  tune_clocksource: true
  tune_coredump: false
  tune_cpu: true
  tune_disk_irq: true
  tune_disk_nomerges: true
  tune_disk_scheduler: true
  tune_disk_write_cache: true
  tune_fstrim: true
  tune_network: true
  tune_swappiness: true
  tune_transparent_hugepages: false

io-config.yaml

- mountpoint: /mnt/vectorized
  read_iops: 2999
  read_bandwidth: 131289336
  write_iops: 3048
  write_bandwidth: 131588848

cc @twmb @0x5d @senior7515

proudlygeek · Jun 17 '21 21:06

For added context, kafkacat could also not consume the topic:

kafkacat -C -b localhost:9092 -t messages
%3|1623962486.455|FAIL|rdkafka#consumer-1| 127.0.0.1:9092/1: Receive failed: Disconnected
% ERROR: Local: Broker transport failure: 127.0.0.1:9092/1: Receive failed: Disconnected
%3|1623962486.709|FAIL|rdkafka#consumer-1| 127.0.0.1:9092/1: Receive failed: Disconnected
% ERROR: Local: Broker transport failure: 127.0.0.1:9092/1: Receive failed: Disconnected
%3|1623962487.407|FAIL|rdkafka#consumer-1| 127.0.0.1:9092/1: Receive failed: Disconnected
% ERROR: Local: Broker transport failure: 127.0.0.1:9092/1: Receive failed: Disconnected
%3|1623962488.154|FAIL|rdkafka#consumer-1| 127.0.0.1:9092/1: Receive failed: Disconnected
% ERROR: Local: Broker transport failure: 127.0.0.1:9092/1: Receive failed: Disconnected
%3|1623962488.811|FAIL|rdkafka#consumer-1| 127.0.0.1:9092/1: Receive failed: Disconnected
% ERROR: Local: Broker transport failure: 127.0.0.1:9092/1: Receive failed: Disconnected
%3|1623962489.448|FAIL|rdkafka#consumer-1| 127.0.0.1:9092/1: Receive failed: Disconnected
% ERROR: Local: Broker transport failure: 127.0.0.1:9092/1: Receive failed: Disconnected
%3|1623962490.004|FAIL|rdkafka#consumer-1| 127.0.0.1:9092/1: Receive failed: Disconnected
% ERROR: Local: Broker transport failure: 127.0.0.1:9092/1: Receive failed: Disconnected
%3|1623962490.277|FAIL|rdkafka#consumer-1| 127.0.0.1:9092/1: Receive failed: Disconnected
% ERROR: Local: Broker transport failure: 127.0.0.1:9092/1: Receive failed: Disconnected
%3|1623962490.610|FAIL|rdkafka#consumer-1| 127.0.0.1:9092/1: Receive failed: Disconnected
% ERROR: Local: Broker transport failure: 127.0.0.1:9092/1: Receive failed: Disconnected
%3|1623962491.154|FAIL|rdkafka#consumer-1| 127.0.0.1:9092/1: Receive failed: Disconnected
% ERROR: Local: Broker transport failure: 127.0.0.1:9092/1: Receive failed: Disconnected
%3|1623962491.388|FAIL|rdkafka#consumer-1| 127.0.0.1:9092/1: Receive failed: Disconnected
% ERROR: Local: Broker transport failure: 127.0.0.1:9092/1: Receive failed: Disconnected
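When librdkafka only reports "Receive failed: Disconnected", its debug contexts usually show what happened on the wire before the connection dropped. A hedged follow-up using kafkacat's standard -d flag:

# broker and fetch are standard librdkafka debug contexts; the output shows
# the fetch requests and broker responses preceding each disconnect.
$ kafkacat -C -b localhost:9092 -t messages -d broker,fetch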

twmb · Jun 17 '21 23:06

Added context: this was on 21.6.1, and upgrading to 21.6.2 solved it, but we should investigate.

emaxerrno · Jun 18 '21 02:06

I haven't tried to reproduce this, but I suspect that lowering our retention from 1 week to, say, 30 minutes may result in a reproducer. My hunch is that it's not the version upgrade, but rather the restart that came along with the upgrade.
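A sketch of that reproducer (the -c topic-config flag and the exact timings are assumptions, untested):

# Create a topic with 30-minute retention instead of the default.
$ rpk topic create messages-repro -p 10 -c retention.ms=1800000

# Produce, wait past the retention window, restart the broker, then consume
# from the beginning: if the hunch is right, this fails only after the restart.
$ seq 1 100000 | kafkacat -P -b localhost:9092 -t messages-repro
$ sleep 2400
$ sudo systemctl restart redpanda
$ kafkacat -C -b localhost:9092 -t messages-repro -o beginning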

dotnwat · Jun 18 '21 03:06

This ticket is stale, closing.

jcsp · Feb 21 '23 10:02