Failure in `TopicRecoveryTest`.`test_fast2`
Seen 1x in last 30 days, so it's either new or rare.
https://buildkite.com/redpanda/redpanda/builds/18184#01845ac8-1a7f-4763-b63e-e53ec491599c/6-678
```
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/topic_recovery_test.py", line 1352, in test_fast2
    self.do_run(test_case)
  File "/root/tests/rptest/tests/topic_recovery_test.py", line 1258, in do_run
    test_case.validate_cluster(baseline, restored)
  File "/root/tests/rptest/tests/topic_recovery_test.py", line 706, in validate_cluster
    self._produce_and_verify(topic)
  File "/root/tests/rptest/tests/topic_recovery_test.py", line 219, in _produce_and_verify
    assert old_state.hwm != new_state.hwm, \
AssertionError: old_hw 3319 unexpectedly same as new_hw 3319 for topic spec: panda-topic-1
```
(Split this off from https://github.com/redpanda-data/redpanda/issues/6356 which was about a different failure mode)
So this looks like a benign failure: after recovery the test produces a little more data (10k records) and checks that the high-watermark increased for every partition. But in this run all 10k records landed in the same partition:
```
before:
PARTITION  LEADER  EPOCH  REPLICAS  LOG-START-OFFSET  HIGH-WATERMARK
0          1       2      [1 2 3]   0                 3392
1          3       2      [1 2 3]   0                 3289
2          3       2      [1 2 3]   0                 3319

after:
PARTITION  LEADER  EPOCH  REPLICAS  LOG-START-OFFSET  HIGH-WATERMARK
0          1       2      [1 2 3]   0                 13392
1          3       2      [1 2 3]   0                 3289
2          3       2      [1 2 3]   0                 3319
```
Since the high-watermarks didn't increase for partitions 1 and 2, the test failed.
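In other words, the check that tripped is essentially this per-partition comparison (a minimal sketch reconstructed from the traceback and the describe output above; the dict-based representation is illustrative, not the test's actual data structures):

```python
# High-watermarks per partition, copied from the before/after output above.
before = {0: 3392, 1: 3289, 2: 3319}
after = {0: 13392, 1: 3289, 2: 3319}

# The test asserts that every partition's high-watermark moved after the extra
# 10k records were produced, so partitions 1 and 2 trip the assert here.
for partition, old_hwm in before.items():
    new_hwm = after[partition]
    assert old_hwm != new_hwm, \
        f"old_hw {old_hwm} unexpectedly same as new_hw {new_hwm}"
```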
OK, the problem is actually not with the client partitioner: node 2 (rp-4) sent a metadata response that was missing node 3, which happens to host the leaders for partitions 1 and 2:
```
TRACE 2022-11-09 07:15:48,114 [shard 0] kafka - request_context.h:168 - [172.16.16.27:51836] sending 3:metadata for {ducktape}, response {throttle_time_ms=0 brokers={{node_id=2 host=docker-rp-4 port=9092 rack={nullopt}}, {node_id=1 host=docker-rp-14 port=9092 rack={nullopt}}} cluster_id={redpanda.8bfd0886-c25f-43a2-9d62-a8dd4c20a52e} controller_id=1 topics={{error_code={ error_code: none [0] } name={panda-topic-1} is_internal=false partitions={{error_code={ error_code: none [0] } partition_index=0 leader_id=1 leader_epoch=2 replica_nodes={{1}, {3}, {2}} isr_nodes={{1}, {3}, {2}} offline_replicas={}}, {error_code={ error_code: none [0] } partition_index=1 leader_id=3 leader_epoch=2 replica_nodes={{3}, {2}, {1}} isr_nodes={{3}, {2}, {1}} offline_replicas={}}, {error_code={ error_code: none [0] } partition_index=2 leader_id=3 leader_epoch=2 replica_nodes={{3}, {2}, {1}} isr_nodes={{3}, {2}, {1}} offline_replicas={}}} topic_authorized_operations=0}} cluster_authorized_operations=-2147483648}
```
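To make the mismatch in that response concrete (values copied from the trace above, structure heavily simplified):

```python
# Brokers and partition leaders as advertised in the metadata response above.
brokers = {2: "docker-rp-4", 1: "docker-rp-14"}  # node 3 is missing
partition_leaders = {0: 1, 1: 3, 2: 3}           # partition -> leader node id

# A client working from this metadata has no address for the leader of
# partitions 1 and 2, so every record it manages to deliver ends up on
# partition 0, whose leader (node 1) is in the advertised broker list.
reachable = [p for p, leader in partition_leaders.items() if leader in brokers]
print(reachable)  # [0]
```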
This happened because node 1 (rp-14), which was the controller leader, didn't manage to get health reports from nodes 2 and 3 after the restart until it was too late:
```
INFO 2022-11-09 07:15:33,477 [shard 0] main - application.cc:1496 - Successfully started Redpanda!
...
DEBUG 2022-11-09 07:15:44,717 [shard 0] cluster - health_monitor_backend.cc:312 - refreshing health cache, leader id: {1} self: 1
DEBUG 2022-11-09 07:15:44,717 [shard 0] cluster - health_monitor_backend.cc:567 - collecting cluster health statistics
INFO 2022-11-09 07:15:44,717 [shard 0] cluster - health_monitor_backend.cc:556 - received node 2 health report, marking node as up
DEBUG 2022-11-09 07:15:44,717 [shard 0] cluster - health_monitor_backend.cc:621 - collecting health report with filter: {include_partitions: true, ntp_filters: {}}
INFO 2022-11-09 07:15:44,718 [shard 0] cluster - health_monitor_backend.cc:556 - received node 3 health report, marking node as up
```
And that happened because (I think) the addition of cluster nodes raced with the initial health monitor update, and the cluster health data then remained inadequate for the duration of the staleness timeout.
Ah-ha... but we could still fix this by changing the partitioner, right? If we used a partitioner that insisted on a deterministic, key-based partition choice, it would wait until the partitions it wanted were available.
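(For reference, "deterministic partition choice based on key" means something along these lines; this is a generic hash-based sketch, not the partitioner the test client actually uses:)

```python
import hashlib

def partition_for_key(key: bytes, num_partitions: int) -> int:
    # The key alone decides the partition, regardless of which leaders are
    # currently reachable, so a producer using this scheme has to wait for
    # (or fail on) the intended partition instead of silently redirecting
    # records to whichever partition happens to be reachable.
    return int.from_bytes(hashlib.md5(key).digest()[:4], "big") % num_partitions
```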
Well, I am not sure; maybe even the simple round-robin partitioner takes node availability into account. So I think the better fix is to wait for the cluster to stabilize.
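On the test side that could look roughly like the following (a rough sketch; `cluster_metadata` is a hypothetical helper standing in for whatever the test uses to fetch the broker list and per-partition leaders):

```python
from ducktape.utils.util import wait_until

def wait_for_stable_metadata(cluster_metadata, topic, expected_brokers,
                             timeout_sec=60):
    """Wait until all brokers are advertised and every partition of `topic`
    has a leader that appears in the advertised broker list."""
    def _stable():
        # `cluster_metadata` is assumed to return
        # (brokers: dict[node_id, host], leaders: dict[partition, node_id]).
        brokers, leaders = cluster_metadata(topic)
        return (len(brokers) == expected_brokers
                and all(leader in brokers for leader in leaders.values()))

    wait_until(_stable,
               timeout_sec=timeout_sec,
               backoff_sec=2,
               err_msg=f"metadata for {topic} did not stabilize")
```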
And I would say it is a Redpanda issue that, due to a race condition between the health monitor and node registration, the information about node 2 was not available for ~15 seconds. Although we should probably wait for the change that @mmaslankaprv is working on (changing node registration to use controller commands instead of raft0 configs) to land before fixing it.
@ztlpn is there a particular PR this one is waiting for? I'm not up to speed on the controller change you mentioned.
@jcsp I was talking about Michal's RFC. Although Michal drew my attention to the fact that the controller log is nuked between restarts, so join requests have to be processed again and we need to wait for the cluster membership to stabilize anyway.