Failure in `TopicRecoveryTest`.`test_fast2`

Open · jcsp opened this issue 2 years ago • 4 comments

Seen 1x in last 30 days, so it's either new or rare.

https://buildkite.com/redpanda/redpanda/builds/18184#01845ac8-1a7f-4763-b63e-e53ec491599c/6-678

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/topic_recovery_test.py", line 1352, in test_fast2
    self.do_run(test_case)
  File "/root/tests/rptest/tests/topic_recovery_test.py", line 1258, in do_run
    test_case.validate_cluster(baseline, restored)
  File "/root/tests/rptest/tests/topic_recovery_test.py", line 706, in validate_cluster
    self._produce_and_verify(topic)
  File "/root/tests/rptest/tests/topic_recovery_test.py", line 219, in _produce_and_verify
    assert old_state.hwm != new_state.hwm, \
AssertionError: old_hw 3319 unexpectedly same as new_hw 3319 for topic spec: panda-topic-1

(Split this off from https://github.com/redpanda-data/redpanda/issues/6356 which was about a different failure mode)

jcsp · Nov 10 '22 14:11

So this looks like a benign failure: after recovery the test produces a little data (10k records) and checks that the high watermark increased for every partition. But in this test all 10k records landed in the same partition:

before:

PARTITION  LEADER  EPOCH  REPLICAS  LOG-START-OFFSET  HIGH-WATERMARK
0          1       2      [1 2 3]   0                 3392
1          3       2      [1 2 3]   0                 3289
2          3       2      [1 2 3]   0                 3319

after:

PARTITION  LEADER  EPOCH  REPLICAS  LOG-START-OFFSET  HIGH-WATERMARK
0          1       2      [1 2 3]   0                 13392
1          3       2      [1 2 3]   0                 3289
2          3       2      [1 2 3]   0                 3319

As the high watermarks didn't increase for partitions 1 and 2, the test failed.
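
For reference, here is a minimal sketch of that per-partition check (not the test's actual `_produce_and_verify` helper; it assumes confluent-kafka and the three-partition topic above):

```python
from confluent_kafka import Consumer, TopicPartition

def high_watermarks(brokers: str, topic: str, partitions: int) -> dict:
    """Ask the brokers for each partition's current high watermark."""
    c = Consumer({"bootstrap.servers": brokers, "group.id": "hwm-probe"})
    try:
        return {
            p: c.get_watermark_offsets(TopicPartition(topic, p), timeout=10)[1]
            for p in range(partitions)
        }
    finally:
        c.close()

before = high_watermarks("docker-rp-4:9092", "panda-topic-1", 3)
# ... produce the 10k verification records here ...
after = high_watermarks("docker-rp-4:9092", "panda-topic-1", 3)
stuck = [p for p in before if after[p] == before[p]]
assert not stuck, f"high watermark did not move for partitions {stuck}"
```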

ztlpn · Nov 11 '22 16:11

Ok, the problem is actually not with the client partitioner, but that node 2 (rp-4) sent a metadata response without node 3, which happens to host the leaders for partitions 1 and 2:

TRACE 2022-11-09 07:15:48,114 [shard 0] kafka - request_context.h:168 - [172.16.16.27:51836] sending 3:metadata for {ducktape}, response {throttle_time_ms=0 brokers={{node_id=2 host=docker-rp-4 port=9092 rack={nullopt}}, {node_id=1 host=docker-rp-14 port=9092 rack={nullopt}}} cluster_id={redpanda.8bfd0886-c25f-43a2-9d62-a8dd4c20a52e} controller_id=1 topics={{error_code={ error_code: none [0] } name={panda-topic-1} is_internal=false partitions={{error_code={ error_code: none [0] } partition_index=0 leader_id=1 leader_epoch=2 replica_nodes={{1}, {3}, {2}} isr_nodes={{1}, {3}, {2}} offline_replicas={}}, {error_code={ error_code: none [0] } partition_index=1 leader_id=3 leader_epoch=2 replica_nodes={{3}, {2}, {1}} isr_nodes={{3}, {2}, {1}} offline_replicas={}}, {error_code={ error_code: none [0] } partition_index=2 leader_id=3 leader_epoch=2 replica_nodes={{3}, {2}, {1}} isr_nodes={{3}, {2}, {1}} offline_replicas={}}} topic_authorized_operations=0}} cluster_authorized_operations=-2147483648}
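
A hypothetical client-side probe for what this trace shows, i.e. a metadata response whose broker list doesn't cover every partition leader (confluent-kafka's AdminClient is an assumption used for illustration only):

```python
from confluent_kafka.admin import AdminClient

admin = AdminClient({"bootstrap.servers": "docker-rp-4:9092"})
md = admin.list_topics(timeout=10)
known = set(md.brokers)  # only {1, 2} in the response above
for name, topic in md.topics.items():
    for pid, part in topic.partitions.items():
        if part.leader not in known:
            # partitions 1 and 2 would be reported here: leader 3 is absent
            print(f"{name}/{pid}: leader {part.leader} missing from metadata brokers {sorted(known)}")
```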

This happened because node 1 (rp-14), which was the controller leader, didn't manage to get health reports from nodes 2 and 3 after the restart until it was too late:

INFO  2022-11-09 07:15:33,477 [shard 0] main - application.cc:1496 - Successfully started Redpanda!
...
DEBUG 2022-11-09 07:15:44,717 [shard 0] cluster - health_monitor_backend.cc:312 - refreshing health cache, leader id: {1} self: 1
DEBUG 2022-11-09 07:15:44,717 [shard 0] cluster - health_monitor_backend.cc:567 - collecting cluster health statistics
INFO  2022-11-09 07:15:44,717 [shard 0] cluster - health_monitor_backend.cc:556 - received node 2 health report, marking node as up
DEBUG 2022-11-09 07:15:44,717 [shard 0] cluster - health_monitor_backend.cc:621 - collecting health report with filter: {include_partitions: true, ntp_filters: {}}
INFO  2022-11-09 07:15:44,718 [shard 0] cluster - health_monitor_backend.cc:556 - received node 3 health report, marking node as up

And that happened because (I think) the addition of cluster nodes raced with the initial health monitor update, and the cluster health data then remained incomplete for the duration of the staleness timeout.

ztlpn · Nov 16 '22 02:11

Ah-ha... but we could still fix this by changing the partitioner, right? If we used a partitioner that insisted on a deterministic, key-based partition choice, it would wait until the partitions it wanted were available.
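
A sketch of that idea, assuming a librdkafka-based producer: hashing an explicit key picks the partition from the partition count alone, so records destined for partitions 1 and 2 would be queued until their leaders are known, rather than all being steered to partition 0:

```python
from confluent_kafka import Producer

p = Producer({"bootstrap.servers": "docker-rp-4:9092"})
for i in range(10_000):
    # keyed records are hashed onto a fixed partition regardless of which
    # brokers the metadata response currently lists
    p.produce("panda-topic-1", key=str(i), value=f"record-{i}")
    p.poll(0)  # serve delivery callbacks while producing
p.flush(30)
```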

jcsp · Nov 16 '22 09:11

Well, I am not sure; maybe even the simple round-robin partitioner takes node availability into account. So I think the better fix is to wait for the cluster to stabilize.

And I would say it is a Redpanda issue that, due to a race condition between the health monitor and node registration, the information about node 2 was not available for ~15 seconds. Although we should probably wait for the change that @mmaslankaprv is working on (changing node registration to use controller commands instead of raft0 configs) to land before fixing it.
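
A hedged sketch of that "wait for the cluster to stabilize" step, polling the admin API's health overview before producing (the endpoint path, field names, and port are assumptions, not the test's actual code):

```python
import time
import requests

def wait_for_healthy_cluster(admin_url: str, expected_nodes: int, timeout_s: int = 90) -> None:
    """Block until the health overview reports all expected nodes and no problems."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        overview = requests.get(f"{admin_url}/v1/cluster/health_overview", timeout=5).json()
        if overview.get("is_healthy") and len(overview.get("all_nodes", [])) == expected_nodes:
            return
        time.sleep(1)
    raise TimeoutError("cluster did not become healthy in time")

wait_for_healthy_cluster("http://docker-rp-14:9644", expected_nodes=3)
```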

ztlpn · Nov 16 '22 16:11

@ztlpn is there a particular PR this one is waiting for? I'm not up to speed on the controller change you mentioned.

jcsp · Nov 21 '22 14:11

@jcsp I was talking about Michal's RFC. Although Michal drew my attention to the fact that the controller log is nuked between restarts, so join requests have to be processed again and we need to wait for the cluster membership to stabilize anyway.

ztlpn · Nov 29 '22 01:11