redpanda
Failure at PartitionBalancerTest.test_rack_awareness: TimeoutError('failed to wait until status condition')
Only on ARM and debug builds:
FAIL test: PartitionBalancerTest.test_rack_awareness (1/46 runs) failure at 2022-10-20T07:34:28.340Z: TimeoutError('failed to wait until status condition') in job https://buildkite.com/redpanda/redpanda/builds/16960#0183f3d6-9bfc-485a-a901-ef9a80939798
I can't understand what is going on here.
In the logs I see that the partition balancer planned 46 reassignments:
INFO 2022-10-20 06:20:34,594 [shard 0] cluster - partition_balancer_backend.cc:199 - last status: in_progress; violations: unavailable nodes: 1, full nodes: 0; updates in progress: 0; reassignments planned: 46, cancelled: 0, failed: 0
It is moving partitions from node 1, which is down, to node 2:
INFO 2022-10-20 06:20:34,601 [shard 0] cluster - partition_balancer_backend.cc:227 - moving {kafka/topic-spunqqmufx/8} to {{node_id: 2, shard: 0}, {node_id: 5, shard: 0}, {node_id: 3, shard: 0}}
In the node 2 log I see that replication is in progress:
INFO 2022-10-20 06:20:41,807 [shard 0] storage - segment.cc:655 - Creating new segment /var/lib/redpanda/data/kafka/topic-spunqqmufx/8_472/1346-2-v1.log
TRACE 2022-10-20 06:20:41,807 [shard 0] storage - segment_reader.cc:120 - ::get segment file /var/lib/redpanda/data/kafka/topic-spunqqmufx/8_472/1346-2-v1.log, refcount=0
DEBUG 2022-10-20 06:20:41,808 [shard 0] storage - segment_reader.cc:124 - Opening segment file /var/lib/redpanda/data/kafka/topic-spunqqmufx/8_472/1346-2-v1.log
But it ends with
INFO 2022-10-20 06:20:42,513 [shard 0] raft - [group_id:9, {kafka/topic-spunqqmufx/8}] consensus.cc:538 - Node {id: {2}, revision: {472}} recovery cancelled (rpc::errc::client_request_timeout)
I think it might happen because there are too many movements at one time, so the node can't replicate everything in time. I don't see any specific logs that help me understand what is going on. @mmaslankaprv, maybe you can help me here?
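For triaging logs like the ones quoted above, the balancer status line can be pulled apart into numeric fields. This is only a sketch: the regular expression is modelled on the single status line quoted in this issue, and `parse_balancer_status` is a hypothetical helper, not part of the Redpanda test suite.

```python
import re

# Pattern modelled on the one status line quoted in this issue; the exact
# log format is an assumption and may differ between Redpanda versions.
STATUS_RE = re.compile(
    r"last status: (?P<status>\w+); "
    r"violations: unavailable nodes: (?P<unavailable>\d+), "
    r"full nodes: (?P<full>\d+); "
    r"updates in progress: (?P<in_progress>\d+); "
    r"reassignments planned: (?P<planned>\d+), "
    r"cancelled: (?P<cancelled>\d+), failed: (?P<failed>\d+)"
)


def parse_balancer_status(line):
    """Return the balancer status fields as a dict, or None if no match."""
    m = STATUS_RE.search(line)
    if m is None:
        return None
    fields = m.groupdict()
    status = fields.pop("status")
    # Every remaining captured field is a counter.
    result = {name: int(value) for name, value in fields.items()}
    result["status"] = status
    return result


line = (
    "INFO 2022-10-20 06:20:34,594 [shard 0] cluster - "
    "partition_balancer_backend.cc:199 - last status: in_progress; "
    "violations: unavailable nodes: 1, full nodes: 0; "
    "updates in progress: 0; "
    "reassignments planned: 46, cancelled: 0, failed: 0"
)
print(parse_balancer_status(line))
```

Running this over a whole node log would show how the planned and in-progress counters change over time, which helps distinguish "too many movements at once" from "movements never finish".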
Similar error but for test_movement_cancellations
on ARM
https://buildkite.com/redpanda/vtools/builds/4201#018463f2-bed8-437b-af13-fa9cc677ecaf/6-7538
Also similar error but for test_rack_constraint_repair
on ARM
https://buildkite.com/redpanda/vtools/builds/4201#018463f2-bed8-437b-af13-fa9cc677ecaf/6-7442
https://buildkite.com/redpanda/vtools/builds/4196#01846250-d53f-473f-83be-eb6536cfa36f/6-7408
And test_maintenance_mode
on ARM
https://buildkite.com/redpanda/vtools/builds/4201#018463f2-bed8-437b-af13-fa9cc677ecaf/6-7414
And test_fuzz_admin_ops
on ARM
https://buildkite.com/redpanda/vtools/builds/4196#01846250-d53f-473f-83be-eb6536cfa36f/6-7382
https://buildkite.com/redpanda/vtools/builds/4201#018463f2-bed8-437b-af13-fa9cc677ecaf/6-7388
And test_partition_balancer_with_limits
on ARM
https://buildkite.com/redpanda/vtools/builds/4196#01846250-d53f-473f-83be-eb6536cfa36f/6-7290
https://buildkite.com/redpanda/vtools/builds/4201#018463f2-bed8-437b-af13-fa9cc677ecaf/6-7318
And test_rack_awareness
on ARM
https://buildkite.com/redpanda/vtools/builds/4201#018463f2-bed8-437b-af13-fa9cc677ecaf/6-7236
https://buildkite.com/redpanda/vtools/builds/4196#01846250-d53f-473f-83be-eb6536cfa36f/6-7208
And on ControllerLogLimitPartitionBalancerTests.test_partition_balancer_with_limits
https://buildkite.com/redpanda/vtools/builds/4196#01846250-d53f-473f-83be-eb6536cfa36f/6-7290
I will try to decrease the movement batch size.
This time on CDT (along with a bunch of other failures): https://buildkite.com/redpanda/vtools/builds/4255#018477c1-dbc9-454c-bd32-fa3b114f7df8
The first failure posted here is caused by the fact that we do not leave joint consensus when there are learners in the configuration.
In the controller_backend logs we can see a lot of messages indicating that replicas on node 1, which is down, are still learners, hence we cannot make progress. To be promoted, learners need to be up to date with the leader, which means they have to be available.
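The stall described above can be illustrated with a toy model. This is a simplified sketch, not Redpanda's Raft implementation: `Replica`, `promote_caught_up_learners` and `can_leave_joint_consensus` are hypothetical names, and the only rules modelled are the two stated in the comment (learners must catch up with the leader to be promoted, and joint consensus is not left while learners remain).

```python
from dataclasses import dataclass


@dataclass
class Replica:
    node_id: int
    is_learner: bool
    available: bool
    match_offset: int  # last offset known to be replicated to this replica


def promote_caught_up_learners(replicas, leader_offset):
    """Promote learners that are reachable and up to date with the leader."""
    for r in replicas:
        if r.is_learner and r.available and r.match_offset >= leader_offset:
            r.is_learner = False


def can_leave_joint_consensus(replicas):
    """In this toy model, joint consensus is left only with no learners."""
    return not any(r.is_learner for r in replicas)


leader_offset = 100
replicas = [
    Replica(node_id=2, is_learner=True, available=True, match_offset=100),
    Replica(node_id=1, is_learner=True, available=False, match_offset=0),
    Replica(node_id=5, is_learner=False, available=True, match_offset=100),
]

promote_caught_up_learners(replicas, leader_offset)
# Node 2 is promoted, but node 1 is down and stays a learner, so the
# reconfiguration cannot finish and the test's status wait times out.
print(can_leave_joint_consensus(replicas))
```

In this model the move only completes once the replica on node 1 either comes back and catches up or is no longer required to be promoted.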
The original failure mode that was described in this issue should be fixed with: https://github.com/redpanda-data/redpanda/pull/6798
A bunch more; notably, most of them are from the same run.
FAIL test: PartitionBalancerTest.test_fuzz_admin_ops (1/153 runs)
  failure at 2022-12-24T04:13:23.294Z: TimeoutError('failed to wait until status condition') on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4891#01854096-20d5-401c-b8b1-94e4d21e8e45
FAIL test: PartitionBalancerTest.test_movement_cancellations (1/153 runs)
  failure at 2022-12-24T04:13:23.294Z: TimeoutError('failed to wait until status condition') on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4891#01854096-20d5-401c-b8b1-94e4d21e8e45
FAIL test: PartitionBalancerTest.test_rack_awareness (2/153 runs)
  failure at 2022-12-24T07:12:54.691Z: TimeoutError('failed to wait until status condition') on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/20344#018542d4-d6f4-44e7-bf9e-a3b4e36494a3
  failure at 2022-12-24T04:13:23.294Z: TimeoutError('failed to wait until status condition') on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4891#01854096-20d5-401c-b8b1-94e4d21e8e45
FAIL test: PartitionBalancerTest.test_rack_constraint_repair (1/153 runs)
  failure at 2022-12-24T04:13:23.294Z: TimeoutError('failed to wait until status condition') on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4891#01854096-20d5-401c-b8b1-94e4d21e8e45
FAIL test: PartitionBalancerTest.test_unavailable_nodes (1/153 runs)
  failure at 2022-12-24T04:13:23.294Z: TimeoutError('failed to wait until status condition') on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4891#01854096-20d5-401c-b8b1-94e4d21e8e45
By all indications this is area/controller, so I'm removing the other area labels. Please add them back if I'm mistaken.
@ZeDRoman : Can this be closed then?
https://buildkite.com/redpanda/redpanda/builds/22815#0186339d-f52a-4cd0-8dd6-0d3bb85991f7
https://buildkite.com/redpanda/redpanda/builds/22973#01863be1-091f-42c4-93f0-9cdbc2698e9d
"The test_rack_awareness failed in debug mode only, i checked and is it related with the fact that nodes are slow i.e. recovery is making very slow progress. I am going to add a commit skipping this test in debug builds." - Michal in #8744
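Skipping a test in debug builds, as proposed in the quote above, could look roughly like the sketch below. It is not the change from #8744: `skip_in_debug`, `is_debug_build` and the `BUILD_TYPE` environment variable are all hypothetical names used for illustration, and the real test suite may detect the build type differently.

```python
import functools
import os


def is_debug_build():
    # Assumption: the build type is exposed through a hypothetical
    # BUILD_TYPE environment variable; a missing value means release.
    return os.environ.get("BUILD_TYPE", "release").lower() == "debug"


def skip_in_debug(func):
    """Turn the decorated test into a no-op when running a debug build."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if is_debug_build():
            print(f"skipping {func.__name__}: debug build is too slow")
            return None
        return func(*args, **kwargs)

    return wrapper


@skip_in_debug
def test_rack_awareness():
    # Placeholder body standing in for the real test.
    return "ran"
```

Checking the build type inside the wrapper rather than at decoration time means the decision is made when the test actually runs.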
https://buildkite.com/redpanda/redpanda/builds/22874#0186371a-d868-4361-b201-cdece7c1672e