Failure at PartitionBalancerTest.test_rack_awareness: TimeoutError('failed to wait until status condition')

Open andijcr opened this issue 2 years ago • 5 comments

only on arm and debug:

FAIL test: PartitionBalancerTest.test_rack_awareness (1/46 runs) failure at 2022-10-20T07:34:28.340Z: TimeoutError('failed to wait until status condition') in job https://buildkite.com/redpanda/redpanda/builds/16960#0183f3d6-9bfc-485a-a901-ef9a80939798
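
For readers unfamiliar with this error: it is what a polling wait produces when its condition never becomes true before the deadline. Below is a minimal, self-contained sketch of that pattern. It is not the actual test or ducktape code; `wait_until` here is a local stand-in, and `balancer_is_ready`, `get_status` and the `"ready"` status value are assumptions made for illustration.

```python
import time


def wait_until(condition, timeout_sec, backoff_sec=0.1, err_msg=""):
    """Poll condition() until it returns truthy or timeout_sec elapses.

    Local stand-in for a test-framework helper, not the real implementation.
    """
    deadline = time.monotonic() + timeout_sec
    while True:
        if condition():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(err_msg)
        time.sleep(backoff_sec)


def balancer_is_ready(get_status):
    # get_status is a hypothetical callable returning the balancer status
    # string; "ready" as the terminal value is an assumption.
    return get_status() == "ready"


if __name__ == "__main__":
    # A balancer stuck in "in_progress" never satisfies the condition,
    # so the wait ends with the TimeoutError seen in this issue.
    try:
        wait_until(
            lambda: balancer_is_ready(lambda: "in_progress"),
            timeout_sec=0.5,
            err_msg="failed to wait until status condition",
        )
    except TimeoutError as e:
        print(f"TimeoutError: {e}")
```

The point is only that the error says nothing about the cause: anything that keeps the status from reaching the expected value within the timeout shows up the same way.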

andijcr avatar Oct 21 '22 14:10 andijcr

I can't understand what is going on here. In the logs I see that the partition balancer planned 46 reassignments:

INFO 2022-10-20 06:20:34,594 [shard 0] cluster - partition_balancer_backend.cc:199 - last status: in_progress; violations: unavailable nodes: 1, full nodes: 0; updates in progress: 0; reassignments planned: 46, cancelled: 0, failed: 0

It is moving partitions from node 1, which is down, to node 2:

INFO 2022-10-20 06:20:34,601 [shard 0] cluster - partition_balancer_backend.cc:227 - moving {kafka/topic-spunqqmufx/8} to {{node_id: 2, shard: 0}, {node_id: 5, shard: 0}, {node_id: 3, shard: 0}}

In the node 2 log I see that replication is in progress:

INFO  2022-10-20 06:20:41,807 [shard 0] storage - segment.cc:655 - Creating new segment /var/lib/redpanda/data/kafka/topic-spunqqmufx/8_472/1346-2-v1.log
TRACE 2022-10-20 06:20:41,807 [shard 0] storage - segment_reader.cc:120 - ::get segment file /var/lib/redpanda/data/kafka/topic-spunqqmufx/8_472/1346-2-v1.log, refcount=0
DEBUG 2022-10-20 06:20:41,808 [shard 0] storage - segment_reader.cc:124 - Opening segment file /var/lib/redpanda/data/kafka/topic-spunqqmufx/8_472/1346-2-v1.log

But it ends with:

INFO 2022-10-20 06:20:42,513 [shard 0] raft - [group_id:9, {kafka/topic-spunqqmufx/8}] consensus.cc:538 - Node {id: {2}, revision: {472}} recovery cancelled (rpc::errc::client_request_timeout)

I think it might happen because there are too many movements at one time, so the node can't replicate everything in time. I don't see any specific logs that help me understand what is going on. @mmaslankaprv, maybe you can help me here?

ZeDRoman avatar Nov 11 '22 14:11 ZeDRoman

Similar error but for test_movement_cancellations on ARM https://buildkite.com/redpanda/vtools/builds/4201#018463f2-bed8-437b-af13-fa9cc677ecaf/6-7538

NyaliaLui avatar Nov 11 '22 15:11 NyaliaLui

Also similar error but for test_rack_constraint_repair on ARM https://buildkite.com/redpanda/vtools/builds/4201#018463f2-bed8-437b-af13-fa9cc677ecaf/6-7442 https://buildkite.com/redpanda/vtools/builds/4196#01846250-d53f-473f-83be-eb6536cfa36f/6-7408

And test_maintenance_mode on ARM https://buildkite.com/redpanda/vtools/builds/4201#018463f2-bed8-437b-af13-fa9cc677ecaf/6-7414

And test_fuzz_admin_ops on ARM https://buildkite.com/redpanda/vtools/builds/4196#01846250-d53f-473f-83be-eb6536cfa36f/6-7382 https://buildkite.com/redpanda/vtools/builds/4201#018463f2-bed8-437b-af13-fa9cc677ecaf/6-7388

And test_partition_balancer_with_limits on ARM https://buildkite.com/redpanda/vtools/builds/4196#01846250-d53f-473f-83be-eb6536cfa36f/6-7290 https://buildkite.com/redpanda/vtools/builds/4201#018463f2-bed8-437b-af13-fa9cc677ecaf/6-7318

And test_rack_awareness on ARM https://buildkite.com/redpanda/vtools/builds/4201#018463f2-bed8-437b-af13-fa9cc677ecaf/6-7236 https://buildkite.com/redpanda/vtools/builds/4196#01846250-d53f-473f-83be-eb6536cfa36f/6-7208

And on ControllerLogLimitPartitionBalancerTests.test_partition_balancer_with_limits https://buildkite.com/redpanda/vtools/builds/4196#01846250-d53f-473f-83be-eb6536cfa36f/6-7290

NyaliaLui avatar Nov 11 '22 16:11 NyaliaLui

I will try to decrease movement batch size
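
To illustrate the idea only: a smaller batch size means fewer concurrent moves competing for recovery bandwidth, at the cost of more rounds. The sketch below is a toy chunking function, not the balancer's scheduling logic; `plan_batches` and the batch size of 5 are hypothetical, and only the count of 46 comes from the log above.

```python
def plan_batches(reassignments, batch_size):
    """Split a list of planned reassignments into consecutive batches.

    Hypothetical helper for illustration; the real balancer decides what to
    move on each tick and does not precompute batches like this.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [
        reassignments[i:i + batch_size]
        for i in range(0, len(reassignments), batch_size)
    ]


if __name__ == "__main__":
    # 46 planned reassignments, as in the balancer status line quoted earlier.
    planned = [f"partition-{n}" for n in range(46)]
    batches = plan_batches(planned, batch_size=5)  # 5 is an arbitrary example
    print(len(batches), "batches; last batch has", len(batches[-1]), "move(s)")
```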

ZeDRoman avatar Nov 11 '22 16:11 ZeDRoman

this time on cdt (along with a bunch of other failures) https://buildkite.com/redpanda/vtools/builds/4255#018477c1-dbc9-454c-bd32-fa3b114f7df8

andijcr avatar Nov 15 '22 15:11 andijcr

The first failure posted here is caused by the fact that we do not leave joint consensus while there are learners in the configuration.

In the controller_backend logs we can see a lot of messages indicating that the replicas on node 1, which is down, are still learners, hence we cannot make progress. To be promoted, learners need to be up to date with the leader, which means they need to be available.
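
A toy model of that condition, to make the stall concrete. This is not Redpanda's raft code; `Replica`, `promote_caught_up_learners`, `can_leave_joint_consensus` and the offset field are hypothetical names, and the model only encodes the two rules stated above: learners are promoted once they are available and caught up, and joint consensus is left only when no learners remain.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    node_id: int
    is_learner: bool
    available: bool
    offset: int  # last replicated offset (illustrative field)


def promote_caught_up_learners(replicas, leader_offset):
    """Promote learners that are available and up to date with the leader."""
    for r in replicas:
        if r.is_learner and r.available and r.offset >= leader_offset:
            r.is_learner = False


def can_leave_joint_consensus(replicas):
    """Joint consensus can be left only when no learners remain."""
    return not any(r.is_learner for r in replicas)


if __name__ == "__main__":
    replicas = [
        Replica(node_id=1, is_learner=True, available=False, offset=0),   # node is down
        Replica(node_id=2, is_learner=True, available=True, offset=100),
        Replica(node_id=3, is_learner=False, available=True, offset=100),
    ]
    promote_caught_up_learners(replicas, leader_offset=100)
    # Node 2 is promoted, but node 1 can never catch up while it is down,
    # so the configuration stays in joint consensus.
    print(can_leave_joint_consensus(replicas))
```

In this model the reconfiguration stalls no matter how long the test waits, which matches the timeout seen in the first failure.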

mmaslankaprv avatar Nov 23 '22 14:11 mmaslankaprv

The original failure mode that was described in this issue should be fixed with: https://github.com/redpanda-data/redpanda/pull/6798

mmaslankaprv avatar Nov 24 '22 06:11 mmaslankaprv

A bunch more; notably, most of them are on the same run.

FAIL test: PartitionBalancerTest.test_fuzz_admin_ops (1/153 runs)
  failure at 2022-12-24T04:13:23.294Z: TimeoutError('failed to wait until status condition') on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4891#01854096-20d5-401c-b8b1-94e4d21e8e45
FAIL test: PartitionBalancerTest.test_movement_cancellations (1/153 runs)
  failure at 2022-12-24T04:13:23.294Z: TimeoutError('failed to wait until status condition') on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4891#01854096-20d5-401c-b8b1-94e4d21e8e45
FAIL test: PartitionBalancerTest.test_rack_awareness (2/153 runs)
  failure at 2022-12-24T07:12:54.691Z: TimeoutError('failed to wait until status condition') on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/20344#018542d4-d6f4-44e7-bf9e-a3b4e36494a3
  failure at 2022-12-24T04:13:23.294Z: TimeoutError('failed to wait until status condition') on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4891#01854096-20d5-401c-b8b1-94e4d21e8e45
FAIL test: PartitionBalancerTest.test_rack_constraint_repair (1/153 runs)
  failure at 2022-12-24T04:13:23.294Z: TimeoutError('failed to wait until status condition') on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4891#01854096-20d5-401c-b8b1-94e4d21e8e45
FAIL test: PartitionBalancerTest.test_unavailable_nodes (1/153 runs)
  failure at 2022-12-24T04:13:23.294Z: TimeoutError('failed to wait until status condition') on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4891#01854096-20d5-401c-b8b1-94e4d21e8e45

BenPope avatar Dec 30 '22 15:12 BenPope

By all indications this is area/controller, so I'm removing the other area labels. Please add them back if I'm mistaken.

dotnwat avatar Jan 22 '23 21:01 dotnwat

@ZeDRoman : Can this be closed then?

piyushredpanda avatar Jan 27 '23 05:01 piyushredpanda

https://buildkite.com/redpanda/redpanda/builds/22815#0186339d-f52a-4cd0-8dd6-0d3bb85991f7

VadimPlh avatar Feb 09 '23 15:02 VadimPlh

https://buildkite.com/redpanda/redpanda/builds/22973#01863be1-091f-42c4-93f0-9cdbc2698e9d

VladLazar avatar Feb 10 '23 16:02 VladLazar

"The test_rack_awareness failed in debug mode only, i checked and is it related with the fact that nodes are slow i.e. recovery is making very slow progress. I am going to add a commit skipping this test in debug builds." - Michal in #8744

VladLazar avatar Feb 10 '23 16:02 VladLazar
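
For context on the "skip in debug builds" approach quoted above, here is a rough sketch of how such a skip can be written as a decorator. It is not the commit from #8744; `skip_in_debug`, the `BUILD_TYPE` environment variable and the test body are all hypothetical.

```python
import functools
import os


def skip_in_debug(test_fn):
    """Skip the wrapped test when running against a debug build.

    Hypothetical decorator: reading the build type from a BUILD_TYPE
    environment variable is an assumption made for this sketch.
    """
    @functools.wraps(test_fn)
    def wrapper(*args, **kwargs):
        if os.environ.get("BUILD_TYPE", "release") == "debug":
            # Debug nodes are slow enough that recovery may not finish
            # within the test's timeout, so the test is not run.
            return None
        return test_fn(*args, **kwargs)
    return wrapper


@skip_in_debug
def test_rack_awareness():
    # Placeholder body standing in for the real test.
    return "ran"


if __name__ == "__main__":
    os.environ["BUILD_TYPE"] = "debug"
    print(test_rack_awareness())
    os.environ["BUILD_TYPE"] = "release"
    print(test_rack_awareness())
```

Skipping hides the slowness rather than fixing it, so it only makes sense when the failure is known to be a debug-build timing artifact, as described in the quote.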

https://buildkite.com/redpanda/redpanda/builds/22874#0186371a-d868-4361-b201-cdece7c1672e

dotnwat avatar Feb 10 '23 18:02 dotnwat