ARM+CDT: `MaintenanceTest`.`test_maintenance` / `test_maintenance_sticky` fails

Open jcsp opened this issue 3 years ago • 1 comments

https://buildkite.com/redpanda/vtools/builds/4324#018496a2-e835-4a1e-b9a1-af52a6a95f3b

Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/tests/maintenance_test.py", line 207, in test_maintenance
    self._maintenance_disable(target)
  File "/home/ubuntu/redpanda/tests/rptest/tests/maintenance_test.py", line 191, in _maintenance_disable
    wait_until(lambda: self._has_leadership_role(node),
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/utils/util.py", line 58, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError

I'd be surprised if this is happening purely because of the relatively slower cores on the ARM nodes, as we aren't moving a tonne of stuff around in this test.

jcsp avatar Nov 22 '22 11:11 jcsp

I tested this by hand on ARM to check if it was some environment issue like slow disks. It looks like a genuine test failure.

The leader balancer is waking up 30 seconds after the node leaves maintenance mode, as it should, and seeing that all the leaderships are on 2/3 of the nodes, and one node has none, but then concluding that no moves should be made.

ip-172-31-39-246/redpanda.log:DEBUG 2022-11-22 13:26:12,781 [shard  0] cluster - leader_balancer.cc:431 - No leadership balance improvements found with total delta 9.999999999999996, number of muted groups 6

The "muted groups" part makes me think of a timing issue (was the test intended to have groups exit the muted state before the next leader tick?), but it's only 6 groups out of 30, so I wonder if we perhaps have an arithmetic bug in the leader balancer. Then again, this test is only failing when run on dedicated VMs, and not in docker, so if something was badly broken I would expect it to show up in both places.

jcsp avatar Nov 22 '22 13:11 jcsp

Another instance: https://buildkite.com/redpanda/vtools/builds/4730#018526d6-f604-42f0-b676-a71840fdf989

bharathv avatar Dec 19 '22 17:12 bharathv

Again: https://buildkite.com/redpanda/vtools/builds/4788#01853123-39d8-452b-b909-9c057db48f19

bharathv avatar Dec 21 '22 17:12 bharathv

Quite a few of these - each variation of the test failed in the same run, so it could be environmental.

FAIL test: MaintenanceTest.test_maintenance.use_rpk=False (7/153 runs)
  failure at 2022-12-30T04:28:17.969Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4928#01855f7c-4001-4ed2-90cc-e7a1398b565a
  failure at 2022-12-29T04:31:38.661Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4923#01855a56-3e54-4997-9a52-fa50cb839888
  failure at 2022-12-28T04:28:12.819Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4918#0185552f-c829-43a8-9d3b-b74db0421f55
  failure at 2022-12-27T04:39:59.859Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4913#0185500b-bdaa-4cb0-b640-31259ecac3be
  failure at 2022-12-26T04:32:21.108Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4908#01854ae2-7717-429d-85ac-87d26f4e421a
  failure at 2022-12-24T04:13:23.294Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4891#01854096-20d5-401c-b8b1-94e4d21e8e45
  failure at 2022-12-25T04:20:18.252Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4898#018545bd-095d-4b37-b549-0173b1d272c1
FAIL test: MaintenanceTest.test_maintenance.use_rpk=True (7/153 runs)
  failure at 2022-12-30T04:28:17.969Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4928#01855f7c-4001-4ed2-90cc-e7a1398b565a
  failure at 2022-12-29T04:31:38.661Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4923#01855a56-3e54-4997-9a52-fa50cb839888
  failure at 2022-12-28T04:28:12.819Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4918#0185552f-c829-43a8-9d3b-b74db0421f55
  failure at 2022-12-27T04:39:59.859Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4913#0185500b-bdaa-4cb0-b640-31259ecac3be
  failure at 2022-12-26T04:32:21.108Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4908#01854ae2-7717-429d-85ac-87d26f4e421a
  failure at 2022-12-24T04:13:23.294Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4891#01854096-20d5-401c-b8b1-94e4d21e8e45
  failure at 2022-12-25T04:20:18.252Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4898#018545bd-095d-4b37-b549-0173b1d272c1
FAIL test: MaintenanceTest.test_maintenance_sticky.use_rpk=False (7/153 runs)
  failure at 2022-12-30T04:28:17.969Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4928#01855f7c-4001-4ed2-90cc-e7a1398b565a
  failure at 2022-12-29T04:31:38.661Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4923#01855a56-3e54-4997-9a52-fa50cb839888
  failure at 2022-12-28T04:28:12.819Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4918#0185552f-c829-43a8-9d3b-b74db0421f55
  failure at 2022-12-27T04:39:59.859Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4913#0185500b-bdaa-4cb0-b640-31259ecac3be
  failure at 2022-12-26T04:32:21.108Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4908#01854ae2-7717-429d-85ac-87d26f4e421a
  failure at 2022-12-24T04:13:23.294Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4891#01854096-20d5-401c-b8b1-94e4d21e8e45
  failure at 2022-12-25T04:20:18.252Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4898#018545bd-095d-4b37-b549-0173b1d272c1
FAIL test: MaintenanceTest.test_maintenance_sticky.use_rpk=True (7/153 runs)
  failure at 2022-12-30T04:28:17.969Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4928#01855f7c-4001-4ed2-90cc-e7a1398b565a
  failure at 2022-12-29T04:31:38.661Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4923#01855a56-3e54-4997-9a52-fa50cb839888
  failure at 2022-12-28T04:28:12.819Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4918#0185552f-c829-43a8-9d3b-b74db0421f55
  failure at 2022-12-27T04:39:59.859Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4913#0185500b-bdaa-4cb0-b640-31259ecac3be
  failure at 2022-12-26T04:32:21.108Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4908#01854ae2-7717-429d-85ac-87d26f4e421a
  failure at 2022-12-24T04:13:23.294Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4891#01854096-20d5-401c-b8b1-94e4d21e8e45
  failure at 2022-12-25T04:20:18.252Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4898#018545bd-095d-4b37-b549-0173b1d272c1

BenPope avatar Dec 30 '22 14:12 BenPope

Thanks. I've started looking into this.

dotnwat avatar Dec 30 '22 18:12 dotnwat

Another instance of this on CDT https://buildkite.com/redpanda/vtools/builds/4961#01857415-de4d-45ef-82aa-fd1e1e438729/6-7644

https://buildkite.com/redpanda/vtools/builds/4961#01857415-de4d-45ef-82aa-fd1e1e438729/6-7620

Lazin avatar Jan 05 '23 10:01 Lazin

on (arm64, VM) https://buildkite.com/redpanda/vtools/builds/5007#01857e64-9b47-44f1-ade7-92fcfff06d73

  • test_maintenance.use_rpk=True
  • test_maintenance_sticky.use_rpk=False
  • test_maintenance.use_rpk=False
  • test_maintenance_sticky.use_rpk=True

dlex avatar Jan 06 '23 02:01 dlex

on (arm64, VM) https://buildkite.com/redpanda/vtools/builds/5047#0185838c-0b8d-4784-abe4-d77cf7b61b3e
on (arm64, VM) https://buildkite.com/redpanda/vtools/builds/5007#01858500-14b5-4276-87c4-449e1f6c4880

  • test_maintenance.use_rpk=True
  • test_maintenance.use_rpk=False
  • test_maintenance_sticky.use_rpk=True
  • test_maintenance_sticky.use_rpk=False

dlex avatar Jan 06 '23 18:01 dlex

failed on arm nightly CDT run: https://buildkite.com/redpanda/vtools/builds/5295#0185c159-c9c2-4ea8-9a27-107193993aa2

  • MaintenanceTest.test_maintenance.use_rpk=False (2/4 runs)
  • MaintenanceTest.test_maintenance.use_rpk=True (2/4 runs)
  • MaintenanceTest.test_maintenance_sticky.use_rpk=False (2/4 runs)
  • MaintenanceTest.test_maintenance_sticky.use_rpk=True (2/4 runs)

abhijat avatar Jan 18 '23 05:01 abhijat

job https://buildkite.com/redpanda/vtools/builds/5396#0185db17-163b-49ec-ace4-b4238606be02

FAIL test: MaintenanceTest.test_maintenance.use_rpk=False (1/39 runs)
  failure at 2023-01-23T04:13:45.139Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/5396#0185db17-163b-49ec-ace4-b4238606be02
FAIL test: MaintenanceTest.test_maintenance.use_rpk=True (1/39 runs)
  failure at 2023-01-23T04:13:45.139Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/5396#0185db17-163b-49ec-ace4-b4238606be02
FAIL test: MaintenanceTest.test_maintenance_sticky.use_rpk=False (1/39 runs)
  failure at 2023-01-23T04:13:45.139Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/5396#0185db17-163b-49ec-ace4-b4238606be02
FAIL test: MaintenanceTest.test_maintenance_sticky.use_rpk=True (1/39 runs)
  failure at 2023-01-23T04:13:45.139Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/5396#0185db17-163b-49ec-ace4-b4238606be02

andijcr avatar Jan 23 '23 14:01 andijcr

https://buildkite.com/redpanda/vtools/builds/5426#0185e03d-d646-4dc3-b49b-3ff4cfe3c97a

FAIL test: MaintenanceTest.test_maintenance.use_rpk=False (1/30 runs)
  failure at 2023-01-24T04:13:12.241Z: TimeoutError('')
FAIL test: MaintenanceTest.test_maintenance.use_rpk=True (1/31 runs)
  failure at 2023-01-24T04:13:12.241Z: TimeoutError('')
FAIL test: MaintenanceTest.test_maintenance_sticky.use_rpk=False (1/31 runs)
  failure at 2023-01-24T04:13:12.241Z: TimeoutError('')
FAIL test: MaintenanceTest.test_maintenance_sticky.use_rpk=True (1/30 runs)
  failure at 2023-01-24T04:13:12.241Z: TimeoutError('')

rystsov avatar Jan 24 '23 07:01 rystsov

https://buildkite.com/redpanda/vtools/builds/5455#0185e565-386c-42c5-824d-8cb81a46d627

FAIL test: MaintenanceTest.test_maintenance.use_rpk=False (1/35 runs)
  failure at 2023-01-25T03:40:52.893Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/5455#0185e565-386c-42c5-824d-8cb81a46d627
FAIL test: MaintenanceTest.test_maintenance.use_rpk=True (1/35 runs)
  failure at 2023-01-25T03:40:52.893Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/5455#0185e565-386c-42c5-824d-8cb81a46d627
FAIL test: MaintenanceTest.test_maintenance_sticky.use_rpk=False (1/35 runs)
  failure at 2023-01-25T03:40:52.893Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/5455#0185e565-386c-42c5-824d-8cb81a46d627
FAIL test: MaintenanceTest.test_maintenance_sticky.use_rpk=True (1/35 runs)
  failure at 2023-01-25T03:40:52.893Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/5455#0185e565-386c-42c5-824d-8cb81a46d627

andijcr avatar Jan 25 '23 14:01 andijcr

https://buildkite.com/redpanda/vtools/builds/5500#0185efb8-35af-4ec9-8e11-2d6d69c3d310

FAIL test: MaintenanceTest.test_maintenance.use_rpk=False (1/42 runs)
  failure at 2023-01-27T03:54:39.906Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/5500#0185efb8-35af-4ec9-8e11-2d6d69c3d310
FAIL test: MaintenanceTest.test_maintenance.use_rpk=True (1/41 runs)
  failure at 2023-01-27T03:54:39.906Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/5500#0185efb8-35af-4ec9-8e11-2d6d69c3d310
FAIL test: MaintenanceTest.test_maintenance_sticky.use_rpk=False (1/43 runs)
  failure at 2023-01-27T03:54:39.906Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/5500#0185efb8-35af-4ec9-8e11-2d6d69c3d310
FAIL test: MaintenanceTest.test_maintenance_sticky.use_rpk=True (1/42 runs)
  failure at 2023-01-27T03:54:39.906Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/5500#0185efb8-35af-4ec9-8e11-2d6d69c3d310

andijcr avatar Jan 27 '23 10:01 andijcr

I paused looking at this when @ballard26 indicated that many, most, or maybe all of our ARM issues were resource issues. @ballard26 do you have a hunch that that might be true here? If not, then I may start looking at this again. Thanks!

dotnwat avatar Jan 27 '23 19:01 dotnwat

@dotnwat It is sort of related to the resource issues. The nightly ARM CDT cluster has 16 shards per node, or 48 shards across the 3 nodes this test uses, while the nightly x86 cluster has only 4 shards per node, or 12 shards across 3 nodes.

I think if we increase the partition count for the test, things will work as expected. Right now the balancer doesn't care which node a given shard is on; it only cares that every shard has an equal number of leadership assignments. In this test we are trying to balance leadership for 30 partitions. When the balancer is asked to evenly balance 30 leadership assignments over 48 shards it'll do its best, but obviously not every shard can be assigned a leader.

So when one node goes down (or is drained of leadership), it rebalances so that the remaining shards have equal leadership. That still leaves 32 shards for 30 partitions, so each shard will be the leader of at most one partition. When the node comes back up, it's still the case that each shard is the leader of at most one partition. And since the leader balancer doesn't care which node a shard is on, it concludes that the cluster is as balanced as it's going to be.

Increasing the partition count in the test to anything greater than 32 will ensure that at least one shard has two or more leadership assignments. So when the node exits maintenance mode, the balancer will be able to move some leadership back to it.
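
To make the arithmetic concrete, here is a rough toy model in Python. It is not the real leader balancer: `balance` and `leaders_on_returning_node` are hypothetical names, and the greedy "move one leadership from the fullest shard to the emptiest while the gap is at least 2" rule is an assumption standing in for the actual algorithm. The only property it shares with the description above is that it looks at per-shard counts and ignores which node a shard lives on.

```python
def balance(shards, counts):
    """Node-blind greedy balancing (toy stand-in, not the real algorithm).

    Moves one leadership at a time from the most loaded shard to the least
    loaded shard, and stops once no single move can improve the spread.
    """
    moves = 0
    while True:
        fullest = max(shards, key=lambda s: counts[s])
        emptiest = min(shards, key=lambda s: counts[s])
        if counts[fullest] - counts[emptiest] < 2:
            return moves
        counts[fullest] -= 1
        counts[emptiest] += 1
        moves += 1


def leaders_on_returning_node(nodes, shards_per_node, partitions):
    """Leaderships held by node 0 after it is drained and then returns."""
    shards = [(n, c) for n in range(nodes) for c in range(shards_per_node)]
    counts = {s: 0 for s in shards}

    # Initial placement: spread leaderships round-robin over all shards.
    for p in range(partitions):
        counts[shards[p % len(shards)]] += 1

    # Node 0 enters maintenance mode: hand its leaderships to the least
    # loaded shards on the surviving nodes, then balance the survivors.
    survivors = [s for s in shards if s[0] != 0]
    for s in shards:
        while s[0] == 0 and counts[s] > 0:
            target = min(survivors, key=lambda t: counts[t])
            counts[s] -= 1
            counts[target] += 1
    balance(survivors, counts)

    # Node 0 exits maintenance mode: balance across every shard again.
    balance(shards, counts)
    return sum(c for (node, _), c in counts.items() if node == 0)


print(leaders_on_returning_node(3, 16, 30))  # ARM CDT shape, 30 partitions -> 0
print(leaders_on_returning_node(3, 16, 33))  # ARM CDT shape, 33 partitions -> 1
print(leaders_on_returning_node(3, 4, 30))   # x86 shape, 30 partitions
```

Under these assumptions the 16-shard nodes never get leadership back with 30 partitions, because no shard ever holds more than one leadership, while 33 partitions or the 4-shard x86 shape leave at least one shard holding two or more, so a node-blind balancer still has a move to make.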

ballard26 avatar Jan 27 '23 21:01 ballard26

The system may have been getting into this state due to resource issues, but the thing that @jcsp points out above does indicate an underlying bug too:

https://github.com/redpanda-data/redpanda/issues/7428#issuecomment-1323704865

Maybe we should at least improve the logging to include more information about group muting and/or how muted groups affected the balancing results (though the latter may not be easy because of how the groups are skipped during the calculation).

travisdowns avatar Feb 02 '23 13:02 travisdowns