redpanda
redpanda copied to clipboard
CI Failure in partition_balancer_test.PartitionBalancerTest.test_full_nodes
Version & Environment
Redpanda version: dev
What went wrong?
CI Failure
What should have happened instead?
Ci Success
How to reproduce the issue?
??
Additional information
CI Failure: https://buildkite.com/redpanda/redpanda/builds/13764#01827be6-58e1-4f56-8692-47e1ef55e0b5
[WARNING - 2022-08-08 06:18:22,909 - service_registry - free_all - lineno:83]: Error cleaning service <FranzGoVerifiableProducer-0-140169200079712: num_nodes: 1, nodes: ['docker-rp-24']>: 'super' object has no attribute 'free_all'
[INFO - 2022-08-08 06:18:22,909 - runner_client - log - lineno:278]: RunnerClient: rptest.tests.partition_balancer_test.PartitionBalancerTest.test_full_nodes: Summary: AssertionError()
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
data = self.run_test()
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
return self.test_context.function(self.test)
File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
r = f(self, *args, **kwargs)
File "/root/tests/rptest/tests/partition_balancer_test.py", line 448, in test_full_nodes
assert used_ratio < 0.8
AssertionError
Also got a crash in this test: https://buildkite.com/redpanda/redpanda/builds/13842#0182811f-0072-4f5d-9220-359c79d86828
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
data = self.run_test()
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
return self.test_context.function(self.test)
File "/root/tests/rptest/services/cluster.py", line 38, in wrapped
self.redpanda.raise_on_crash()
File "/root/tests/rptest/services/redpanda.py", line 1048, in raise_on_crash
raise NodeCrash(crashes)
rptest.services.utils.NodeCrash: <NodeCrash (docker-rp-8,docker-rp-20,docker-rp-21,docker-rp-22) docker-rp-8: ERROR 2022-08-09 07:01:31,572 [shard 0] assert - Assert failure: (../../../src/v/cluster/node/local_monitor.cc:109) 'used < *_disk_size_for_test' mock disk size 2147483648 must be > used size 2333523968
>
@ajfabbri what do you think?
I got a failure in this test but different to Ben's: this was complaining about BadLogLines:
https://buildkite.com/redpanda/redpanda/builds/13828#01827fde-bcb1-487b-b14f-9ad6dded442f
--------------------------------------------------------------------------------
--
[INFO:2022-08-09 00:51:17,651]: RunnerClient: rptest.tests.partition_balancer_test.PartitionBalancerTest.test_full_nodes: FAIL: <BadLogLines nodes=docker-rp-20(1) example="ERROR 2022-08-09 00:50:53,785 [shard 0] cluster - storage space alert: free space at 4.050% on /var/lib/redpanda/data: 2.000GiB total, 82.941MiB free, min. free 0.000bytes. Please adjust retention policies as needed to avoid running out of space">
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
data = self.run_test()
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
return self.test_context.function(self.test)
File "/root/tests/rptest/services/cluster.py", line 48, in wrapped
self.redpanda.raise_on_bad_logs(allow_list=log_allow_list)
File "/root/tests/rptest/services/redpanda.py", line 1126, in raise_on_bad_logs
raise BadLogLines(bad_lines)
rptest.services.utils.BadLogLines: <BadLogLines nodes=docker-rp-20(1) example="ERROR 2022-08-09 00:50:53,785 [shard 0] cluster - storage space alert: free space at 4.050% on /var/lib/redpanda/data: 2.000GiB total, 82.941MiB free, min. free 0.000bytes. Please adjust retention policies as needed to avoid running out of space">
We investigated this one with @mmaslankaprv this morning - basically it is caused by stale health monitor data after a controller leader changed (partition balancer thought that partition sizes were zero when in reality they were quite big). Should be fixed by #5922
should be fixed by #5922
Saw another instance of this
FAIL test: PartitionBalancerTest.test_full_nodes (1/24 runs)
failure at 2022-08-16T16:42:59.830Z: AssertionError()
in job https://buildkite.com/redpanda/redpanda/builds/14220#0182a71a-48cc-4534-8c04-7d2bea40908c
Stack trace:
====================================================================================================
test_id: rptest.tests.partition_balancer_test.PartitionBalancerTest.test_full_nodes
status: FAIL
run time: 1 minute 20.539 seconds
AssertionError()
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
data = self.run_test()
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
return self.test_context.function(self.test)
File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
r = f(self, *args, **kwargs)
File "/root/tests/rptest/tests/partition_balancer_test.py", line 473, in test_full_nodes
assert used_ratio < 0.8
AssertionError
Another instance https://buildkite.com/redpanda/redpanda/builds/14311#0182b1e9-cd33-421b-a908-d7400ba039df
Another instance https://ci-artifacts.dev.vectorized.cloud/redpanda/0182cb94-7664-473d-9208-0445f34e5bce/vbuild/ducktape/results/2022-08-23--001/PartitionBalancerTest/test_full_nodes/113/
Another instance https://ci-artifacts.dev.vectorized.cloud/redpanda/0182ae3b-de90-4596-bfb2-bd4ad36a228c/vbuild/ducktape/results/2022-08-18--001/report.html
Another instance on 19.09.2022:
https://buildkite.com/redpanda/redpanda/builds/15468#01835650-c8ef-448b-a958-f521a6fd0569
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
data = self.run_test()
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
return self.test_context.function(self.test)
File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
r = f(self, *args, **kwargs)
File "/root/tests/rptest/tests/partition_balancer_test.py", line 531, in test_full_nodes
assert used_ratio < 0.8
AssertionError
So the problem is that after the partition balancer becomes ready but before we check the disk usage ratio something may happen that will increase disk usage for ~10-20MiB (like the leader balancer kicking in and forcing a segment roll along with associated segment fallocation) and increase the node disk usage over the limit.
I'm preparing a pr that will check that the partition balancer actually stabilized (i.e. the status was "ready" for a few ticks) but we may have to implement a few other improvements:
- Not directly related to this test failure, but it turned out the balancer underestimates the partition size a bit: #6538
- The situation when a node is oscillating near the disk usage limit is clearly undesireable. We may need to introduce some kind of hysteresis (i.e. if node goes over 80%, bring it down to 75% or 80% - 100MiB or bring it down to the average disk usage or something similar).