redpanda icon indicating copy to clipboard operation
redpanda copied to clipboard

CI Failure in partition_balancer_test.PartitionBalancerTest.test_full_nodes

Open BenPope opened this issue 3 years ago • 10 comments

Version & Environment

Redpanda version: dev

What went wrong?

CI Failure

What should have happened instead?

Ci Success

How to reproduce the issue?

??

Additional information

CI Failure: https://buildkite.com/redpanda/redpanda/builds/13764#01827be6-58e1-4f56-8692-47e1ef55e0b5

[WARNING - 2022-08-08 06:18:22,909 - service_registry - free_all - lineno:83]: Error cleaning service <FranzGoVerifiableProducer-0-140169200079712: num_nodes: 1, nodes: ['docker-rp-24']>: 'super' object has no attribute 'free_all'
[INFO  - 2022-08-08 06:18:22,909 - runner_client - log - lineno:278]: RunnerClient: rptest.tests.partition_balancer_test.PartitionBalancerTest.test_full_nodes: Summary: AssertionError()
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/partition_balancer_test.py", line 448, in test_full_nodes
    assert used_ratio < 0.8
AssertionError

BenPope avatar Aug 08 '22 07:08 BenPope

Also got a crash in this test: https://buildkite.com/redpanda/redpanda/builds/13842#0182811f-0072-4f5d-9220-359c79d86828

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 38, in wrapped
    self.redpanda.raise_on_crash()
  File "/root/tests/rptest/services/redpanda.py", line 1048, in raise_on_crash
    raise NodeCrash(crashes)
rptest.services.utils.NodeCrash: <NodeCrash (docker-rp-8,docker-rp-20,docker-rp-21,docker-rp-22) docker-rp-8: ERROR 2022-08-09 07:01:31,572 [shard 0] assert - Assert failure: (../../../src/v/cluster/node/local_monitor.cc:109) 'used < *_disk_size_for_test' mock disk size 2147483648 must be > used size 2333523968
>

BenPope avatar Aug 09 '22 14:08 BenPope

@ajfabbri what do you think?

dotnwat avatar Aug 09 '22 18:08 dotnwat

I got a failure in this test but different to Ben's: this was complaining about BadLogLines:

https://buildkite.com/redpanda/redpanda/builds/13828#01827fde-bcb1-487b-b14f-9ad6dded442f

--------------------------------------------------------------------------------
--


[INFO:2022-08-09 00:51:17,651]: RunnerClient: rptest.tests.partition_balancer_test.PartitionBalancerTest.test_full_nodes: FAIL: <BadLogLines nodes=docker-rp-20(1) example="ERROR 2022-08-09 00:50:53,785 [shard 0] cluster - storage space alert: free space at 4.050% on /var/lib/redpanda/data: 2.000GiB total, 82.941MiB free, min. free 0.000bytes. Please adjust retention policies as needed to avoid running out of space">

Traceback (most recent call last):

  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run

    data = self.run_test()

  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test

    return self.test_context.function(self.test)

  File "/root/tests/rptest/services/cluster.py", line 48, in wrapped

    self.redpanda.raise_on_bad_logs(allow_list=log_allow_list)

  File "/root/tests/rptest/services/redpanda.py", line 1126, in raise_on_bad_logs

    raise BadLogLines(bad_lines)

rptest.services.utils.BadLogLines: <BadLogLines nodes=docker-rp-20(1) example="ERROR 2022-08-09 00:50:53,785 [shard 0] cluster - storage space alert: free space at 4.050% on /var/lib/redpanda/data: 2.000GiB total, 82.941MiB free, min. free 0.000bytes. Please adjust retention policies as needed to avoid running out of space">

travisdowns avatar Aug 09 '22 18:08 travisdowns

We investigated this one with @mmaslankaprv this morning - basically it is caused by stale health monitor data after a controller leader changed (partition balancer thought that partition sizes were zero when in reality they were quite big). Should be fixed by #5922

ztlpn avatar Aug 09 '22 18:08 ztlpn

should be fixed by #5922

ztlpn avatar Aug 11 '22 09:08 ztlpn

Saw another instance of this

FAIL test: PartitionBalancerTest.test_full_nodes (1/24 runs)
  failure at 2022-08-16T16:42:59.830Z: AssertionError()
      in job https://buildkite.com/redpanda/redpanda/builds/14220#0182a71a-48cc-4534-8c04-7d2bea40908c

Stack trace:

====================================================================================================
test_id:    rptest.tests.partition_balancer_test.PartitionBalancerTest.test_full_nodes
status:     FAIL
run time:   1 minute 20.539 seconds


    AssertionError()
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/partition_balancer_test.py", line 473, in test_full_nodes
    assert used_ratio < 0.8
AssertionError

andrwng avatar Aug 17 '22 18:08 andrwng

Another instance https://buildkite.com/redpanda/redpanda/builds/14311#0182b1e9-cd33-421b-a908-d7400ba039df

rystsov avatar Aug 19 '22 03:08 rystsov

Another instance https://ci-artifacts.dev.vectorized.cloud/redpanda/0182cb94-7664-473d-9208-0445f34e5bce/vbuild/ducktape/results/2022-08-23--001/PartitionBalancerTest/test_full_nodes/113/

bharathv avatar Aug 24 '22 04:08 bharathv

Another instance https://ci-artifacts.dev.vectorized.cloud/redpanda/0182ae3b-de90-4596-bfb2-bd4ad36a228c/vbuild/ducktape/results/2022-08-18--001/report.html

andrwng avatar Aug 24 '22 21:08 andrwng

Another instance on 19.09.2022:

https://buildkite.com/redpanda/redpanda/builds/15468#01835650-c8ef-448b-a958-f521a6fd0569

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/partition_balancer_test.py", line 531, in test_full_nodes
    assert used_ratio < 0.8
AssertionError

VladLazar avatar Sep 20 '22 08:09 VladLazar

So the problem is that after the partition balancer becomes ready but before we check the disk usage ratio something may happen that will increase disk usage for ~10-20MiB (like the leader balancer kicking in and forcing a segment roll along with associated segment fallocation) and increase the node disk usage over the limit.

I'm preparing a pr that will check that the partition balancer actually stabilized (i.e. the status was "ready" for a few ticks) but we may have to implement a few other improvements:

  1. Not directly related to this test failure, but it turned out the balancer underestimates the partition size a bit: #6538
  2. The situation when a node is oscillating near the disk usage limit is clearly undesireable. We may need to introduce some kind of hysteresis (i.e. if node goes over 80%, bring it down to 75% or 80% - 100MiB or bring it down to the average disk usage or something similar).

ztlpn avatar Sep 27 '22 14:09 ztlpn