redpanda icon indicating copy to clipboard operation
redpanda copied to clipboard

Failure in PartitionBalancerTest.test_unavailable_nodes

Open r-vasquez opened this issue 3 years ago • 13 comments
trafficstars

Build: https://buildkite.com/redpanda/redpanda/builds/12554#0181fda7-1a9d-47cd-86fb-775767011acb

test_id:    rptest.tests.partition_balancer_test.PartitionBalancerTest.test_unavailable_nodes
--
  | status:     FAIL
  | run time:   6 minutes 33.620 seconds
  |  
  |  
  | NodeCrash([(<ducktape.cluster.cluster.ClusterNode object at 0x7ff7d826b0a0>, 'Redpanda process unexpectedly stopped')])
  | Traceback (most recent call last):
  | File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
  | r = f(self, *args, **kwargs)
  | File "/root/tests/rptest/tests/partition_balancer_test.py", line 126, in test_unavailable_nodes
  | self.wait_until_ready()
  | File "/root/tests/rptest/tests/partition_balancer_test.py", line 86, in wait_until_ready
  | return self.wait_until_status(
  | File "/root/tests/rptest/tests/partition_balancer_test.py", line 79, in wait_until_status
  | return wait_until_result(
  | File "/root/tests/rptest/util.py", line 110, in wait_until_result
  | wait_until(wrapped_condition, *args, **kwargs)
  | File "/root/tests/rptest/util.py", line 68, in wait_until
  | raise e
  | File "/root/tests/rptest/util.py", line 61, in wait_until
  | if condition():
  | File "/root/tests/rptest/util.py", line 97, in wrapped_condition
  | cond = condition()
  | File "/root/tests/rptest/tests/partition_balancer_test.py", line 71, in check
  | status = admin.get_partition_balancer_status(timeout=1)
  | File "/root/tests/rptest/services/admin.py", line 702, in get_partition_balancer_status
  | return self._request("GET",
  | File "/root/tests/rptest/services/admin.py", line 334, in _request
  | r.raise_for_status()
  | File "/usr/local/lib/python3.9/dist-packages/requests/models.py", line 941, in raise_for_status
  | raise HTTPError(http_error_msg, response=self)
  | requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http://docker-rp-6:9644/v1/cluster/partition_balancer/status
  |  
  | During handling of the above exception, another exception occurred:
  |  
  | Traceback (most recent call last):
  | File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 135, in run
  | data = self.run_test()
  | File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
  | return self.test_context.function(self.test)
  | File "/root/tests/rptest/services/cluster.py", line 38, in wrapped
  | self.redpanda.raise_on_crash()
  | File "/root/tests/rptest/services/redpanda.py", line 1030, in raise_on_crash
  | raise NodeCrash(crashes)
  | rptest.services.utils.NodeCrash: <NodeCrash docker-rp-5: Redpanda process unexpectedly stopped>

r-vasquez avatar Jul 14 '22 18:07 r-vasquez

another failure: https://buildkite.com/redpanda/vtools/builds/2897#0182159d-22f0-4bf8-93c9-317e0a909eea

VladLazar avatar Jul 19 '22 12:07 VladLazar

Reason of Failure: We are trying to get status from node-rp-1, which is not controller. So node-rp-1 is proxying request to node-rp-9, which was controller. But node-rp-9 is turned off in test. Leader table didn't update, so we get this error.

To fix it we can add retries to status request

ZeDRoman avatar Jul 19 '22 15:07 ZeDRoman

@ZeDRoman : Thanks, but why the crash in RP?

piyushredpanda avatar Jul 19 '22 20:07 piyushredpanda

@ZeDRoman : Thanks, but why the crash in RP?

Last crash in log happens because we don't start node if test is failed in the middle of evaluation We stop nodes to test rebalancing. So if test is crashed before we start node again, we will see that error This error will not occur if there won't be any other error

ZeDRoman avatar Jul 19 '22 21:07 ZeDRoman

Thanks, @ZeDRoman : so the "crash" here is what we did to bring Redpanda down to test rebalancing. Is that correct? I wonder if all test failures will show up as RP crash, then? If yes, is there a way for the test to be structured that this is not the case (it would raise alarms very quickly on seeing RP crash in test).

piyushredpanda avatar Jul 19 '22 21:07 piyushredpanda

so the "crash" here is what we did to bring Redpanda down to test rebalancing. Is that correct?

Yes. We crash nodes on our own to trigger rebalancing

I wonder if all test failures will show up as RP crash, then? If yes, is there a way for the test to be structured that this is not the case (it would raise alarms very quickly on seeing RP crash in test).

As I know, now every test will show that log if redpanda node is down at the end of the test evaluation. This check is part of out implementation of redpanda service in ducktape. I didn't see any approach to avoid this.

ZeDRoman avatar Jul 19 '22 21:07 ZeDRoman

Let's chat with the team on some ideas to make this better.

For this ticket: so it is a test-issue that you plan to have a PR to retry status, yes?

piyushredpanda avatar Jul 19 '22 21:07 piyushredpanda

For this ticket: so it is a test-issue that you plan to have a PR to retry status, yes?

yes I will create pr soon

ZeDRoman avatar Jul 19 '22 21:07 ZeDRoman

Again https://buildkite.com/redpanda/redpanda/builds/12972#018229b7-81d0-4ed8-9129-de98e1bb9e5d

VadimPlh avatar Jul 25 '22 10:07 VadimPlh

As per the discussion with @mmaslankaprv this morning, missing_node_rpc_client error can be a consequence of the node trying to make an rpc request to itself. This can happen if the node is a provisional controller leader (hasn't yet committed the first batch in its leadership term).

We need to check for this situation (leader_id == self, but is_leader == false) in admin_server.cc and return cluster::errc::no_leader_controller

ztlpn avatar Jul 25 '22 13:07 ztlpn

Seen in CI failures today here.

ajfabbri avatar Jul 28 '22 04:07 ajfabbri

This one looks the same: https://ci-artifacts.dev.vectorized.cloud/redpanda/0182806b-2b0f-4933-9c42-ab3ef015faab/vbuild/ducktape/results/2022-08-09--001/PartitionBalancerTest/test_rack_awareness/34/

BenPope avatar Aug 09 '22 15:08 BenPope

[DEBUG - 2022-08-09 03:07:42,301 - admin - _request - lineno:303]: Dispatching GET http://docker-rp-13:9644/v1/cluster/partition_balancer/status
[WARNING - 2022-08-09 03:07:43,271 - admin - _request - lineno:321]: Response 500: {"message": "Unexpected error: rpc::errc::disconnected_endpoint(node down)", "code": 500}

There was a problem with completing rpc requests around this time. The problem is rather inexplicable because both nodes were online and rpc requests timed out only in one direction. So my bet would be on an rpc bug conjectured in https://github.com/redpanda-data/redpanda/issues/5608#issuecomment-1203527881

ztlpn avatar Aug 10 '22 02:08 ztlpn

BTW the test will be "fixed" once #5916 merges (admin server will return 503 which is a retryable error code).

ztlpn avatar Aug 10 '22 23:08 ztlpn

Admin server will now return 503 and deeper rpc problems are tracked in #6005. Going to close.

ztlpn avatar Aug 15 '22 13:08 ztlpn