
CI Failure in controller_upgrade_test.ControllerUpgradeTest.test_updating_cluster_when_executing_operations

Open BenPope opened this issue 2 years ago • 2 comments

Version & Environment

Redpanda version: dev

https://buildkite.com/redpanda/redpanda/builds/13761#01827bd6-e3be-48dc-8851-3662e1bca36b

What went wrong?

CI Failure

What should have happened instead?

CI Success

How to reproduce the issue?

???

Additional information

[INFO  - 2022-08-08 05:52:40,466 - runner_client - log - lineno:278]: RunnerClient: rptest.tests.controller_upgrade_test.ControllerUpgradeTest.test_updating_cluster_when_executing_operations: FAIL: <BadLogLines nodes=docker-rp-10(4) example="ERROR 2022-08-08 05:48:52,100 [shard 0] rpc - Service handler threw an exception: std::exception (std::exception)">
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 48, in wrapped
    self.redpanda.raise_on_bad_logs(allow_list=log_allow_list)
  File "/root/tests/rptest/services/redpanda.py", line 1121, in raise_on_bad_logs
    raise BadLogLines(bad_lines)
rptest.services.utils.BadLogLines: <BadLogLines nodes=docker-rp-10(4) example="ERROR 2022-08-08 05:48:52,100 [shard 0] rpc - Service handler threw an exception: std::exception (std::exception)">

And from docker-rp-10, during shutdown:

TRACE 2022-08-08 05:48:52,099 [shard 0] kafka - fetch.cc:548 - handling fetch request: {replica_id=-1 max_wait_ms=500 min_bytes=1 max_bytes=52428800 isolation_level=0 session_id=1 session_epoch=2894 topics={{name={topic-gpwztznbjk} fetch_partitions={{partition_index=0 current_leader_epoch=-1 fetch_offset=15425 log_start_offset=-1 max_bytes=1048576}}}} forgotten={} rack_id=}
INFO  2022-08-08 05:48:52,099 [shard 0] redpanda::main - application.cc:200 - Stopping...
INFO  2022-08-08 05:48:52,099 [shard 0] rpc - server.cc:242 - kafka rpc protocol - Stopping 2 listeners
TRACE 2022-08-08 05:48:52,099 [shard 1] kafka - fetch.cc:337 - fetch_ntps_in_parallel: for 2 partitions returning 0 total bytes
DEBUG 2022-08-08 05:48:52,099 [shard 0] rpc - server.cc:246 - kafka rpc protocol - Service probes {connects: 16, current connections: 2, connection close errors: 0, requests completed: 2947, received bytes: 342679, sent bytes: 733673, corrupted headers: 0, method not found errors: 0, requests blocked by memory: 0}
INFO  2022-08-08 05:48:52,099 [shard 0] rpc - server.cc:251 - kafka rpc protocol - Shutting down 2 connections
INFO  2022-08-08 05:48:52,099 [shard 1] rpc - server.cc:242 - kafka rpc protocol - Stopping 2 listeners
DEBUG 2022-08-08 05:48:52,099 [shard 1] rpc - server.cc:246 - kafka rpc protocol - Service probes {connects: 17, current connections: 2, connection close errors: 0, requests completed: 3270, received bytes: 666172, sent bytes: 232632, corrupted headers: 0, method not found errors: 0, requests blocked by memory: 0}
INFO  2022-08-08 05:48:52,099 [shard 1] rpc - server.cc:251 - kafka rpc protocol - Shutting down 2 connections
TRACE 2022-08-08 05:48:52,099 [shard 0] rpc - conn_quota.cc:118 - put(172.16.16.23)
TRACE 2022-08-08 05:48:52,099 [shard 1] rpc - conn_quota.cc:118 - put(172.16.16.18)
TRACE 2022-08-08 05:48:52,099 [shard 1] rpc - conn_quota.cc:118 - put(172.16.16.23)
TRACE 2022-08-08 05:48:52,099 [shard 0] cluster - config_frontend.cc:118 - set_status from node 4: {cluster_status: node 4, version: 5, restart: true (0 invalid, 0 unknown)}
TRACE 2022-08-08 05:48:52,099 [shard 0] raft - [group_id:0, {redpanda/controller/0}] replicate_entries_stm.cc:186 - Self append entries - {raft_group:{0}, commit_index:{85}, term:{1}, prev_log_index:{85}, prev_log_term:{1}}
TRACE 2022-08-08 05:48:52,099 [shard 0] raft - [group_id:0, {redpanda/controller/0}] replicate_entries_stm.cc:196 - Leader append result: {append_time:2656360, base_offset:{86}, last_offset:{86}, byte_size:95}
TRACE 2022-08-08 05:48:52,099 [shard 0] raft - [group_id:0, {redpanda/controller/0}] replicate_entries_stm.cc:96 - Sending append entries request {raft_group:{0}, commit_index:{85}, term:{1}, prev_log_index:{85}, prev_log_term:{1}} to {id: {4}, revision: {0}}
TRACE 2022-08-08 05:48:52,099 [shard 0] raft - [group_id:0, {redpanda/controller/0}] replicate_entries_stm.cc:96 - Sending append entries request {raft_group:{0}, commit_index:{85}, term:{1}, prev_log_index:{85}, prev_log_term:{1}} to {id: {2}, revision: {0}}
TRACE 2022-08-08 05:48:52,099 [shard 0] raft - [group_id:0, {redpanda/controller/0}] replicate_entries_stm.cc:96 - Sending append entries request {raft_group:{0}, commit_index:{85}, term:{1}, prev_log_index:{85}, prev_log_term:{1}} to {id: {3}, revision: {0}}
TRACE 2022-08-08 05:48:52,099 [shard 0] raft - [group_id:0, {redpanda/controller/0}] replicate_entries_stm.cc:96 - Sending append entries request {raft_group:{0}, commit_index:{85}, term:{1}, prev_log_index:{85}, prev_log_term:{1}} to {id: {5}, revision: {0}}
INFO  2022-08-08 05:48:52,099 [shard 0] rpc - server.cc:242 - vectorized internal rpc protocol - Stopping 1 listeners
DEBUG 2022-08-08 05:48:52,099 [shard 0] rpc - server.cc:246 - vectorized internal rpc protocol - Service probes {connects: 12, current connections: 4, connection close errors: 0, requests completed: 1851, received bytes: 672827, sent bytes: 171179, corrupted headers: 8, method not found errors: 0, requests blocked by memory: 0}
INFO  2022-08-08 05:48:52,099 [shard 0] rpc - server.cc:251 - vectorized internal rpc protocol - Shutting down 4 connections
INFO  2022-08-08 05:48:52,099 [shard 1] rpc - server.cc:242 - vectorized internal rpc protocol - Stopping 1 listeners
TRACE 2022-08-08 05:48:52,099 [shard 0] cluster - config_frontend.cc:118 - set_status from node 3: {cluster_status: node 3, version: 5, restart: true (0 invalid, 0 unknown)}
DEBUG 2022-08-08 05:48:52,099 [shard 1] rpc - server.cc:246 - vectorized internal rpc protocol - Service probes {connects: 3, current connections: 3, connection close errors: 0, requests completed: 5861, received bytes: 2335698, sent bytes: 525907, corrupted headers: 0, method not found errors: 0, requests blocked by memory: 0}
INFO  2022-08-08 05:48:52,099 [shard 1] rpc - server.cc:251 - vectorized internal rpc protocol - Shutting down 3 connections
TRACE 2022-08-08 05:48:52,099 [shard 0] raft - [group_id:0, {redpanda/controller/0}] replicate_entries_stm.cc:186 - Self append entries - {raft_group:{0}, commit_index:{85}, term:{1}, prev_log_index:{86}, prev_log_term:{1}}
DEBUG 2022-08-08 05:48:52,099 [shard 1] rpc - could not parse header from client: 172.16.16.22:63683
TRACE 2022-08-08 05:48:52,099 [shard 0] raft - [group_id:0, {redpanda/controller/0}] replicate_entries_stm.cc:196 - Leader append result: {append_time:2656360, base_offset:{87}, last_offset:{87}, byte_size:95}
DEBUG 2022-08-08 05:48:52,099 [shard 1] rpc - could not parse header from client: 172.16.16.10:59575
DEBUG 2022-08-08 05:48:52,099 [shard 1] rpc - could not parse header from client: 172.16.16.21:57833
TRACE 2022-08-08 05:48:52,099 [shard 0] raft - [group_id:0, {redpanda/controller/0}] replicate_entries_stm.cc:96 - Sending append entries request {raft_group:{0}, commit_index:{85}, term:{1}, prev_log_index:{86}, prev_log_term:{1}} to {id: {5}, revision: {0}}
TRACE 2022-08-08 05:48:52,099 [shard 0] raft - [group_id:0, {redpanda/controller/0}] replicate_entries_stm.cc:96 - Sending append entries request {raft_group:{0}, commit_index:{85}, term:{1}, prev_log_index:{86}, prev_log_term:{1}} to {id: {2}, revision: {0}}
TRACE 2022-08-08 05:48:52,099 [shard 0] raft - [group_id:0, {redpanda/controller/0}] replicate_entries_stm.cc:96 - Sending append entries request {raft_group:{0}, commit_index:{85}, term:{1}, prev_log_index:{86}, prev_log_term:{1}} to {id: {3}, revision: {0}}
TRACE 2022-08-08 05:48:52,099 [shard 0] raft - [group_id:0, {redpanda/controller/0}] replicate_entries_stm.cc:96 - Sending append entries request {raft_group:{0}, commit_index:{85}, term:{1}, prev_log_index:{86}, prev_log_term:{1}} to {id: {4}, revision: {0}}
TRACE 2022-08-08 05:48:52,099 [shard 0] cluster - config_frontend.cc:118 - set_status from node 2: {cluster_status: node 2, version: 5, restart: true (0 invalid, 0 unknown)}
TRACE 2022-08-08 05:48:52,099 [shard 0] raft - [group_id:0, {redpanda/controller/0}] replicate_entries_stm.cc:186 - Self append entries - {raft_group:{0}, commit_index:{85}, term:{1}, prev_log_index:{87}, prev_log_term:{1}}
TRACE 2022-08-08 05:48:52,099 [shard 0] raft - [group_id:0, {redpanda/controller/0}] replicate_entries_stm.cc:196 - Leader append result: {append_time:2656360, base_offset:{88}, last_offset:{88}, byte_size:95}
TRACE 2022-08-08 05:48:52,099 [shard 0] raft - [group_id:0, {redpanda/controller/0}] replicate_entries_stm.cc:96 - Sending append entries request {raft_group:{0}, commit_index:{85}, term:{1}, prev_log_index:{87}, prev_log_term:{1}} to {id: {3}, revision: {0}}
TRACE 2022-08-08 05:48:52,099 [shard 0] raft - [group_id:0, {redpanda/controller/0}] replicate_entries_stm.cc:96 - Sending append entries request {raft_group:{0}, commit_index:{85}, term:{1}, prev_log_index:{87}, prev_log_term:{1}} to {id: {5}, revision: {0}}
TRACE 2022-08-08 05:48:52,099 [shard 0] raft - [group_id:0, {redpanda/controller/0}] replicate_entries_stm.cc:96 - Sending append entries request {raft_group:{0}, commit_index:{85}, term:{1}, prev_log_index:{87}, prev_log_term:{1}} to {id: {4}, revision: {0}}
TRACE 2022-08-08 05:48:52,099 [shard 0] raft - [group_id:0, {redpanda/controller/0}] replicate_entries_stm.cc:96 - Sending append entries request {raft_group:{0}, commit_index:{85}, term:{1}, prev_log_index:{87}, prev_log_term:{1}} to {id: {2}, revision: {0}}
DEBUG 2022-08-08 05:48:52,099 [shard 0] rpc - could not parse header from client: 172.16.16.21:64008
TRACE 2022-08-08 05:48:52,099 [shard 0] cluster - config_frontend.cc:118 - set_status from node 5: {cluster_status: node 5, version: 5, restart: true (0 invalid, 0 unknown)}
TRACE 2022-08-08 05:48:52,099 [shard 0] raft - [group_id:0, {redpanda/controller/0}] replicate_entries_stm.cc:186 - Self append entries - {raft_group:{0}, commit_index:{85}, term:{1}, prev_log_index:{88}, prev_log_term:{1}}
TRACE 2022-08-08 05:48:52,100 [shard 0] raft - [group_id:0, {redpanda/controller/0}] replicate_entries_stm.cc:196 - Leader append result: {append_time:2656360, base_offset:{89}, last_offset:{89}, byte_size:95}
TRACE 2022-08-08 05:48:52,100 [shard 0] raft - [group_id:0, {redpanda/controller/0}] replicate_entries_stm.cc:96 - Sending append entries request {raft_group:{0}, commit_index:{85}, term:{1}, prev_log_index:{88}, prev_log_term:{1}} to {id: {4}, revision: {0}}
TRACE 2022-08-08 05:48:52,100 [shard 0] raft - [group_id:0, {redpanda/controller/0}] replicate_entries_stm.cc:96 - Sending append entries request {raft_group:{0}, commit_index:{85}, term:{1}, prev_log_index:{88}, prev_log_term:{1}} to {id: {2}, revision: {0}}
TRACE 2022-08-08 05:48:52,100 [shard 0] raft - [group_id:0, {redpanda/controller/0}] replicate_entries_stm.cc:96 - Sending append entries request {raft_group:{0}, commit_index:{85}, term:{1}, prev_log_index:{88}, prev_log_term:{1}} to {id: {3}, revision: {0}}
TRACE 2022-08-08 05:48:52,100 [shard 0] raft - [group_id:0, {redpanda/controller/0}] replicate_entries_stm.cc:96 - Sending append entries request {raft_group:{0}, commit_index:{85}, term:{1}, prev_log_index:{88}, prev_log_term:{1}} to {id: {5}, revision: {0}}
DEBUG 2022-08-08 05:48:52,100 [shard 1] cluster - controller_backend.cc:401 - Abort requested while reconciling topics
DEBUG 2022-08-08 05:48:52,100 [shard 0] raft - [group_id:0, {redpanda/controller/0}] replicate_entries_stm.cc:357 - Replication of entries with last offset: 86 aborted - shutting down
DEBUG 2022-08-08 05:48:52,100 [shard 0] raft - [group_id:0, {redpanda/controller/0}] replicate_entries_stm.cc:357 - Replication of entries with last offset: 87 aborted - shutting down
DEBUG 2022-08-08 05:48:52,100 [shard 0] raft - [group_id:0, {redpanda/controller/0}] replicate_entries_stm.cc:357 - Replication of entries with last offset: 88 aborted - shutting down
DEBUG 2022-08-08 05:48:52,100 [shard 0] raft - [group_id:0, {redpanda/controller/0}] replicate_entries_stm.cc:357 - Replication of entries with last offset: 89 aborted - shutting down
DEBUG 2022-08-08 05:48:52,100 [shard 0] cluster - controller_backend.cc:401 - Abort requested while reconciling topics
TRACE 2022-08-08 05:48:52,100 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:2092 - flushed offset updated: 86
TRACE 2022-08-08 05:48:52,100 [shard 0] cluster - members_backend.cc:109 - error waiting for members updates - seastar::abort_requested_exception (abort requested)
ERROR 2022-08-08 05:48:52,100 [shard 0] rpc - Service handler threw an exception: std::exception (std::exception)
INFO  2022-08-08 05:48:52,100 [shard 0] rpc - simple_protocol.cc:85 - Error dispatching method: std::runtime_error (batched_output_stream is already closed. Ignoring: 26 bytes)
ERROR 2022-08-08 05:48:52,100 [shard 0] rpc - Service handler threw an exception: std::exception (std::exception)
INFO  2022-08-08 05:48:52,100 [shard 0] rpc - simple_protocol.cc:85 - Error dispatching method: std::runtime_error (batched_output_stream is already closed. Ignoring: 26 bytes)
ERROR 2022-08-08 05:48:52,100 [shard 0] rpc - Service handler threw an exception: std::exception (std::exception)
INFO  2022-08-08 05:48:52,100 [shard 0] rpc - simple_protocol.cc:85 - Error dispatching method: std::runtime_error (batched_output_stream is already closed. Ignoring: 26 bytes)
ERROR 2022-08-08 05:48:52,100 [shard 0] rpc - Service handler threw an exception: std::exception (std::exception)
INFO  2022-08-08 05:48:52,100 [shard 0] rpc - simple_protocol.cc:85 - Error dispatching method: std::runtime_error (batched_output_stream is already closed. Ignoring: 26 bytes)

BenPope avatar Aug 08 '22 08:08 BenPope

It looks like the errors are being logged while the node is still on v22.1.4.

I suspect this is caused by the fact that https://github.com/redpanda-data/redpanda/pull/5097 lives only on 22.2+, specifically the bits around adding a new exception type for disconnect errors: https://github.com/redpanda-data/redpanda/commit/9ad09f0f2239ffc2a9de8c64cd05b9ee048c8796, https://github.com/redpanda-data/redpanda/pull/5097/commits/d339b68fac5b9ce2ef2025cdb1be5c283568e632, https://github.com/redpanda-data/redpanda/pull/5097/commits/07c8b7e7b876c380a71d019075897565fb9691aa

To boot, https://github.com/redpanda-data/redpanda/pull/4895 lives only on 22.1.6+, which likely explains the bare exception.

I think I'll go about bumping the version we upgrade from to v22.1.6+, though longer term this is a tricky and annoying problem to solve: the error-log allow list is itself a sort of loosely defined protocol that evolves over time, but isn't versioned for upgrade tests on dev. Longer term I think we'll probably want to do away with "allowed errors" altogether, but until then there's a fine line between hiding real bugs and avoiding noise, especially with mixed-version tests.
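
For context, the check that fails here is the one in the traceback at the top of the issue: after the test body finishes, raise_on_bad_logs scans each node's log and raises BadLogLines for any ERROR line that no allow-list pattern covers. A minimal standalone sketch of that filtering, with purely illustrative names apart from raise_on_bad_logs and log_allow_list (which appear in the traceback), not rptest's actual code:

import re

# Patterns a test explicitly tolerates; in rptest these are passed via the
# @cluster(..., log_allow_list=...) wrapper seen in the traceback above.
# The entry below is purely illustrative.
ALLOW_LIST = [re.compile(r"Failed to reconnect")]

def raise_on_bad_logs(log_lines, allow_list):
    # Collect ERROR lines that no allow-list pattern matches.
    bad = [line for line in log_lines
           if "ERROR" in line and not any(p.search(line) for p in allow_list)]
    if bad:
        raise RuntimeError(f"BadLogLines example={bad[0]!r}")

offending = ("ERROR 2022-08-08 05:48:52,100 [shard 0] rpc - Service handler "
             "threw an exception: std::exception (std::exception)")
try:
    raise_on_bad_logs([offending], ALLOW_LIST)
except RuntimeError as e:
    print(e)  # essentially the failure this CI run reports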

andrwng avatar Aug 08 '22 18:08 andrwng

https://buildkite.com/redpanda/redpanda/builds/13844#018281c1-180b-49b6-843a-3b59d51ef7d9

BenPope avatar Aug 09 '22 14:08 BenPope

I agree with @andrwng's assessment; here's another example I experienced:

grep -e ERROR -e Redpanda.v redpanda\ \(2\).log 
INFO   redpanda::main - application.cc:162 - Redpanda v22.1.4 - 491e56900d2316fcbb22aa1d37e7195897878309
ERROR 2022-08-18 17:21:37,026 [shard 0] rpc - Service handler threw an exception: std::exception (std::exception)
INFO   redpanda::main - application.cc:213 - Redpanda v22.2.1-rc3-194-g241a9dbee - 241a9dbee46c174e81ec3ef86d35c11337a074e3

The log allow list for the test should be extended to include this type of bogus exception.
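
A sketch of the kind of entry that would cover it (the constant name and exactly where it gets wired into the test are assumptions; the pattern is checked against the line from the grep output above):

import re

# Hypothetical allow-list entry for the pre-22.2 shutdown error.
PRE_V22_2_SHUTDOWN_ALLOW = [
    re.compile(r"rpc - Service handler threw an exception: std::exception"),
]

line = ("ERROR 2022-08-18 17:21:37,026 [shard 0] rpc - Service handler "
        "threw an exception: std::exception (std::exception)")
assert any(p.search(line) for p in PRE_V22_2_SHUTDOWN_ALLOW)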

jcsp avatar Aug 18 '22 20:08 jcsp

Added a commit to update the allow list in https://github.com/redpanda-data/redpanda/pull/5972/commits

jcsp avatar Aug 18 '22 20:08 jcsp

Something similar here. I expected to see the newer named-semaphore output, then realized the error may be from the pre-upgrade code.

ajfabbri avatar Aug 19 '22 19:08 ajfabbri