
Failure in `PartitionMoveInterruption`.`test_cancelling_partition_move_x_core`

jcsp opened this issue • 19 comments

FAIL test: PartitionMoveInterruption.test_cancelling_partition_move_x_core.replication_factor=3.unclean_abort=True.recovery=restart_recovery (1/84 runs) failure at 2022-07-23T08:22:18.320Z: <BadLogLines nodes=docker-rp-2(1) example="ERROR 2022-07-23 07:26:57,918 [shard 1] rpc - transport.h:179 - Protocol violation: request version rpc::transport_version::v1 incompatible with reply version rpc::transport_version::v2"> in job https://buildkite.com/redpanda/redpanda/builds/12971#018229ae-5aed-4afd-9a59-3d4331bfdc09

Protocol violation: request version rpc::transport_version::v1 incompatible with reply version rpc::transport_version::v2

While this is "only" a BadLogLines, it's surprising to me that this test is hitting any encoding hiccups at all, as it's not overtly doing anything with mixed versions or upgrades.

jcsp avatar Jul 25 '22 09:07 jcsp

@dotnwat I haven't dug into the logs, but is this something ever expected under normal circumstances?

jcsp avatar Jul 25 '22 09:07 jcsp

While this is "only" a BadLogLines, it's surprising to me that this test is hitting any encoding hiccups at all, as it's not overtly doing anything with mixed versions or upgrades.

Right, this can still happen even with no upgrades. Transports start off assuming old nodes, and immediately upgrade after a round trip or two. There were a few cases I fixed last week where an RPC error reply was encoded with the wrong (old) transport version due to an unexamined error handling path.

This is probably similar, but because it is an issue with the reply being upgraded, not downgraded, it is more likely that it is a problem with the type system tricks we are using to select between adl / serde. At least that's my hunch at this point.
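For reference, here is a minimal sketch of the negotiation being described, with illustrative names (this is not Redpanda's actual rpc code): a fresh transport pessimistically encodes requests as v1 and upgrades to v2 once a reply shows the peer understands it, which is why mixed versions can appear even though the test never performs a software upgrade.

#include <cstdint>

// Hypothetical stand-ins for the transport versions named in the log lines.
enum class transport_version : uint8_t { v1 = 1, v2 = 2 };

struct client_transport {
    // New connections conservatively assume an old peer.
    transport_version negotiated{transport_version::v1};

    // Requests are encoded with whatever has been negotiated so far.
    transport_version request_version() const { return negotiated; }

    // After the first round trip the reply advertises that the peer speaks
    // v2 and the transport upgrades; subsequent requests go out as v2.
    void on_reply_version(transport_version reply_v) {
        if (reply_v > negotiated) {
            negotiated = reply_v;
        }
    }
};

The error quoted in this issue is raised when the version a particular request was encoded with (v1) does not match the version carried by the reply paired with it (v2).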

dotnwat avatar Jul 26 '22 14:07 dotnwat

Another one: https://buildkite.com/redpanda/redpanda/builds/13173#01823f39-c0bb-401e-b94f-e4832ec8627d

test_id: rptest.tests.partition_move_interruption_test.PartitionMoveInterruption.test_cancelling_partition_move_x_core.replication_factor=3.unclean_abort=True.recovery=no_recovery

status: FAIL

run time: 2 minutes 33.439 seconds

<BadLogLines nodes=docker-rp-14(1) example="ERROR 2022-07-27 11:46:46,883 [shard 1] storage - cannot continue parsing. recived size:403 bytes, expected:16314 bytes. context:parser::consume_records">

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 48, in wrapped
    self.redpanda.raise_on_bad_logs(allow_list=log_allow_list)
  File "/root/tests/rptest/services/redpanda.py", line 1124, in raise_on_bad_logs
    raise BadLogLines(bad_lines)
rptest.services.utils.BadLogLines: <BadLogLines nodes=docker-rp-14(1) example="ERROR 2022-07-27 11:46:46,883 [shard 1] storage - cannot continue parsing. recived size:403 bytes, expected:16314 bytes. context:parser::consume_records">

LenaAn avatar Jul 27 '22 15:07 LenaAn

Another badloglines example

https://buildkite.com/redpanda/redpanda/builds/13171#01823ef7-8a0f-4c14-b1e5-942de9800afe

rptest.services.utils.BadLogLines: <BadLogLines nodes=docker-rp-14(1) example="ERROR 2022-07-27 10:22:51,731 [shard 1] r/heartbeat - heartbeat_manager.cc:284 - cannot find consensus group:20">

jcsp avatar Jul 27 '22 17:07 jcsp

I'll follow John's comment above with another r/heartbeat - heartbeat_manager.cc:284 - cannot find consensus group:5 error in this test here.

ajfabbri avatar Jul 28 '22 04:07 ajfabbri

I haven't dug into these two buildkite links that you left, @ajfabbri and @jcsp, but the badloglines don't show the protocol violation. are we just recording failures of this test regardless of how they present?

dotnwat avatar Jul 28 '22 14:07 dotnwat

In general I like to have different failure modes in different tickets, but in the case of a brand new test that's failing several ways, I think it makes sense to bundle them up here.

jcsp avatar Jul 28 '22 14:07 jcsp

Regarding the cannot find consensus group error logs, I looked into it a bit and I think it's worth reducing severity (at least in some cases). See https://github.com/redpanda-data/redpanda/pull/5742 for more context.

andrwng avatar Jul 29 '22 22:07 andrwng

@ztlpn does this one look interesting? I'm not sure what all the partition interruption work is doing:



<BadLogLines nodes=docker-rp-14(1) example="ERROR 2022-07-27 11:46:46,883 [shard 1] storage - cannot continue parsing. recived size:403 bytes, expected:16314 bytes. context:parser::consume_records">

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 48, in wrapped
    self.redpanda.raise_on_bad_logs(allow_list=log_allow_list)
  File "/root/tests/rptest/services/redpanda.py", line 1124, in raise_on_bad_logs
    raise BadLogLines(bad_lines)
rptest.services.utils.BadLogLines: <BadLogLines nodes=docker-rp-14(1) example="ERROR 2022-07-27 11:46:46,883 [shard 1] storage - cannot continue parsing. recived size:403 bytes, expected:16314 bytes. context:parser::consume_records">

dotnwat avatar Jul 30 '22 01:07 dotnwat

Note that the Protocol violation didn't occur in any of the extra reported cases in this issue.

The scenarios under which the protocol violation might occur are becoming more exotic (corruption, mismatched correlation id). It's as if the response came from a previous instance of the transport (i.e. from before a reconnection reset the transport) and the correlation ids happened to match.

dotnwat avatar Jul 30 '22 01:07 dotnwat

looked into this again, and it continues to be a mystery.

the leading contender for a root cause is a bug in the reconnect_transport reset logic, where a reply to a request from a previous "generation" of the reset reconnect_transport ends up being handled mistakenly.

assuming such a bug does exist, having it as the root cause for this issue would imply a pretty serious bug, because a correlation id would be matching incorrectly. this would be far more likely for two back-to-back reconnects where the correlation id is small. it's reset to 0, so we'd look for scenarios where the correlation id hadn't had time to become large, thus increasing the likelihood of a collision after a reset.

the other potential cause, and i haven't seen any evidence to support it, is some request/reply that bypasses the normal request processing code paths. the response in question is append_entries_reply. any thoughts on whether there might be a special-case rpc scenario, @mmaslankaprv?
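As a rough illustration of the suspected collision, here is a hedged sketch with made-up names (not the actual reconnect_transport code): in-flight requests are matched to replies purely by correlation id, so if the counter restarts at 0 after a reset, a late reply from the previous connection generation can complete the wrong request.

#include <cstdint>
#include <functional>
#include <unordered_map>

using correlation_id = uint32_t;

struct transport_state {
    correlation_id next_corr = 0;
    // correlation id -> completion handler for the outstanding request
    std::unordered_map<correlation_id, std::function<void(int)>> inflight;

    correlation_id send(std::function<void(int)> on_reply) {
        auto corr = next_corr++;
        inflight.emplace(corr, std::move(on_reply));
        return corr;
    }

    void on_reply(correlation_id corr, int payload) {
        auto it = inflight.find(corr);
        if (it == inflight.end()) {
            return; // unknown correlation id: reply is dropped
        }
        // If `corr` belongs to a reply sent *before* the last reset but a
        // fresh request has since been issued with the same small id, the
        // wrong request is completed here.
        it->second(payload);
        inflight.erase(it);
    }

    void reset() {
        inflight.clear();
        next_corr = 0; // restarting at 0 is what makes small post-reset ids
                       // prone to colliding with stale replies
    }
};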

dotnwat avatar Aug 03 '22 06:08 dotnwat

Another failure observed here: https://buildkite.com/redpanda/redpanda/builds/13600#01826a4d-da7d-49a9-b547-18136d155717

graphcareful avatar Aug 04 '22 21:08 graphcareful

an interesting bit is here

INFO  2022-08-04 20:23:48,024 [shard 1] rpc - parse_utils.h:59 - rpc header missmatching checksums. expected:20480, got:1044318496 - {version:0, header_checksum:20480, compression:0, payload_size:786432, meta:131072, correlation_id:2097152, payload_checksum:0}

we are reading a header off the wire and seeing a checksum mismatch. the header's version is also 0, which is peculiar. a few lines later we see what looks to be the same fiber (though, gosh, these logs could be improved) stating that the header arrived from the server (the same parse_utils.h:59 rpc header path is used on both client and server):

DEBUG 2022-08-04 20:23:48,025 [shard 1] rpc - transport.cc:233 - could not parse header from server: {host: docker-rp-19, port: 33145}

TRACE 2022-08-04 20:23:47,958 [shard 1] raft - [follower: {id: {3}, revision: {32}}] [group_id:3, {kafka/topic-umqeaiwvmz/2}] - recovery_stm.cc:221 - Reading batches, starting from: 23721
TRACE 2022-08-04 20:23:47,958 [shard 1] storage - readers_cache.cc:101 - {kafka/topic-umqeaiwvmz/2} - trying to get reader for: {start_offset:{23721}, max_offset:{9223372036854775807}, min_bytes:1, max_bytes:524288, type_filter:nullopt, first_timestamp:nullopt}
TRACE 2022-08-04 20:23:47,958 [shard 1] storage - readers_cache.cc:130 - {kafka/topic-umqeaiwvmz/2} - reader cache miss for: {start_offset:{23721}, max_offset:{9223372036854775807}, min_bytes:1, max_bytes:524288, type_filter:nullopt, first_timestamp:nullopt}
TRACE 2022-08-04 20:23:47,982 [shard 1] r/heartbeat - heartbeat_manager.cc:262 - Heartbeat timeout, node: 3
ERROR 2022-08-04 20:23:48,023 [shard 1] rpc - transport.h:180 - Protocol violation: request version rpc::transport_version::v1 incompatible with reply version rpc::transport_version::v2 reply type raft::append_entries_reply
TRACE 2022-08-04 20:23:48,023 [shard 1] kafka - requests.cc:97 - [172.16.16.7:38182] processing name:fetch, key:1, version:11 for consumer-test_group-1, mem_units: 8216
TRACE 2022-08-04 20:23:48,024 [shard 1] kafka - fetch.cc:549 - handling fetch request: {replica_id=-1 max_wait_ms=500 min_bytes=1 max_bytes=52428800 isolation_level=read_uncommitted session_id=1073741824 session_epoch=223 topics={} forgotten={{name={topic-umqeaiwvmz} forgotten_partition_indexes={14, 16, 6, 2, 19}}} rack_id=}
INFO  2022-08-04 20:23:48,024 [shard 1] rpc - parse_utils.h:59 - rpc header missmatching checksums. expected:20480, got:1044318496 - {version:0, header_checksum:20480, compression:0, payload_size:786432, meta:131072, correlation_id:2097152, payload_checksum:0}
TRACE 2022-08-04 20:23:48,024 [shard 1] storage - readers_cache.cc:101 - {kafka/topic-umqeaiwvmz/9} - trying to get reader for: {start_offset:{31900}, max_offset:{32177}, min_bytes:0, max_bytes:1048576, type_filter:batch_type::raft_data, first_timestamp:nullopt}
TRACE 2022-08-04 20:23:48,025 [shard 1] storage - readers_cache.cc:130 - {kafka/topic-umqeaiwvmz/9} - reader cache miss for: {start_offset:{31900}, max_offset:{32177}, min_bytes:0, max_bytes:1048576, type_filter:batch_type::raft_data, first_timestamp:nullopt}
DEBUG 2022-08-04 20:23:48,025 [shard 1] rpc - transport.cc:233 - could not parse header from server: {host: docker-rp-19, port: 33145}

the full sequence is probably that this happens first for some reason:

ERROR 2022-08-04 20:23:48,023 [shard 1] rpc - transport.h:180 - Protocol violation: request version rpc::transport_version::v1 incompatible with reply version rpc::transport_version::v2 reply type raft::append_entries_reply

which signals body parsing via sctx->signal_body_parse(). that signal then causes the next header on the wire to be read, which results in the header checksum errors that appear below the protocol violation in the log.

so... is there really a protocol violation? or... does the wire have junk on it?
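To make the desync theory concrete, here is a toy sketch (illustrative layout and checksum, not the real framing code): once the reader is offset into the middle of a payload, the bytes it interprets as the next header are arbitrary, so the recomputed checksum can't match the stored one and fields like version and payload_size come out nonsensical, which is exactly what the "missmatching checksums" line above shows.

#include <cstddef>
#include <cstdint>
#include <cstring>

// Toy header layout; field names follow the log line above.
struct wire_header {
    uint8_t version;
    uint32_t header_checksum;
    uint32_t payload_size;
    uint32_t correlation_id;
};

// Toy checksum over the non-checksum fields (the real code uses a proper
// hash; this only shows the shape of the validation).
static uint32_t checksum_of(const wire_header& h) {
    return static_cast<uint32_t>(h.version) ^ h.payload_size ^ h.correlation_id;
}

// Returns false on mismatch, i.e. the "rpc header missmatching checksums"
// path. If buf actually points into the middle of a previous payload, every
// field read here is garbage and the check fails.
static bool parse_header(const uint8_t* buf, std::size_t len, wire_header& out) {
    if (len < sizeof(wire_header)) {
        return false;
    }
    std::memcpy(&out, buf, sizeof(out));
    return checksum_of(out) == out.header_checksum;
}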

It looks like on the server, a second or so before the trace above, the client sent a bad message. But there are others of these from other nodes, so it could be unrelated; maybe the client just disconnected?

DEBUG 2022-08-04 20:23:47,204 [shard 1] rpc - could not parse header from client: 172.16.16.24:54811

dotnwat avatar Aug 04 '22 23:08 dotnwat

Another test triggered a violation:

FAIL test: PartitionBalancerTest.test_unavailable_nodes (1/67 runs)
  failure at 2022-08-07T07:25:17.039Z: <BadLogLines nodes=docker-rp-20(1) example="ERROR 2022-08-07 06:09:28,345 [shard 0] rpc - transport.h:179 - Protocol violation: request version rpc::transport_version::v1 incompatible with reply version rpc::transport_version::v2 reply type raft::heartbeat_reply">
      in job https://buildkite.com/redpanda/redpanda/builds/13738#018276bf-fdd9-45e5-92f4-f783a59bda04

mmaslankaprv avatar Aug 08 '22 07:08 mmaslankaprv

https://buildkite.com/redpanda/redpanda/builds/13844#018281c1-180b-49b6-843a-3b59d51ef7d9

BenPope avatar Aug 09 '22 14:08 BenPope

@dotnwat found this issue by searching "could not parse header"

I am seeing some strange rpc behavior as well in the following test run: https://github.com/redpanda-data/redpanda/issues/5471#issuecomment-1209505474

Basically there is a problem with completing rpc requests from docker-rp-13 to docker-rp-12 in the interval 03:07:35-03:07:45. What is strange is that both nodes are online, and the problem manifests only in one direction! I.e. requests from docker-rp-12 to docker-rp-13 go through just fine.

Any ideas how this could happen?

ztlpn avatar Aug 10 '22 02:08 ztlpn

@ztlpn

  1. was this the same test that this ticket is tracking?
  2. did you see the same Protocol violation?

Yesterday we changed the handling for the protocol violation so that it is handled in a similar way to a crc check failure: close down the connection and rely on the higher level to reconnect.

I will have a closer look. It is very odd that it only happens in one direction!
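For context, the shape of that handling change in sketch form (made-up names, not the actual patch): a protocol violation on the reply path is treated like a failed crc check, meaning the stream is no longer trusted, the connection is closed, and the higher-level reconnect logic establishes a fresh transport.

struct connection {
    void shutdown() { /* tear down the socket, fail outstanding requests */ }
};

enum class parse_result { ok, crc_mismatch, version_violation };

void handle_reply(connection& conn, parse_result r) {
    switch (r) {
    case parse_result::ok:
        break; // dispatch the reply to the waiting request as usual
    case parse_result::crc_mismatch:
    case parse_result::version_violation:
        // Both mean the byte stream can no longer be trusted: close and
        // rely on the reconnect layer rather than trying to resynchronize.
        conn.shutdown();
        break;
    }
}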

dotnwat avatar Aug 10 '22 03:08 dotnwat

started a new round of debug runs: https://github.com/redpanda-data/redpanda/pull/5931. still unable to reproduce locally.

dotnwat avatar Aug 10 '22 04:08 dotnwat

Note here for myself

rptest.services.utils.BadLogLines: <BadLogLines nodes=docker-rp-2(2) example="ERROR 2022-08-12 04:38:07,862 [shard 1] 
rpc - transport.h:192 - Protocol violation: request version rpc::transport_version::v1 incompatible with reply version 
rpc::transport_version::v2 status rpc::status::success req_g 33 rep_g 33 req_corr 2 curr_corr 7 count 5470 ctx_count 5470 reply
type raft::append_entries_reply">

Extra context in the logs strongly suggests that this isn't a concurrency bug, as it looks like no transport resets are occurring between the request starting and the response being handled.

Moving on now to try and log all requests/responses and see if we can match them on the server side and client side.

It is still curious that the correlation IDs when this happens are small, like 2-4. This means that the request was sent out very shortly after a transport reset.

dotnwat avatar Aug 12 '22 05:08 dotnwat

@dotnwat what is the current status of this issue ?

mmaslankaprv avatar Oct 03 '22 10:10 mmaslankaprv

Panda Results

In the last 30 days this test failed several times, but all with "Unexpected files in data directory", which seems to be something common. Maybe a clean-up issue:

[nwatkins@fedora pandaresults]$ cat log.txt | grep "Unexpected files in data di" | wc -l
229

log

[nwatkins@fedora pandaresults]$ env/bin/python3 main.py dev 30d 2>&1 | tee log.txt
[nwatkins@fedora pandaresults]$ cat log.txt | grep -A1 test_cancelling_partition_move_x_core
OFAIL test: PartitionMoveInterruption.test_cancelling_partition_move_x_core.replication_factor=1.unclean_abort=False.recovery=no_recovery.compacted=True (1/119 runs)
  failure at 2022-10-11T07:31:23.471Z: RuntimeError('Unexpected files in data directory')
--
OFAIL test: PartitionMoveInterruption.test_cancelling_partition_move_x_core.replication_factor=3.unclean_abort=True.recovery=no_recovery.compacted=False (1/119 runs)
  failure at 2022-10-11T07:31:23.471Z: RuntimeError('Unexpected files in data directory')
--
OFAIL test: PartitionMoveInterruption.test_cancelling_partition_move_x_core.replication_factor=3.unclean_abort=True.recovery=no_recovery.compacted=True (1/119 runs)
  failure at 2022-10-11T07:31:23.471Z: RuntimeError('Unexpected files in data directory')
--
OFAIL test: PartitionMoveInterruption.test_cancelling_partition_move_x_core.replication_factor=1.unclean_abort=False.recovery=no_recovery (2/458 runs)
  failure at 2022-09-26T20:33:53.624Z: RuntimeError('Unexpected files in data directory')
--
OFAIL test: PartitionMoveInterruption.test_cancelling_partition_move_x_core.replication_factor=1.unclean_abort=False.recovery=restart_recovery (2/458 runs)
  failure at 2022-09-26T20:33:53.624Z: RuntimeError('Unexpected files in data directory')
--
OFAIL test: PartitionMoveInterruption.test_cancelling_partition_move_x_core.replication_factor=1.unclean_abort=True.recovery=no_recovery (2/458 runs)
  failure at 2022-09-26T20:33:53.624Z: RuntimeError('Unexpected files in data directory')
--
OFAIL test: PartitionMoveInterruption.test_cancelling_partition_move_x_core.replication_factor=3.unclean_abort=False.recovery=no_recovery (2/458 runs)
  failure at 2022-09-26T20:33:53.624Z: RuntimeError('Unexpected files in data directory')
--
OFAIL test: PartitionMoveInterruption.test_cancelling_partition_move_x_core.replication_factor=1.unclean_abort=True.recovery=restart_recovery (2/458 runs)
  failure at 2022-09-26T20:33:53.624Z: RuntimeError('Unexpected files in data directory')
--
OFAIL test: PartitionMoveInterruption.test_cancelling_partition_move_x_core.replication_factor=3.unclean_abort=True.recovery=no_recovery (2/458 runs)
  failure at 2022-09-26T20:33:53.624Z: RuntimeError('Unexpected files in data directory')
--
OFAIL test: PartitionMoveInterruption.test_cancelling_partition_move_x_core.replication_factor=3.unclean_abort=False.recovery=restart_recovery (2/458 runs)
  failure at 2022-09-26T20:33:53.624Z: RuntimeError('Unexpected files in data directory')
--
OFAIL test: PartitionMoveInterruption.test_cancelling_partition_move_x_core.replication_factor=3.unclean_abort=True.recovery=restart_recovery (2/458 runs)
  failure at 2022-09-26T20:33:53.624Z: RuntimeError('Unexpected files in data directory')

Test runs

https://github.com/redpanda-data/redpanda/pull/6782 removed ok_to_fail for this test (https://github.com/redpanda-data/redpanda/pull/6782/files)

  • 99 green (1 unrelated CI infrastructure failure) https://buildkite.com/redpanda/redpanda/builds/16730
  • 100 green https://buildkite.com/redpanda/redpanda/builds/16746
  • 150 green (https://buildkite.com/redpanda/redpanda/builds/16824)

Other evidence

On August 13 we merged a change which stopped resetting the correlation ID in the RPC transport, because we were seeing unexplainable failures in which it appeared a race was occurring and the wrong response was being matched up with a request.

commit d4e122930e4170d7035162db7121ca2f12d26b29 (origin/rpc-skip-corr-reset)
Author: Noah Watkins <[email protected]>
Date:   Sat Aug 13 09:57:25 2022 -0700

    rpc: retain correlation idx across transport resets
    
    The rpc client transport works as an unordered stream of requests and
    responses. In order to match a response to a request each request
    contains a unique correlation id which the server includes the header of
    a response. The client tracks inflight requests by correlation id and
    matches the response using an index.

The last report of this issue was August 19 (for a PR that had been up for weeks prior to the 19th), so it could presumably have come from a CI run that was much closer in time to the 13th.

Related failure

This was also reported once on August 13 around the same time that the correlation ID fix went in.

https://github.com/redpanda-data/redpanda/issues/6020

dotnwat avatar Oct 14 '22 21:10 dotnwat

@jcsp I think this can be closed based on the previous comment's analysis. wdyt?

dotnwat avatar Oct 17 '22 19:10 dotnwat

I concur.

jcsp avatar Oct 17 '22 20:10 jcsp