Failure in `EndToEndTopicRecovery`.`test_restore_with_aborted_tx` / `test_shadow_indexing_aborted_txs`
Redpanda version: dev
File "/root/tests/rptest/tests/e2e_topic_recovery_test.py", line 288, in test_restore_with_aborted_tx
for p_key, (c_key, c_offset) in zip_longest(producer.keys, consumed):
TypeError: cannot unpack non-iterable NoneType object
The failure essentially means that we consumed fewer messages than were produced.
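A minimal sketch of why the test raises this particular `TypeError` (the data below is hypothetical, not from the actual run): `zip_longest` pads the shorter iterable with `None`, so when the consumed list is shorter than the produced list, the inner `(c_key, c_offset)` unpacking eventually hits a `None`.

```python
from itertools import zip_longest

# Hypothetical data: one fewer message consumed than produced.
produced_keys = [b"0", b"1", b"2"]
consumed = [(b"0", 0), (b"1", 1)]

try:
    # zip_longest fills the missing consumed entry with None, and
    # unpacking None into (c_key, c_offset) raises TypeError.
    for p_key, (c_key, c_offset) in zip_longest(produced_keys, consumed):
        pass
except TypeError as err:
    print(err)  # cannot unpack non-iterable NoneType object
```

So the `TypeError` is just the symptom; the underlying condition is a produced/consumed length mismatch.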
https://buildkite.com/redpanda/redpanda/builds/17716#018433cb-f177-4b27-9fe4-e071ecf19f49 https://buildkite.com/redpanda/redpanda/builds/17663#01843078-297e-4672-b260-c73ad6969674
Can you please provide a link to the failed build?
Sorry, missed that somehow. Updated the issue.
https://buildkite.com/redpanda/redpanda/builds/17716#018433cb-f177-4b27-9fe4-e071ecf19f49
Looks like the recovery downloads the segment correctly but then it gets truncated right after raft bootstrap.
https://buildkite.com/redpanda/redpanda/builds/17898#01844128-535b-4789-9063-333aa64e6d4c https://buildkite.com/redpanda/redpanda/builds/17906#01844252-a6c3-404f-bd53-f187581fbca0 https://buildkite.com/redpanda/redpanda/builds/17895#0184410d-c880-49d2-ae2f-a9b90a8b3aa8
https://buildkite.com/redpanda/redpanda/builds/18018#018452c7-44cf-4178-981e-ca6b739b4a3a/6-1158
Another in https://buildkite.com/redpanda/redpanda/builds/18143#0184586f-918e-4f87-ac6e-c179c53438ac/6-680
https://buildkite.com/redpanda/redpanda/builds/18290#01845eaf-1a8e-455b-81fa-294d69c8301c/6-732
https://buildkite.com/redpanda/redpanda/builds/18463#01846631-9be1-45a6-b5fe-028eb8f3d59d/6-451
There is also a different failure in the same test:
https://buildkite.com/redpanda/redpanda/builds/18401#01846341-7814-4a89-ac1b-a5a67ce4b863/6-532
AssertionError("produced and consumed messages differ, produced length: 13254, consumed length: 0, first mismatch: produced: b'0', consumed: None (offset: -1)")
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
data = self.run_test()
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
return self.test_context.function(self.test)
File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
r = f(self, *args, **kwargs)
File "/root/tests/rptest/tests/shadow_indexing_tx_test.py", line 132, in test_shadow_indexing_aborted_txs
assert (not first_mismatch), (
AssertionError: produced and consumed messages differ, produced length: 13254, consumed length: 0, first mismatch: produced: b'0', consumed: None (offset: -1)
Seen a couple of times on a ci-repeat in https://github.com/redpanda-data/redpanda/pull/7242
https://buildkite.com/redpanda/redpanda/builds/18462#01846631-7346-4aac-97ab-b2b489c2cf85 https://buildkite.com/redpanda/redpanda/builds/18462#01846631-7348-4f66-8c63-92500bfc3a34
test_id: rptest.tests.e2e_topic_recovery_test.EndToEndTopicRecovery.test_restore_with_aborted_tx.recovery_overrides=.retention.bytes.1024.redpanda.remote.write.True.redpanda.remote.read.True
status: FAIL
run time: 1 minute 11.075 seconds
TypeError('cannot unpack non-iterable NoneType object')
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
data = self.run_test()
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
return self.test_context.function(self.test)
File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
r = f(self, *args, **kwargs)
File "/root/tests/rptest/tests/e2e_topic_recovery_test.py", line 313, in test_restore_with_aborted_tx
for p_key, (c_key, c_offset) in zip_longest(producer.keys, consumed):
TypeError: cannot unpack non-iterable NoneType object
This failure again:
AssertionError("produced and consumed messages differ, produced length: 13765, consumed length: 13774, first mismatch: produced: None, consumed: b'14866' (offset: 15265)")
https://buildkite.com/redpanda/redpanda/builds/18524#01846f6f-36d1-4107-b9e0-02cf1ab05e0a
in the release-clang-amd64 https://buildkite.com/redpanda/redpanda/builds/18587#018479bc-e5b7-4777-a76d-322555029e92
but this time it seems to be a problem with the test itself: TypeError('cannot unpack non-iterable NoneType object')
Seen another here: https://buildkite.com/redpanda/redpanda/builds/18618#01847be9-e191-447b-80a7-45df27069737. Same failure mode as in the original issue.
I'll work on it today. It looks like the test consumed more data than it produced, which may indicate that the consumer was able to see batches from aborted transactions.
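A toy illustration of the hypothesis above (all record contents and offsets are made up): if the consumer fails to filter out batches belonging to aborted transactions, the consumed count can exceed the committed count, matching the "consumed length > produced length" assertion seen earlier in the thread.

```python
# Hypothetical sketch: the log mixes committed records with records from
# an aborted transaction; a correct consumer must drop the aborted ones.
records = [
    {"offset": 0, "key": b"0", "aborted": False},
    {"offset": 1, "key": b"1", "aborted": True},   # from an aborted tx
    {"offset": 2, "key": b"2", "aborted": False},
]

def consume(records, honor_aborted=True):
    # When honor_aborted is False, aborted-tx records leak through,
    # inflating the consumed count past what was committed.
    return [r["key"] for r in records
            if not (honor_aborted and r["aborted"])]

print(len(consume(records)))                       # 2: aborted record dropped
print(len(consume(records, honor_aborted=False)))  # 3: more than committed
```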
Another one: https://buildkite.com/redpanda/redpanda/builds/18587#018479bc-e5b9-4a0e-b59b-e01d2e099b3e
FAIL test: EndToEndTopicRecovery.test_restore_with_aborted_tx.recovery_overrides=.retention.bytes.1024.redpanda.remote.write.True.redpanda.remote.read.True (1/30 runs)
failure at 2022-11-15T08:06:49.706Z: TypeError('cannot unpack non-iterable NoneType object')
in job https://buildkite.com/redpanda/redpanda/builds/18587#018479bc-e5b9-4a0e-b59b-e01d2e099b3e
FAIL test: EndToEndTopicRecovery.test_restore_with_aborted_tx.recovery_overrides=.retention.bytes.1024.redpanda.remote.write.True.redpanda.remote.read.True (1/18 runs)
failure at 2022-11-18T08:14:27.356Z: TypeError('cannot unpack non-iterable NoneType object')
in job https://buildkite.com/redpanda/redpanda/builds/18803#01848939-1855-4c6e-9bf6-fe9911044239
This one on arm:
FAIL test: ShadowIndexingTxTest.test_shadow_indexing_aborted_txs (1/18 runs)
failure at 2022-11-18T07:58:55.995Z: AssertionError("produced and consumed messages differ, produced length: 13821, consumed length: 2346, first mismatch: produced: b'2483', consumed: None (offset: -1)")
in job https://buildkite.com/redpanda/redpanda/builds/18804#01848941-599a-4491-8a5b-8fec91520610
the first error:
[ERROR - 2022-11-18 07:50:13,454 - cluster - wrapped - lineno:41]: Test failed, doing failure checks...
Traceback (most recent call last):
File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
r = f(self, *args, **kwargs)
File "/root/tests/rptest/tests/shadow_indexing_tx_test.py", line 132, in test_shadow_indexing_aborted_txs
assert (not first_mismatch), (
FAIL test: EndToEndTopicRecovery.test_restore_with_aborted_tx.recovery_overrides=.retention.bytes.1024.redpanda.remote.write.True.redpanda.remote.read.True (2/50 runs)
failure at 2022-11-21T07:41:53.231Z: TypeError('cannot unpack non-iterable NoneType object') in job https://buildkite.com/redpanda/redpanda/builds/18848#018498a2-1793-4ba8-af2f-35a508bfe19f
failure at 2022-11-19T07:42:30.317Z: TypeError('cannot unpack non-iterable NoneType object') in job https://buildkite.com/redpanda/redpanda/builds/18828#01848e55-6349-4a70-a749-c9776019b3ac
Analyzing this failure: https://buildkite.com/redpanda/redpanda/builds/19009#0184a698-b536-4473-be39-21e5bc9c8628
This looks like a hang in remote partition read path. It is busy spinning like this:
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber18 kafka/topic-ynmjqwyivv/0] - remote_partition.cc:176 - Invoking 'read_some' on current log reader with config: {start_offset:{7724}, max_offset:{15212}, min_bytes:0, max_bytes:1048576, type_filter:batch_type::raft_data, first_timestamp:nullopt}
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber2~5~34430 kafka/topic-ynmjqwyivv/0] - remote_segment.cc:1004 - remote_segment_batch_reader::init_parser, start_offset: 7724
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber2~5 kafka/topic-ynmjqwyivv/0 [6444:7730]] - remote_segment.cc:190 - remote segment file input stream at offset 7724
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber2~5 kafka/topic-ynmjqwyivv/0 [6444:7730]] - remote_segment.cc:674 - segment {"ffef87a3/kafka/topic-ynmjqwyivv/0_19/6444-7730-1048663-1-v1.log.1"} hydration requested
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber2~5~34429 kafka/topic-ynmjqwyivv/0] - remote_segment.cc:1029 - remote_segment_batch_reader::stop
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber2~5~34429 kafka/topic-ynmjqwyivv/0] - remote_segment.cc:1032 - remote_segment_batch_reader::stop - parser-close
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber2~5 kafka/topic-ynmjqwyivv/0 [6444:7730]] - remote_segment.cc:557 - Segment {"ffef87a3/kafka/topic-ynmjqwyivv/0_19/6444-7730-1048663-1-v1.log.1"} requested, 1 consumers are awaiting, data file is available
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber2~5 kafka/topic-ynmjqwyivv/0 [6444:7730]] - remote_segment.cc:244 - Using index to locate 7724, the result is rp-offset: 7670, kafka-offset: 7664, file-pos: 996607
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber2~5~34430~0 kafka/topic-ynmjqwyivv/0] - remote_segment.cc:826 - accept_batch_start skip because last_kafka_offset 7675 (last_rp_offset: 7681) < config.start_offset: 7724
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber2~5~34430~0 kafka/topic-ynmjqwyivv/0] - remote_segment.cc:871 - skip_batch_start called for 7670
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber2~5~34430~0 kafka/topic-ynmjqwyivv/0] - remote_segment.cc:826 - accept_batch_start skip because last_kafka_offset 7685 (last_rp_offset: 7691) < config.start_offset: 7724
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber2~5~34430~0 kafka/topic-ynmjqwyivv/0] - remote_segment.cc:871 - skip_batch_start called for 7682
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber2~5~34430~0 kafka/topic-ynmjqwyivv/0] - remote_segment.cc:826 - accept_batch_start skip because last_kafka_offset 7694 (last_rp_offset: 7700) < config.start_offset: 7724
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber2~5~34430~0 kafka/topic-ynmjqwyivv/0] - remote_segment.cc:871 - skip_batch_start called for 7692
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber2~5~34430~0 kafka/topic-ynmjqwyivv/0] - remote_segment.cc:826 - accept_batch_start skip because last_kafka_offset 7708 (last_rp_offset: 7714) < config.start_offset: 7724
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber2~5~34430~0 kafka/topic-ynmjqwyivv/0] - remote_segment.cc:871 - skip_batch_start called for 7701
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber2~5~34430~0 kafka/topic-ynmjqwyivv/0] - remote_segment.cc:826 - accept_batch_start skip because last_kafka_offset 7720 (last_rp_offset: 7726) < config.start_offset: 7724
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber2~5~34430~0 kafka/topic-ynmjqwyivv/0] - remote_segment.cc:871 - skip_batch_start called for 7715
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber2~5~34430~0 kafka/topic-ynmjqwyivv/0] - remote_segment.cc:826 - accept_batch_start skip because last_kafka_offset 7722 (last_rp_offset: 7728) < config.start_offset: 7724
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber2~5~34430~0 kafka/topic-ynmjqwyivv/0] - remote_segment.cc:871 - skip_batch_start called for 7727
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber2~5~34430~0 kafka/topic-ynmjqwyivv/0] - remote_segment.cc:826 - accept_batch_start skip because last_kafka_offset 7723 (last_rp_offset: 7729) < config.start_offset: 7724
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber2~5~34430~0 kafka/topic-ynmjqwyivv/0] - remote_segment.cc:871 - skip_batch_start called for 7729
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber2~5~34430~0 kafka/topic-ynmjqwyivv/0] - remote_segment.cc:811 - accept_batch_start skip because record batch type is batch_type::tx_fence
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber2~5~34430~0 kafka/topic-ynmjqwyivv/0] - remote_segment.cc:871 - skip_batch_start called for 7730
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber18 kafka/topic-ynmjqwyivv/0] - remote_partition.cc:323 - maybe_reset_reader called
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber18 kafka/topic-ynmjqwyivv/0] - remote_partition.cc:344 - maybe_reset_reader, config start_offset: 7724, reader max_offset: 7730
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber18 kafka/topic-ynmjqwyivv/0] - remote_partition.cc:348 - maybe_reset_stream condition triggered
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber18 kafka/topic-ynmjqwyivv/0] - remote_partition.cc:370 - initializing new segment reader
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber18 kafka/topic-ynmjqwyivv/0] - remote_partition.cc:385 - initializing new segment reader 7724 7725
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber2 kafka/topic-ynmjqwyivv/0] - segment_state.cc:73 - creating new reader, config: {start_offset:{7724}, max_offset:{15212}, min_bytes:0, max_bytes:1048576, type_filter:batch_type::raft_data, first_timestamp:nullopt}
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber18 kafka/topic-ynmjqwyivv/0] - remote_partition.cc:410 - maybe_reset_reader completed, reader is present: true, is end of stream: false
So the reader is asking for kafka offset 7724, which corresponds to a tx_fence batch. We find the offset but skip it because we use a raft_data filter, then loop forever: looking up the offset, finding the same segment, traversing it, and ignoring the match we find.
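The non-terminating loop described above can be sketched as follows (a simplified model, not the actual C++ read path; the offsets and batch layout are illustrative only): the requested offset maps to a filtered-out batch type, and the reader is reset back to the same start offset instead of advancing past it.

```python
# Toy model of the spin: kafka offset -> batch type.
TX_FENCE, RAFT_DATA = "tx_fence", "raft_data"
segment = {7723: RAFT_DATA, 7724: TX_FENCE}

def read_some(start_offset, type_filter, max_iters=5):
    offset = start_offset
    for _ in range(max_iters):  # the real loop has no such bound
        if segment.get(offset) == type_filter:
            return offset  # batch accepted by the filter
        # The tx_fence batch at the start offset is skipped, and the
        # reader is reset to the SAME start offset, so no progress is made.
        offset = start_offset
    return None  # the real read path would spin here forever

print(read_some(7724, RAFT_DATA))  # None: the fence batch blocks progress
print(read_some(7723, RAFT_DATA))  # 7723: a data batch is accepted normally
```

Under this model, a fix needs the reader to advance the start offset past a filtered batch rather than reset to it.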
FAIL test: ShadowIndexingTxTest.test_shadow_indexing_aborted_txs (2/21 runs)
failure at 2022-11-25T08:01:45.162Z: AssertionError("produced and consumed messages differ, produced length: 13875, consumed length: 1162, first mismatch: produced: b'1258', consumed: None (offset: -1)") on (arm64, container) in job https://buildkite.com/redpanda/redpanda/builds/19082#0184ad4d-dc37-49eb-ae7e-6588f6a7f85a
failure at 2022-11-25T08:22:12.464Z: AssertionError("produced and consumed messages differ, produced length: 13155, consumed length: 8668, first mismatch: produced: b'9741', consumed: None (offset: -1)") on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/19082#0184ad4d-dc3a-4868-83ca-09ab8595c673
https://buildkite.com/redpanda/redpanda/builds/19140#0184b64e-6f8e-44a8-99a0-870ca1e100a5/6-273 another occurrence
FAIL test: ShadowIndexingTxTest.test_shadow_indexing_aborted_txs (3/62 runs)
failure at 2022-11-26T07:45:41.678Z: AssertionError("produced and consumed messages differ, produced length: 13601, consumed length: 6458, first mismatch: produced: b'7414', consumed: None (offset: -1)") on (arm64, container) in job https://buildkite.com/redpanda/redpanda/builds/19135#0184b261-e74e-4c74-a8d8-e2482a8c87a0
Another instance: https://ci-artifacts.dev.vectorized.cloud/redpanda/0184f328-8dd6-44ca-b4c1-688c8a356b8d/vbuild/ducktape/results/2022-12-08--001/report.html
Another instance related to the "produced and consumed messages differ" failure reported above by Nyalia:
https://buildkite.com/redpanda/redpanda/builds/19555#0184f32e-7e57-43f3-aeb4-4ce620d01df5
This is still live today, after the merge of #7366 https://buildkite.com/redpanda/redpanda/builds/19621#018504dc-4135-4db7-968d-778a14c6c14f
Another, with TypeError('cannot unpack non-iterable NoneType object'):
https://buildkite.com/redpanda/redpanda/builds/19686#01850c1a-44b5-4130-8fd6-fa3e72657613
This surfaced again after including the fixes from #7819
version: v23.1.0-dev-639-g85cc3500b - 85cc3500bfaeb4c11745e0268dc3a186dd6379bb-dirty
https://buildkite.com/redpanda/vtools/builds/4801#018533b7-87a2-4c64-b23b-db696e5251de
====================================================================================================
test_id: rptest.tests.shadow_indexing_tx_test.ShadowIndexingTxTest.test_shadow_indexing_aborted_txs
status: FAIL
run time: 46.286 seconds
AssertionError()
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 135, in run
data = self.run_test()
File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 227, in run_test
return self.test_context.function(self.test)
File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 35, in wrapped
r = f(self, *args, **kwargs)
File "/home/ubuntu/redpanda/tests/rptest/utils/mode_checks.py", line 63, in f
return func(*args, **kwargs)
File "/home/ubuntu/redpanda/tests/rptest/tests/shadow_indexing_tx_test.py", line 75, in test_shadow_indexing_aborted_txs
assert 0 < committed_messages < msg_count
AssertionError