
Failure in `EndToEndTopicRecovery`.`test_restore_with_aborted_tx` / `test_shadow_indexing_aborted_txs`

Open VladLazar opened this issue 3 years ago • 19 comments

Redpanda version: dev

  File "/root/tests/rptest/tests/e2e_topic_recovery_test.py", line 288, in test_restore_with_aborted_tx
    for p_key, (c_key, c_offset) in zip_longest(producer.keys, consumed):
TypeError: cannot unpack non-iterable NoneType object

The failure essentially means that we've consumed fewer messages than were produced.
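The shape of the error can be reproduced in isolation (with hypothetical data, not the actual test payload): `zip_longest` pads the shorter sequence with `None`, and unpacking `None` into `(c_key, c_offset)` raises the `TypeError` seen in the traceback.

```python
from itertools import zip_longest

# Hypothetical data: the consumer returned one (key, offset) pair
# fewer than the producer sent keys.
produced_keys = ["k0", "k1", "k2"]
consumed = [("k0", 0), ("k1", 1)]

try:
    # zip_longest pads the shorter 'consumed' side with None, so the
    # third iteration tries to unpack None into (c_key, c_offset).
    for p_key, (c_key, c_offset) in zip_longest(produced_keys, consumed):
        pass
except TypeError as e:
    print(e)  # cannot unpack non-iterable NoneType object
```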

https://buildkite.com/redpanda/redpanda/builds/17716#018433cb-f177-4b27-9fe4-e071ecf19f49 https://buildkite.com/redpanda/redpanda/builds/17663#01843078-297e-4672-b260-c73ad6969674

VladLazar avatar Nov 01 '22 13:11 VladLazar

Can you please provide a link to the failed build?

mmaslankaprv avatar Nov 02 '22 11:11 mmaslankaprv

Can you please provide a link to the failed build?

Sorry, missed that somehow. Updated the issue.

VladLazar avatar Nov 02 '22 11:11 VladLazar

https://buildkite.com/redpanda/redpanda/builds/17716#018433cb-f177-4b27-9fe4-e071ecf19f49

dlex avatar Nov 02 '22 16:11 dlex

Looks like the recovery downloads the segment correctly but then it gets truncated right after raft bootstrap.

Lazin avatar Nov 03 '22 18:11 Lazin

https://buildkite.com/redpanda/redpanda/builds/17898#01844128-535b-4789-9063-333aa64e6d4c https://buildkite.com/redpanda/redpanda/builds/17906#01844252-a6c3-404f-bd53-f187581fbca0 https://buildkite.com/redpanda/redpanda/builds/17895#0184410d-c880-49d2-ae2f-a9b90a8b3aa8

dlex avatar Nov 04 '22 15:11 dlex

https://buildkite.com/redpanda/redpanda/builds/18018#018452c7-44cf-4178-981e-ca6b739b4a3a/6-1158

NyaliaLui avatar Nov 07 '22 19:11 NyaliaLui

Another in https://buildkite.com/redpanda/redpanda/builds/18143#0184586f-918e-4f87-ac6e-c179c53438ac/6-680

NyaliaLui avatar Nov 09 '22 14:11 NyaliaLui

https://buildkite.com/redpanda/redpanda/builds/18290#01845eaf-1a8e-455b-81fa-294d69c8301c/6-732

NyaliaLui avatar Nov 10 '22 14:11 NyaliaLui

https://buildkite.com/redpanda/redpanda/builds/18463#01846631-9be1-45a6-b5fe-028eb8f3d59d/6-451

NyaliaLui avatar Nov 11 '22 15:11 NyaliaLui

There is also a different failure in the same test:

https://buildkite.com/redpanda/redpanda/builds/18401#01846341-7814-4a89-ac1b-a5a67ce4b863/6-532

   AssertionError("produced and consumed messages differ, produced length: 13254, consumed length: 0, first mismatch: produced: b'0', consumed: None (offset: -1)")
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/shadow_indexing_tx_test.py", line 132, in test_shadow_indexing_aborted_txs
    assert (not first_mismatch), (
AssertionError: produced and consumed messages differ, produced length: 13254, consumed length: 0, first mismatch: produced: b'0', consumed: None (offset: -1)

NyaliaLui avatar Nov 11 '22 15:11 NyaliaLui

seen a couple of times on a ci-repeat in https://github.com/redpanda-data/redpanda/pull/7242

https://buildkite.com/redpanda/redpanda/builds/18462#01846631-7346-4aac-97ab-b2b489c2cf85 https://buildkite.com/redpanda/redpanda/builds/18462#01846631-7348-4f66-8c63-92500bfc3a34

test_id:    rptest.tests.e2e_topic_recovery_test.EndToEndTopicRecovery.test_restore_with_aborted_tx.recovery_overrides=.retention.bytes.1024.redpanda.remote.write.True.redpanda.remote.read.True
status:     FAIL
run time:   1 minute 11.075 seconds
 
    TypeError('cannot unpack non-iterable NoneType object')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/e2e_topic_recovery_test.py", line 313, in test_restore_with_aborted_tx
    for p_key, (c_key, c_offset) in zip_longest(producer.keys, consumed):
TypeError: cannot unpack non-iterable NoneType object

abhijat avatar Nov 14 '22 07:11 abhijat

This failure again:

AssertionError("produced and consumed messages differ, produced length: 13765, consumed length: 13774, first mismatch: produced: None, consumed: b'14866' (offset: 15265)")

https://buildkite.com/redpanda/redpanda/builds/18524#01846f6f-36d1-4107-b9e0-02cf1ab05e0a

andijcr avatar Nov 14 '22 15:11 andijcr

In the release-clang-amd64 build: https://buildkite.com/redpanda/redpanda/builds/18587#018479bc-e5b7-4777-a76d-322555029e92. This time it seems to be a problem with the test itself: TypeError('cannot unpack non-iterable NoneType object')

andijcr avatar Nov 15 '22 15:11 andijcr

Seen another here: https://buildkite.com/redpanda/redpanda/builds/18618#01847be9-e191-447b-80a7-45df27069737. Same failure mode as in the original issue.

VladLazar avatar Nov 16 '22 10:11 VladLazar

I'll work on it today. It looks like the test consumed more data than it produced, which may indicate that the consumer was able to see batches from aborted transactions.


Lazin avatar Nov 16 '22 10:11 Lazin

another one https://buildkite.com/redpanda/redpanda/builds/18587#018479bc-e5b9-4a0e-b59b-e01d2e099b3e

FAIL test: EndToEndTopicRecovery.test_restore_with_aborted_tx.recovery_overrides=.retention.bytes.1024.redpanda.remote.write.True.redpanda.remote.read.True (1/30 runs)
  failure at 2022-11-15T08:06:49.706Z: TypeError('cannot unpack non-iterable NoneType object')
      in job https://buildkite.com/redpanda/redpanda/builds/18587#018479bc-e5b9-4a0e-b59b-e01d2e099b3e

andijcr avatar Nov 16 '22 16:11 andijcr

FAIL test: EndToEndTopicRecovery.test_restore_with_aborted_tx.recovery_overrides=.retention.bytes.1024.redpanda.remote.write.True.redpanda.remote.read.True (1/18 runs)
  failure at 2022-11-18T08:14:27.356Z: TypeError('cannot unpack non-iterable NoneType object')
      in job https://buildkite.com/redpanda/redpanda/builds/18803#01848939-1855-4c6e-9bf6-fe9911044239

andijcr avatar Nov 18 '22 15:11 andijcr

This one on arm:

FAIL test: ShadowIndexingTxTest.test_shadow_indexing_aborted_txs (1/18 runs)
  failure at 2022-11-18T07:58:55.995Z: AssertionError("produced and consumed messages differ, produced length: 13821, consumed length: 2346, first mismatch: produced: b'2483', consumed: None (offset: -1)")
      in job https://buildkite.com/redpanda/redpanda/builds/18804#01848941-599a-4491-8a5b-8fec91520610

the first error:

[ERROR - 2022-11-18 07:50:13,454 - cluster - wrapped - lineno:41]: Test failed, doing failure checks...
Traceback (most recent call last):
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/shadow_indexing_tx_test.py", line 132, in test_shadow_indexing_aborted_txs
    assert (not first_mismatch), (

andijcr avatar Nov 18 '22 15:11 andijcr

FAIL test: EndToEndTopicRecovery.test_restore_with_aborted_tx.recovery_overrides=.retention.bytes.1024.redpanda.remote.write.True.redpanda.remote.read.True (2/50 runs)
  failure at 2022-11-21T07:41:53.231Z: TypeError('cannot unpack non-iterable NoneType object')
      in job https://buildkite.com/redpanda/redpanda/builds/18848#018498a2-1793-4ba8-af2f-35a508bfe19f
  failure at 2022-11-19T07:42:30.317Z: TypeError('cannot unpack non-iterable NoneType object')
      in job https://buildkite.com/redpanda/redpanda/builds/18828#01848e55-6349-4a70-a749-c9776019b3ac

jcsp avatar Nov 21 '22 12:11 jcsp

Analyzing this failure: https://buildkite.com/redpanda/redpanda/builds/19009#0184a698-b536-4473-be39-21e5bc9c8628

This looks like a hang in the remote partition read path. It is busy-spinning like this:

DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber18 kafka/topic-ynmjqwyivv/0] - remote_partition.cc:176 - Invoking 'read_some' on current log reader with config: {start_offset:{7724}, max_offset:{15212}, min_bytes:0, max_bytes:1048576, type_filter:batch_type::raft_data, first_timestamp:nullopt}
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber2~5~34430 kafka/topic-ynmjqwyivv/0] - remote_segment.cc:1004 - remote_segment_batch_reader::init_parser, start_offset: 7724
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber2~5 kafka/topic-ynmjqwyivv/0 [6444:7730]] - remote_segment.cc:190 - remote segment file input stream at offset 7724
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber2~5 kafka/topic-ynmjqwyivv/0 [6444:7730]] - remote_segment.cc:674 - segment {"ffef87a3/kafka/topic-ynmjqwyivv/0_19/6444-7730-1048663-1-v1.log.1"} hydration requested
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber2~5~34429 kafka/topic-ynmjqwyivv/0] - remote_segment.cc:1029 - remote_segment_batch_reader::stop
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber2~5~34429 kafka/topic-ynmjqwyivv/0] - remote_segment.cc:1032 - remote_segment_batch_reader::stop - parser-close
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber2~5 kafka/topic-ynmjqwyivv/0 [6444:7730]] - remote_segment.cc:557 - Segment {"ffef87a3/kafka/topic-ynmjqwyivv/0_19/6444-7730-1048663-1-v1.log.1"} requested, 1 consumers are awaiting, data file is available
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber2~5 kafka/topic-ynmjqwyivv/0 [6444:7730]] - remote_segment.cc:244 - Using index to locate 7724, the result is rp-offset: 7670, kafka-offset: 7664, file-pos: 996607
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber2~5~34430~0 kafka/topic-ynmjqwyivv/0] - remote_segment.cc:826 - accept_batch_start skip because last_kafka_offset 7675 (last_rp_offset: 7681) < config.start_offset: 7724
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber2~5~34430~0 kafka/topic-ynmjqwyivv/0] - remote_segment.cc:871 - skip_batch_start called for 7670
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber2~5~34430~0 kafka/topic-ynmjqwyivv/0] - remote_segment.cc:826 - accept_batch_start skip because last_kafka_offset 7685 (last_rp_offset: 7691) < config.start_offset: 7724
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber2~5~34430~0 kafka/topic-ynmjqwyivv/0] - remote_segment.cc:871 - skip_batch_start called for 7682
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber2~5~34430~0 kafka/topic-ynmjqwyivv/0] - remote_segment.cc:826 - accept_batch_start skip because last_kafka_offset 7694 (last_rp_offset: 7700) < config.start_offset: 7724
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber2~5~34430~0 kafka/topic-ynmjqwyivv/0] - remote_segment.cc:871 - skip_batch_start called for 7692
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber2~5~34430~0 kafka/topic-ynmjqwyivv/0] - remote_segment.cc:826 - accept_batch_start skip because last_kafka_offset 7708 (last_rp_offset: 7714) < config.start_offset: 7724
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber2~5~34430~0 kafka/topic-ynmjqwyivv/0] - remote_segment.cc:871 - skip_batch_start called for 7701
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber2~5~34430~0 kafka/topic-ynmjqwyivv/0] - remote_segment.cc:826 - accept_batch_start skip because last_kafka_offset 7720 (last_rp_offset: 7726) < config.start_offset: 7724
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber2~5~34430~0 kafka/topic-ynmjqwyivv/0] - remote_segment.cc:871 - skip_batch_start called for 7715
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber2~5~34430~0 kafka/topic-ynmjqwyivv/0] - remote_segment.cc:826 - accept_batch_start skip because last_kafka_offset 7722 (last_rp_offset: 7728) < config.start_offset: 7724
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber2~5~34430~0 kafka/topic-ynmjqwyivv/0] - remote_segment.cc:871 - skip_batch_start called for 7727
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber2~5~34430~0 kafka/topic-ynmjqwyivv/0] - remote_segment.cc:826 - accept_batch_start skip because last_kafka_offset 7723 (last_rp_offset: 7729) < config.start_offset: 7724
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber2~5~34430~0 kafka/topic-ynmjqwyivv/0] - remote_segment.cc:871 - skip_batch_start called for 7729
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber2~5~34430~0 kafka/topic-ynmjqwyivv/0] - remote_segment.cc:811 - accept_batch_start skip because record batch type is batch_type::tx_fence
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber2~5~34430~0 kafka/topic-ynmjqwyivv/0] - remote_segment.cc:871 - skip_batch_start called for 7730
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber18 kafka/topic-ynmjqwyivv/0] - remote_partition.cc:323 - maybe_reset_reader called
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber18 kafka/topic-ynmjqwyivv/0] - remote_partition.cc:344 - maybe_reset_reader, config start_offset: 7724, reader max_offset: 7730
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber18 kafka/topic-ynmjqwyivv/0] - remote_partition.cc:348 - maybe_reset_stream condition triggered
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber18 kafka/topic-ynmjqwyivv/0] - remote_partition.cc:370 - initializing new segment reader
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber18 kafka/topic-ynmjqwyivv/0] - remote_partition.cc:385 - initializing new segment reader 7724 7725
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber2 kafka/topic-ynmjqwyivv/0] - segment_state.cc:73 - creating new reader, config: {start_offset:{7724}, max_offset:{15212}, min_bytes:0, max_bytes:1048576, type_filter:batch_type::raft_data, first_timestamp:nullopt}
DEBUG 2022-11-23 23:33:17,579 [shard 1] cloud_storage - [fiber18 kafka/topic-ynmjqwyivv/0] - remote_partition.cc:410 - maybe_reset_reader completed, reader is present: true, is end of stream: false

jcsp avatar Nov 24 '22 09:11 jcsp

So the reader is asking for kafka offset 7724, which corresponds to a tx_fence batch. We find the offset but skip it because we use a raft_data filter, and then we loop forever trying to look up the offset: finding the same segment, traversing it, but ignoring the match we find.
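The stall described above can be modeled with a tiny Python sketch (a hypothetical simplification, not the actual remote_partition.cc/remote_segment.cc code): when the only batch at `start_offset` is filtered out by the type filter and the reader-reset logic re-creates a reader at the same offset, the scan never advances.

```python
# Hypothetical model of the read loop: offset 7730 holds a tx_fence
# batch, the raft_data type filter skips it, and resetting the reader
# leaves start_offset unchanged, so every iteration repeats the scan.

RAFT_DATA, TX_FENCE = "raft_data", "tx_fence"

# (offset, batch_type) pairs in the segment tail.
segment = [(7727, RAFT_DATA), (7729, RAFT_DATA), (7730, TX_FENCE)]

def read_some(start_offset, type_filter):
    """Return batches at or after start_offset that pass the filter."""
    return [(off, bt) for off, bt in segment
            if off >= start_offset and bt == type_filter]

start_offset = 7730
batches = []
for attempt in range(5):  # the real code looped with no bound
    batches = read_some(start_offset, RAFT_DATA)
    if batches:
        break
    # maybe_reset_reader: nothing was consumed, so start_offset stays
    # the same and the next iteration performs the identical scan.
print("stuck" if not batches else "progress")
```

Advancing `start_offset` past the filtered control batch (or accounting for it in the offset translation) is what breaks the cycle in this model.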

jcsp avatar Nov 24 '22 10:11 jcsp

FAIL test: ShadowIndexingTxTest.test_shadow_indexing_aborted_txs (2/21 runs)
  failure at 2022-11-25T08:01:45.162Z: AssertionError("produced and consumed messages differ, produced length: 13875, consumed length: 1162, first mismatch: produced: b'1258', consumed: None (offset: -1)") on (arm64, container)
      in job https://buildkite.com/redpanda/redpanda/builds/19082#0184ad4d-dc37-49eb-ae7e-6588f6a7f85a
  failure at 2022-11-25T08:22:12.464Z: AssertionError("produced and consumed messages differ, produced length: 13155, consumed length: 8668, first mismatch: produced: b'9741', consumed: None (offset: -1)") on (amd64, container)
      in job https://buildkite.com/redpanda/redpanda/builds/19082#0184ad4d-dc3a-4868-83ca-09ab8595c673

jcsp avatar Nov 25 '22 09:11 jcsp

https://buildkite.com/redpanda/redpanda/builds/19140#0184b64e-6f8e-44a8-99a0-870ca1e100a5/6-273 another occurrence

RafalKorepta avatar Nov 27 '22 09:11 RafalKorepta

FAIL test: ShadowIndexingTxTest.test_shadow_indexing_aborted_txs (3/62 runs)
  failure at 2022-11-26T07:45:41.678Z: AssertionError("produced and consumed messages differ, produced length: 13601, consumed length: 6458, first mismatch: produced: b'7414', consumed: None (offset: -1)") on (arm64, container)
      in job https://buildkite.com/redpanda/redpanda/builds/19135#0184b261-e74e-4c74-a8d8-e2482a8c87a0

mmaslankaprv avatar Nov 28 '22 10:11 mmaslankaprv

Another instance: https://ci-artifacts.dev.vectorized.cloud/redpanda/0184f328-8dd6-44ca-b4c1-688c8a356b8d/vbuild/ducktape/results/2022-12-08--001/report.html

bharathv avatar Dec 08 '22 20:12 bharathv

Another instance related to the "produced and consumed messages differ" failure reported above by Nyalia:

https://buildkite.com/redpanda/redpanda/builds/19555#0184f32e-7e57-43f3-aeb4-4ce620d01df5

dotnwat avatar Dec 09 '22 19:12 dotnwat

This is still live today, after the merge of #7366: https://buildkite.com/redpanda/redpanda/builds/19621#018504dc-4135-4db7-968d-778a14c6c14f

jcsp avatar Dec 12 '22 14:12 jcsp

Another https://buildkite.com/redpanda/redpanda/builds/19621#018504dc-4135-4db7-968d-778a14c6c14f

with TypeError('cannot unpack non-iterable NoneType object')

VadimPlh avatar Dec 12 '22 15:12 VadimPlh

https://buildkite.com/redpanda/redpanda/builds/19686#01850c1a-44b5-4130-8fd6-fa3e72657613

VadimPlh avatar Dec 14 '22 15:12 VadimPlh

This surfaced again after including the fixes from #7819

version: v23.1.0-dev-639-g85cc3500b - 85cc3500bfaeb4c11745e0268dc3a186dd6379bb-dirty

https://buildkite.com/redpanda/vtools/builds/4801#018533b7-87a2-4c64-b23b-db696e5251de

====================================================================================================
test_id:    rptest.tests.shadow_indexing_tx_test.ShadowIndexingTxTest.test_shadow_indexing_aborted_txs
status:     FAIL
run time:   46.286 seconds


    AssertionError()
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/utils/mode_checks.py", line 63, in f
    return func(*args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/tests/shadow_indexing_tx_test.py", line 75, in test_shadow_indexing_aborted_txs
    assert 0 < committed_messages < msg_count
AssertionError

bharathv avatar Dec 21 '22 17:12 bharathv