redpanda icon indicating copy to clipboard operation
redpanda copied to clipboard

archival_stm snapshot: init _last_clean_at as -inf when applying a dirty snapshot

Open bashtanov opened this issue 9 months ago • 20 comments

~This PR includes changes from https://github.com/redpanda-data/redpanda/pull/25728 for the test to pass, only two last commits are relevant.~

I made this change when debugging https://redpandadata.atlassian.net/browse/CORE-9506. The issue is when a topic is deleted quickly after creation the broker sometimes errors with Failed to fetch manifest during finalize()) as it expects to download partition manifest that has never been uploaded.

Questions to Storage team:

  1. Does the change make sense? I'm not familiar with the logic of this STM, e.g. _last_dirty_offset is not changed on snapshot application at all, is that right?
  2. The test (last commit) passes but only if scrubbing is turned off. Should we aim for scrubbing to pass as well or is it expected to have some anomalies if we're recreating topics so wildly?

Backports Required

  • [ ] none - not a bug fix
  • [ ] none - this is a backport
  • [ ] none - issue does not exist in previous branches
  • [ ] none - papercut/not impactful enough to backport
  • [ ] v25.1.x
  • [ ] v24.3.x
  • [ ] v24.2.x
  • [ ] v24.1.x

Release Notes

bashtanov avatar Apr 11 '25 14:04 bashtanov

@bashtanov is this still draft?

nvartolomei avatar Apr 14 '25 15:04 nvartolomei

@nvartolomei I'm looking for a review indeed. But I'm rather after a discussion than for an approval as it is.

bashtanov avatar Apr 14 '25 15:04 bashtanov

converting to ready-for-review for it to show up in people's to-review lists

bashtanov avatar Apr 14 '25 16:04 bashtanov

CI test results

test results on build#64496
test_id test_kind job_url test_status passed
rptest.tests.datalake.throttling_test.DatalakeThrottlingTest.test_basic_throttling.cloud_storage_type=CloudStorageType.S3.catalog_type=CatalogType.NESSIE ducktape https://buildkite.com/redpanda/redpanda/builds/64496#01963562-04e8-4bc2-bcbe-2065f70ce749 FLAKY 1/2
rptest.tests.upgrade_test.UpgradeBackToBackTest.test_upgrade_with_all_workloads.single_upgrade=False ducktape https://buildkite.com/redpanda/redpanda/builds/64496#01963562-04e8-429c-abdf-ee973e368c00 FLAKY 1/2
rptest.tests.upgrade_test.UpgradeBackToBackTest.test_upgrade_with_all_workloads.single_upgrade=False ducktape https://buildkite.com/redpanda/redpanda/builds/64496#01963575-8c91-4390-a8a0-210bd1073616 FLAKY 1/4
test results on build#65179
test_id test_kind job_url test_status passed
rptest.tests.controller_snapshot_test.ControllerSnapshotTest.test_join_restart_catch_up ducktape https://buildkite.com/redpanda/redpanda/builds/65179#01967cec-98b2-40cc-922e-a949731c68a7 FLAKY 20/21
rptest.tests.maintenance_test.MaintenanceTest.test_maintenance_sticky.use_rpk=True ducktape https://buildkite.com/redpanda/redpanda/builds/65179#01967cec-98b2-40cc-922e-a949731c68a7 FLAKY 20/21
rptest.tests.simple_e2e_test.SimpleEndToEndTest.test_consumer_interruption ducktape https://buildkite.com/redpanda/redpanda/builds/65179#01967cec-98b1-42f8-82dd-c4f977fa6177 FLAKY 16/21
rptest.tests.simple_e2e_test.SimpleEndToEndTest.test_consumer_interruption ducktape https://buildkite.com/redpanda/redpanda/builds/65179#01967cef-f00d-42f7-9935-55bd5d3c5043 FLAKY 10/21
rptest.tests.upgrade_test.UpgradeBackToBackTest.test_upgrade_with_all_workloads.single_upgrade=False ducktape https://buildkite.com/redpanda/redpanda/builds/65179#01967cef-f00e-4508-a817-a3d558bd3bcb FLAKY 15/21
test results on build#66848
test_class test_method test_arguments test_kind job_url test_status passed reason
DatalakeBatchingTest test_batching {"catalog_type": "rest_jdbc", "cloud_storage_type": 1, "expect_large_files": true, "query_engine": "spark"} ducktape https://buildkite.com/redpanda/redpanda/builds/66848#01973c21-0518-48d2-a2f5-ff3f647fcfcd FLAKY 20/21 upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS
NodeWiseRecoveryTest test_recovery_local_data_missing ducktape https://buildkite.com/redpanda/redpanda/builds/66848#01973c21-0518-48d2-a2f5-ff3f647fcfcd FAIL 0/1 The test has failed across all retries
PartitionReassignmentsTest test_add_partitions_with_inprogress_reassignments ducktape https://buildkite.com/redpanda/redpanda/builds/66848#01973c1e-1a87-4dfe-83ce-fe8c607f6f54 FLAKY 19/21 upstream reliability is '97.61904761904762'. current run reliability is '90.47619047619048'. drift is 7.14286 and the allowed drift is set to 50. The test should PASS
test results on build#66917
test_class test_method test_arguments test_kind job_url test_status passed reason
NodePoolMigrationTest test_migrating_redpanda_nodes_to_new_pool {"balancing_mode": "node_add", "cleanup_policy": "compact,delete", "test_mode": "tiered_storage_fast_moves"} ducktape https://buildkite.com/redpanda/redpanda/builds/66917#01973f8e-a239-41a1-8c17-7784d3721ef9 FLAKY 19/21 upstream reliability is '99.6'. current run reliability is '90.47619047619048'. drift is 9.12381 and the allowed drift is set to 50. The test should PASS
src/v/cluster/tests/tx_compaction_tests src/v/cluster/tests/tx_compaction_tests unit https://buildkite.com/redpanda/redpanda/builds/66917#01973f5c-c0db-4d0e-8652-0175f88445e9 FAIL 0/1

vbotbuildovich avatar Apr 14 '25 20:04 vbotbuildovich

@nvartolomei Thanks very much. I'm sorry I must have confused you with this PR. As stated in the description I'm mostly interested concerned about the last 2 commits, the penultimate one in particular. In any case I'll address your comments, thanks again!

bashtanov avatar Apr 15 '25 16:04 bashtanov

I think that we can detect a situation when the manifest was never uploaded by checking the error code. If S3 GET request returned NoSuchKey error it's guaranteed that the manifest was never uploaded which means that the partition should be empty and there is nothing to finalise.

Well. We don't want to rely on that not found error where we can as it might be mistaken. Furthermore, that error is the reason why we found the bug leading to this PR and to the linked one https://github.com/redpanda-data/redpanda/pull/25728. We were creating STMs with incorrect configs...

nvartolomei avatar Apr 16 '25 14:04 nvartolomei

We don't want to rely on that not found error where we can as it might be mistaken.

How exactly it could be mistaken? You basically running a database query (S3 API request) and it tells you that there is no such key. This can only happen if there is indeed no such key. We're relying on this during recovery process I believe. I'd prefer to have clean-at/dirty-at in the snapshot for sure but it might be tricky to do this in the minor release.

Lazin avatar Apr 16 '25 14:04 Lazin

How exactly it could be mistaken?

We are requesting the wrong path. This is what happens in the buggy scenario we currently have. The fact that we can leverage stm information (except when it is fresh from a snapshot) to inform whether not found is expected or not allows us to prevent/error on such bugs. This is quite valuable it turns out. We caught a pretty bad bug by having such double checks.

nvartolomei avatar Apr 16 '25 16:04 nvartolomei

What's the state of this PR @bashtanov? Seems like it might be necessary for CI health at the moment

WillemKauf avatar Apr 25 '25 18:04 WillemKauf

:tada: Snyk checks have passed. No issues have been found so far.

:white_check_mark: security/snyk check is complete. No issues have been found. (View Details)

:white_check_mark: license/snyk check is complete. No issues have been found. (View Details)

secpanda avatar Apr 28 '25 11:04 secpanda

:tada: Snyk checks have passed. No issues have been found so far.

:white_check_mark: security/snyk check is complete. No issues have been found. (View Details)

:white_check_mark: license/snyk check is complete. No issues have been found. (View Details)

secpanda avatar Apr 28 '25 11:04 secpanda

@WillemKauf the commits not related to the subject change have been merged in https://github.com/redpanda-data/redpanda/pull/25728, this should already improve tests stability but unfortunately will not eliminate failures completely.

@Lazin @nvartolomei I added saving and restoring of last clean and dirty offsets into local snapshots. It looks straightforward thanks to serde structure backwards compatibility mechanism, but we still need to fall back to default values if they are missing in the snapshot.

bashtanov avatar Apr 28 '25 13:04 bashtanov

Retry command for Build#65179

please wait until all jobs are finished before running the slash command



/ci-repeat 1
tests/rptest/tests/simple_e2e_test.py::SimpleEndToEndTest.test_consumer_interruption
tests/rptest/tests/controller_snapshot_test.py::ControllerSnapshotTest.test_join_restart_catch_up
tests/rptest/tests/maintenance_test.py::MaintenanceTest.test_maintenance_sticky@{"use_rpk":true}

vbotbuildovich avatar Apr 28 '25 16:04 vbotbuildovich

the failure is https://redpandadata.atlassian.net/browse/CORE-9229

bashtanov avatar Apr 28 '25 16:04 bashtanov

/ci-repeat 1 tests/rptest/tests/simple_e2e_test.py::SimpleEndToEndTest.test_consumer_interruption tests/rptest/tests/controller_snapshot_test.py::ControllerSnapshotTest.test_join_restart_catch_up tests/rptest/tests/maintenance_test.py::MaintenanceTest.test_maintenance_sticky@{"use_rpk":true}

travisdowns avatar Apr 29 '25 17:04 travisdowns

/ci-repeat 1 tests/rptest/tests/simple_e2e_test.py::SimpleEndToEndTest.test_consumer_interruption tests/rptest/tests/controller_snapshot_test.py::ControllerSnapshotTest.test_join_restart_catch_up tests/rptest/tests/maintenance_test.py::MaintenanceTest.test_maintenance_sticky@{"use_rpk":true}

bashtanov avatar Jun 02 '25 08:06 bashtanov

fixture test fix is here (and the problem is not related to the change)

bashtanov avatar Jun 04 '25 14:06 bashtanov

/ci-repeat 1

bashtanov avatar Jun 04 '25 14:06 bashtanov

Retry command for Build#66848

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/partition_force_reconfiguration_test.py::NodeWiseRecoveryTest.test_recovery_local_data_missing

vbotbuildovich avatar Jun 04 '25 21:06 vbotbuildovich

/ci-repeat 1

bashtanov avatar Jun 05 '25 09:06 bashtanov