archival_stm snapshot: init _last_clean_at as -inf when applying a dirty snapshot
~This PR includes changes from https://github.com/redpanda-data/redpanda/pull/25728 for the test to pass; only the last two commits are relevant.~
I made this change while debugging https://redpandadata.atlassian.net/browse/CORE-9506. The issue is that when a topic is deleted quickly after creation, the broker sometimes errors with `Failed to fetch manifest during finalize()`, as it expects to download a partition manifest that has never been uploaded.
Questions to Storage team:
- Does the change make sense? I'm not familiar with the logic of this STM, e.g. `_last_dirty_offset` is not changed on snapshot application at all; is that right?
- The test (last commit) passes, but only if scrubbing is turned off. Should we aim for scrubbing to pass as well, or is it expected to see some anomalies if we're recreating topics so wildly?
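To make the intent of the title concrete, here is a minimal sketch of the snapshot-application logic being proposed. This is a hypothetical, heavily simplified model: the type and member names (`archival_state`, `apply_snapshot`, the `dirty` flag) are illustrative only and do not reflect Redpanda's actual `archival_stm` API; `-inf` is modeled as the minimum representable offset.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical, simplified model of the archival STM state; the real
// archival_stm in Redpanda is considerably more involved.
using offset_t = int64_t;
constexpr offset_t neg_inf = std::numeric_limits<offset_t>::min(); // "-inf"

struct snapshot {
    offset_t last_offset;
    bool dirty; // snapshot was taken while uploads were still pending
};

struct archival_state {
    offset_t last_clean_at{neg_inf};
    offset_t last_dirty_offset{neg_inf};

    void apply_snapshot(const snapshot& s) {
        if (s.dirty) {
            // A dirty snapshot gives no evidence that the manifest was
            // ever uploaded, so record "never clean" rather than
            // assuming clean-at == snapshot offset.
            last_clean_at = neg_inf;
            last_dirty_offset = s.last_offset;
        } else {
            last_clean_at = s.last_offset;
        }
    }
};
```

The point of the sketch is the `dirty` branch: after applying a dirty snapshot, `finalize()` logic can see `last_clean_at == -inf` and avoid expecting a manifest that was never uploaded.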
Backports Required
- [ ] none - not a bug fix
- [ ] none - this is a backport
- [ ] none - issue does not exist in previous branches
- [ ] none - papercut/not impactful enough to backport
- [ ] v25.1.x
- [ ] v24.3.x
- [ ] v24.2.x
- [ ] v24.1.x
Release Notes
@bashtanov is this still draft?
@nvartolomei I'm looking for a review indeed. But I'm after a discussion rather than an approval as it is.
Converting to ready-for-review so it shows up in people's to-review lists
CI test results
test results on build#64496
| test_id | test_kind | job_url | test_status | passed |
|---|---|---|---|---|
| rptest.tests.datalake.throttling_test.DatalakeThrottlingTest.test_basic_throttling.cloud_storage_type=CloudStorageType.S3.catalog_type=CatalogType.NESSIE | ducktape | https://buildkite.com/redpanda/redpanda/builds/64496#01963562-04e8-4bc2-bcbe-2065f70ce749 | FLAKY | 1/2 |
| rptest.tests.upgrade_test.UpgradeBackToBackTest.test_upgrade_with_all_workloads.single_upgrade=False | ducktape | https://buildkite.com/redpanda/redpanda/builds/64496#01963562-04e8-429c-abdf-ee973e368c00 | FLAKY | 1/2 |
| rptest.tests.upgrade_test.UpgradeBackToBackTest.test_upgrade_with_all_workloads.single_upgrade=False | ducktape | https://buildkite.com/redpanda/redpanda/builds/64496#01963575-8c91-4390-a8a0-210bd1073616 | FLAKY | 1/4 |
test results on build#65179
| test_id | test_kind | job_url | test_status | passed |
|---|---|---|---|---|
| rptest.tests.controller_snapshot_test.ControllerSnapshotTest.test_join_restart_catch_up | ducktape | https://buildkite.com/redpanda/redpanda/builds/65179#01967cec-98b2-40cc-922e-a949731c68a7 | FLAKY | 20/21 |
| rptest.tests.maintenance_test.MaintenanceTest.test_maintenance_sticky.use_rpk=True | ducktape | https://buildkite.com/redpanda/redpanda/builds/65179#01967cec-98b2-40cc-922e-a949731c68a7 | FLAKY | 20/21 |
| rptest.tests.simple_e2e_test.SimpleEndToEndTest.test_consumer_interruption | ducktape | https://buildkite.com/redpanda/redpanda/builds/65179#01967cec-98b1-42f8-82dd-c4f977fa6177 | FLAKY | 16/21 |
| rptest.tests.simple_e2e_test.SimpleEndToEndTest.test_consumer_interruption | ducktape | https://buildkite.com/redpanda/redpanda/builds/65179#01967cef-f00d-42f7-9935-55bd5d3c5043 | FLAKY | 10/21 |
| rptest.tests.upgrade_test.UpgradeBackToBackTest.test_upgrade_with_all_workloads.single_upgrade=False | ducktape | https://buildkite.com/redpanda/redpanda/builds/65179#01967cef-f00e-4508-a817-a3d558bd3bcb | FLAKY | 15/21 |
test results on build#66848
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason |
|---|---|---|---|---|---|---|---|
| DatalakeBatchingTest | test_batching | {"catalog_type": "rest_jdbc", "cloud_storage_type": 1, "expect_large_files": true, "query_engine": "spark"} | ducktape | https://buildkite.com/redpanda/redpanda/builds/66848#01973c21-0518-48d2-a2f5-ff3f647fcfcd | FLAKY | 20/21 | upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS |
| NodeWiseRecoveryTest | test_recovery_local_data_missing | | ducktape | https://buildkite.com/redpanda/redpanda/builds/66848#01973c21-0518-48d2-a2f5-ff3f647fcfcd | FAIL | 0/1 | The test has failed across all retries |
| PartitionReassignmentsTest | test_add_partitions_with_inprogress_reassignments | | ducktape | https://buildkite.com/redpanda/redpanda/builds/66848#01973c1e-1a87-4dfe-83ce-fe8c607f6f54 | FLAKY | 19/21 | upstream reliability is '97.61904761904762'. current run reliability is '90.47619047619048'. drift is 7.14286 and the allowed drift is set to 50. The test should PASS |
test results on build#66917
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason |
|---|---|---|---|---|---|---|---|
| NodePoolMigrationTest | test_migrating_redpanda_nodes_to_new_pool | {"balancing_mode": "node_add", "cleanup_policy": "compact,delete", "test_mode": "tiered_storage_fast_moves"} | ducktape | https://buildkite.com/redpanda/redpanda/builds/66917#01973f8e-a239-41a1-8c17-7784d3721ef9 | FLAKY | 19/21 | upstream reliability is '99.6'. current run reliability is '90.47619047619048'. drift is 9.12381 and the allowed drift is set to 50. The test should PASS |
| src/v/cluster/tests/tx_compaction_tests | src/v/cluster/tests/tx_compaction_tests | | unit | https://buildkite.com/redpanda/redpanda/builds/66917#01973f5c-c0db-4d0e-8652-0175f88445e9 | FAIL | 0/1 | |
@nvartolomei Thanks very much. I'm sorry, I must have confused you with this PR. As stated in the description, I'm mostly interested in the last two commits, the penultimate one in particular. In any case I'll address your comments, thanks again!
I think that we can detect a situation where the manifest was never uploaded by checking the error code. If the S3 GET request returned a NoSuchKey error, it's guaranteed that the manifest was never uploaded, which means that the partition should be empty and there is nothing to finalise.
Well, we don't want to rely on that not-found error where we can avoid it, as it might be mistaken. Furthermore, that error is the reason why we found the bug leading to this PR and to the linked one https://github.com/redpanda-data/redpanda/pull/25728. We were creating STMs with incorrect configs...
> We don't want to rely on that not found error where we can as it might be mistaken.
How exactly could it be mistaken? You're basically running a database query (an S3 API request), and it tells you that there is no such key. This can only happen if there is indeed no such key. We're relying on this during the recovery process, I believe. I'd prefer to have clean-at/dirty-at in the snapshot for sure, but it might be tricky to do this in a minor release.
> How exactly it could be mistaken?
We could be requesting the wrong path. This is what happens in the buggy scenario we currently have. The fact that we can leverage STM information (except when it is fresh from a snapshot) to tell whether a not-found result is expected allows us to prevent or error on such bugs. This turns out to be quite valuable: we caught a pretty bad bug by having such double checks.
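The "double check" idea from this exchange can be sketched as follows. This is a hypothetical illustration, not Redpanda's actual API: the names `download_result` and `missing_manifest_is_expected` are invented here, and `-inf` is modeled as `INT64_MIN`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch: a not-found result from the object store is only
// treated as benign if the STM state agrees that the manifest was never
// uploaded. Otherwise a not-found is surfaced as a likely bug (e.g. we
// may have requested the wrong path).
enum class download_result { success, not_found, other_error };

constexpr int64_t never = INT64_MIN; // "-inf": no clean upload recorded

// Returns true if finalize can safely skip: the manifest is missing
// *and* the STM never recorded a clean upload for this partition.
bool missing_manifest_is_expected(download_result r, int64_t last_clean_at) {
    if (r != download_result::not_found) {
        return false;
    }
    return last_clean_at == never;
}
```

Under this scheme, a NoSuchKey response with `last_clean_at > -inf` is flagged as an anomaly rather than silently swallowed, which is exactly the cross-check that surfaced the misconfigured-STM bug mentioned above.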
What's the state of this PR @bashtanov? Seems like it might be necessary for CI health at the moment
:tada: Snyk checks have passed. No issues have been found so far.
:white_check_mark: security/snyk check is complete. No issues have been found. (View Details)
:white_check_mark: license/snyk check is complete. No issues have been found. (View Details)
@WillemKauf the commits not related to the subject change have been merged in https://github.com/redpanda-data/redpanda/pull/25728; this should already improve test stability, but unfortunately it will not eliminate the failures completely.
@Lazin @nvartolomei I added saving and restoring of the last clean and dirty offsets to the local snapshot. It looks straightforward thanks to serde's structure backwards-compatibility mechanism, but we still need to fall back to default values if they are missing from the snapshot.
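The fallback-to-defaults pattern described in that comment can be sketched like this. It is a hypothetical, self-contained model of the append-only/backwards-compatible idea (newer fields appended at the end, older snapshots simply shorter), not Redpanda's serde implementation; all names (`local_snapshot`, `read_snapshot`) are invented for illustration.

```cpp
#include <cstdint>
#include <vector>

// Toy wire format: a flat list of fields in declaration order. Older
// writers produced fewer fields; newer readers must tolerate that.
struct buffer {
    std::vector<int64_t> fields;
};

struct local_snapshot {
    int64_t start_offset{0};
    // Newly added fields; must default sensibly when absent because the
    // snapshot was written by an older version.
    int64_t last_clean_at{-1};
    int64_t last_dirty_offset{-1};
};

local_snapshot read_snapshot(const buffer& b) {
    local_snapshot s;
    if (b.fields.size() > 0) { s.start_offset = b.fields[0]; }
    if (b.fields.size() > 1) { s.last_clean_at = b.fields[1]; }
    if (b.fields.size() > 2) { s.last_dirty_offset = b.fields[2]; }
    // Old snapshots leave the new fields at their declared defaults.
    return s;
}
```

The essential property is that a short (old-format) payload deserializes without error and the new offsets come back as their defaults, which is the explicit fallback the comment mentions.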
Retry command for Build#65179
please wait until all jobs are finished before running the slash command
/ci-repeat 1
tests/rptest/tests/simple_e2e_test.py::SimpleEndToEndTest.test_consumer_interruption
tests/rptest/tests/controller_snapshot_test.py::ControllerSnapshotTest.test_join_restart_catch_up
tests/rptest/tests/maintenance_test.py::MaintenanceTest.test_maintenance_sticky@{"use_rpk":true}
the failure is https://redpandadata.atlassian.net/browse/CORE-9229
/ci-repeat 1 tests/rptest/tests/simple_e2e_test.py::SimpleEndToEndTest.test_consumer_interruption tests/rptest/tests/controller_snapshot_test.py::ControllerSnapshotTest.test_join_restart_catch_up tests/rptest/tests/maintenance_test.py::MaintenanceTest.test_maintenance_sticky@{"use_rpk":true}
/ci-repeat 1 tests/rptest/tests/simple_e2e_test.py::SimpleEndToEndTest.test_consumer_interruption tests/rptest/tests/controller_snapshot_test.py::ControllerSnapshotTest.test_join_restart_catch_up tests/rptest/tests/maintenance_test.py::MaintenanceTest.test_maintenance_sticky@{"use_rpk":true}
fixture test fix is here (and the problem is not related to the change)
/ci-repeat 1
Retry command for Build#66848
please wait until all jobs are finished before running the slash command
/ci-repeat 1
tests/rptest/tests/partition_force_reconfiguration_test.py::NodeWiseRecoveryTest.test_recovery_local_data_missing
/ci-repeat 1