archival_stm snapshot: init _last_clean_at as -inf when applying a dirty snapshot
~This PR includes changes from https://github.com/redpanda-data/redpanda/pull/25728 for the test to pass; only the last two commits are relevant.~
I made this change while debugging https://redpandadata.atlassian.net/browse/CORE-9506. The issue is that when a topic is deleted quickly after creation, the broker sometimes errors with `Failed to fetch manifest during finalize()`, as it expects to download a partition manifest that has never been uploaded.
Questions to Storage team:
- Does the change make sense? I'm not familiar with the logic of this STM, e.g. `_last_dirty_offset` is not changed on snapshot application at all; is that right?
- The test (last commit) passes, but only if scrubbing is turned off. Should we aim for scrubbing to pass as well, or is it expected to see some anomalies if we're recreating topics so wildly?
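To make the intent of the title concrete, here is a minimal sketch of the snapshot-application logic being proposed. This is a hypothetical, heavily simplified model: the type and member names (`archival_state`, `apply_snapshot`, the `dirty` flag) are illustrative only and do not reflect Redpanda's actual `archival_stm` API; `-inf` is modeled as the minimum representable offset.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical, simplified model of the archival STM state; the real
// archival_stm in Redpanda is considerably more involved.
using offset_t = int64_t;
constexpr offset_t neg_inf = std::numeric_limits<offset_t>::min(); // "-inf"

struct snapshot {
    offset_t last_offset;
    bool dirty; // snapshot was taken while uploads were still pending
};

struct archival_state {
    offset_t last_clean_at{neg_inf};
    offset_t last_dirty_offset{neg_inf};

    void apply_snapshot(const snapshot& s) {
        if (s.dirty) {
            // A dirty snapshot gives no evidence that the manifest was
            // ever uploaded, so record "never clean" rather than
            // assuming clean-at == snapshot offset.
            last_clean_at = neg_inf;
            last_dirty_offset = s.last_offset;
        } else {
            last_clean_at = s.last_offset;
        }
    }
};
```

The point of the sketch is the `dirty` branch: after applying a dirty snapshot, `finalize()` logic can see `last_clean_at == -inf` and avoid expecting a manifest that was never uploaded.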
Backports Required
- [ ] none - not a bug fix
- [ ] none - this is a backport
- [ ] none - issue does not exist in previous branches
- [ ] none - papercut/not impactful enough to backport
- [ ] v25.1.x
- [ ] v24.3.x
- [ ] v24.2.x
- [ ] v24.1.x
Release Notes
@bashtanov is this still draft?
@nvartolomei I'm looking for a review indeed. But I'm after a discussion rather than an approval as it is.
Converting to ready-for-review so it shows up in people's to-review lists
CI test results
test results on build#64496
| test_id | test_kind | job_url | test_status | passed |
|---|---|---|---|---|
| rptest.tests.datalake.throttling_test.DatalakeThrottlingTest.test_basic_throttling.cloud_storage_type=CloudStorageType.S3.catalog_type=CatalogType.NESSIE | ducktape | https://buildkite.com/redpanda/redpanda/builds/64496#01963562-04e8-4bc2-bcbe-2065f70ce749 | FLAKY | 1/2 |
| rptest.tests.upgrade_test.UpgradeBackToBackTest.test_upgrade_with_all_workloads.single_upgrade=False | ducktape | https://buildkite.com/redpanda/redpanda/builds/64496#01963562-04e8-429c-abdf-ee973e368c00 | FLAKY | 1/2 |
| rptest.tests.upgrade_test.UpgradeBackToBackTest.test_upgrade_with_all_workloads.single_upgrade=False | ducktape | https://buildkite.com/redpanda/redpanda/builds/64496#01963575-8c91-4390-a8a0-210bd1073616 | FLAKY | 1/4 |
test results on build#65179
| test_id | test_kind | job_url | test_status | passed |
|---|---|---|---|---|
| rptest.tests.controller_snapshot_test.ControllerSnapshotTest.test_join_restart_catch_up | ducktape | https://buildkite.com/redpanda/redpanda/builds/65179#01967cec-98b2-40cc-922e-a949731c68a7 | FLAKY | 20/21 |
| rptest.tests.maintenance_test.MaintenanceTest.test_maintenance_sticky.use_rpk=True | ducktape | https://buildkite.com/redpanda/redpanda/builds/65179#01967cec-98b2-40cc-922e-a949731c68a7 | FLAKY | 20/21 |
| rptest.tests.simple_e2e_test.SimpleEndToEndTest.test_consumer_interruption | ducktape | https://buildkite.com/redpanda/redpanda/builds/65179#01967cec-98b1-42f8-82dd-c4f977fa6177 | FLAKY | 16/21 |
| rptest.tests.simple_e2e_test.SimpleEndToEndTest.test_consumer_interruption | ducktape | https://buildkite.com/redpanda/redpanda/builds/65179#01967cef-f00d-42f7-9935-55bd5d3c5043 | FLAKY | 10/21 |
| rptest.tests.upgrade_test.UpgradeBackToBackTest.test_upgrade_with_all_workloads.single_upgrade=False | ducktape | https://buildkite.com/redpanda/redpanda/builds/65179#01967cef-f00e-4508-a817-a3d558bd3bcb | FLAKY | 15/21 |
test results on build#66848
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason |
|---|---|---|---|---|---|---|---|
| DatalakeBatchingTest | test_batching | {"catalog_type": "rest_jdbc", "cloud_storage_type": 1, "expect_large_files": true, "query_engine": "spark"} | ducktape | https://buildkite.com/redpanda/redpanda/builds/66848#01973c21-0518-48d2-a2f5-ff3f647fcfcd | FLAKY | 20/21 | upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS |
| NodeWiseRecoveryTest | test_recovery_local_data_missing | | ducktape | https://buildkite.com/redpanda/redpanda/builds/66848#01973c21-0518-48d2-a2f5-ff3f647fcfcd | FAIL | 0/1 | The test has failed across all retries |
| PartitionReassignmentsTest | test_add_partitions_with_inprogress_reassignments | | ducktape | https://buildkite.com/redpanda/redpanda/builds/66848#01973c1e-1a87-4dfe-83ce-fe8c607f6f54 | FLAKY | 19/21 | upstream reliability is '97.61904761904762'. current run reliability is '90.47619047619048'. drift is 7.14286 and the allowed drift is set to 50. The test should PASS |
test results on build#66917
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason |
|---|---|---|---|---|---|---|---|
| NodePoolMigrationTest | test_migrating_redpanda_nodes_to_new_pool | {"balancing_mode": "node_add", "cleanup_policy": "compact,delete", "test_mode": "tiered_storage_fast_moves"} | ducktape | https://buildkite.com/redpanda/redpanda/builds/66917#01973f8e-a239-41a1-8c17-7784d3721ef9 | FLAKY | 19/21 | upstream reliability is '99.6'. current run reliability is '90.47619047619048'. drift is 9.12381 and the allowed drift is set to 50. The test should PASS |
| src/v/cluster/tests/tx_compaction_tests | src/v/cluster/tests/tx_compaction_tests | | unit | https://buildkite.com/redpanda/redpanda/builds/66917#01973f5c-c0db-4d0e-8652-0175f88445e9 | FAIL | 0/1 | |
@nvartolomei Thanks very much. I'm sorry, I must have confused you with this PR. As stated in the description, I'm mostly interested in the last two commits, the penultimate one in particular. In any case I'll address your comments, thanks again!
I think that we can detect a situation where the manifest was never uploaded by checking the error code. If the S3 GET request returned a NoSuchKey error, it's guaranteed that the manifest was never uploaded, which means that the partition should be empty and there is nothing to finalise.
Well, we don't want to rely on that not-found error where we can avoid it, as it might be mistaken. Furthermore, that error is the reason why we found the bug leading to this PR and to the linked one https://github.com/redpanda-data/redpanda/pull/25728. We were creating STMs with incorrect configs...
> We don't want to rely on that not found error where we can as it might be mistaken.
How exactly could it be mistaken? You're basically running a database query (an S3 API request), and it tells you that there is no such key. This can only happen if there is indeed no such key. We're relying on this during the recovery process, I believe. I'd prefer to have clean-at/dirty-at in the snapshot for sure, but it might be tricky to do this in a minor release.
> How exactly it could be mistaken?
We could be requesting the wrong path. This is what happens in the buggy scenario we currently have. The fact that we can leverage STM information (except when it is fresh from a snapshot) to tell whether a not-found result is expected allows us to prevent or error on such bugs. This turns out to be quite valuable: we caught a pretty bad bug by having such double checks.
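The "double check" idea from this exchange can be sketched as follows. This is a hypothetical illustration, not Redpanda's actual API: the names `download_result` and `missing_manifest_is_expected` are invented here, and `-inf` is modeled as `INT64_MIN`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch: a not-found result from the object store is only
// treated as benign if the STM state agrees that the manifest was never
// uploaded. Otherwise a not-found is surfaced as a likely bug (e.g. we
// may have requested the wrong path).
enum class download_result { success, not_found, other_error };

constexpr int64_t never = INT64_MIN; // "-inf": no clean upload recorded

// Returns true if finalize can safely skip: the manifest is missing
// *and* the STM never recorded a clean upload for this partition.
bool missing_manifest_is_expected(download_result r, int64_t last_clean_at) {
    if (r != download_result::not_found) {
        return false;
    }
    return last_clean_at == never;
}
```

Under this scheme, a NoSuchKey response with `last_clean_at > -inf` is flagged as an anomaly rather than silently swallowed, which is exactly the cross-check that surfaced the misconfigured-STM bug mentioned above.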
What's the state of this PR @bashtanov? Seems like it might be necessary for CI health at the moment
:tada: Snyk checks have passed. No issues have been found so far.
:white_check_mark: security/snyk check is complete. No issues have been found. (View Details)
:white_check_mark: license/snyk check is complete. No issues have been found. (View Details)
@WillemKauf the commits not related to the subject change have been merged in https://github.com/redpanda-data/redpanda/pull/25728; this should already improve test stability, but unfortunately it will not eliminate the failures completely.
@Lazin @nvartolomei I added saving and restoring of the last clean and dirty offsets to the local snapshot. It looks straightforward thanks to serde's structure backwards-compatibility mechanism, but we still need to fall back to default values if they are missing from the snapshot.
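The fallback-to-defaults pattern described in that comment can be sketched like this. It is a hypothetical, self-contained model of the append-only/backwards-compatible idea (newer fields appended at the end, older snapshots simply shorter), not Redpanda's serde implementation; all names (`local_snapshot`, `read_snapshot`) are invented for illustration.

```cpp
#include <cstdint>
#include <vector>

// Toy wire format: a flat list of fields in declaration order. Older
// writers produced fewer fields; newer readers must tolerate that.
struct buffer {
    std::vector<int64_t> fields;
};

struct local_snapshot {
    int64_t start_offset{0};
    // Newly added fields; must default sensibly when absent because the
    // snapshot was written by an older version.
    int64_t last_clean_at{-1};
    int64_t last_dirty_offset{-1};
};

local_snapshot read_snapshot(const buffer& b) {
    local_snapshot s;
    if (b.fields.size() > 0) { s.start_offset = b.fields[0]; }
    if (b.fields.size() > 1) { s.last_clean_at = b.fields[1]; }
    if (b.fields.size() > 2) { s.last_dirty_offset = b.fields[2]; }
    // Old snapshots leave the new fields at their declared defaults.
    return s;
}
```

The essential property is that a short (old-format) payload deserializes without error and the new offsets come back as their defaults, which is the explicit fallback the comment mentions.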
Retry command for Build#65179
please wait until all jobs are finished before running the slash command
/ci-repeat 1
tests/rptest/tests/simple_e2e_test.py::SimpleEndToEndTest.test_consumer_interruption
tests/rptest/tests/controller_snapshot_test.py::ControllerSnapshotTest.test_join_restart_catch_up
tests/rptest/tests/maintenance_test.py::MaintenanceTest.test_maintenance_sticky@{"use_rpk":true}
the failure is https://redpandadata.atlassian.net/browse/CORE-9229
/ci-repeat 1 tests/rptest/tests/simple_e2e_test.py::SimpleEndToEndTest.test_consumer_interruption tests/rptest/tests/controller_snapshot_test.py::ControllerSnapshotTest.test_join_restart_catch_up tests/rptest/tests/maintenance_test.py::MaintenanceTest.test_maintenance_sticky@{"use_rpk":true}
/ci-repeat 1 tests/rptest/tests/simple_e2e_test.py::SimpleEndToEndTest.test_consumer_interruption tests/rptest/tests/controller_snapshot_test.py::ControllerSnapshotTest.test_join_restart_catch_up tests/rptest/tests/maintenance_test.py::MaintenanceTest.test_maintenance_sticky@{"use_rpk":true}
fixture test fix is here (and the problem is not related to the change)
/ci-repeat 1
Retry command for Build#66848
please wait until all jobs are finished before running the slash command
/ci-repeat 1
tests/rptest/tests/partition_force_reconfiguration_test.py::NodeWiseRecoveryTest.test_recovery_local_data_missing
/ci-repeat 1