[CORE-13254] ct: calculate global gc eligible L0 epoch

Open dotnwat opened this issue 2 months ago • 2 comments

Computes the top-level safe-to-delete L0 GC epoch value. There is a very simple ducktape test that verifies that deletes are occurring. I want to get some better testing into the ducktape setup for 25.3.x. I struggled a lot with figuring out how to get a fixture test working, so that's also on the todo list, but it may not really be necessary for 25.3.x. Oren is going to take over getting this across the GA line.

Fixes: https://redpandadata.atlassian.net/browse/CORE-13254
Fixes: https://redpandadata.atlassian.net/browse/CORE-14824

Backports Required

  • [ ] none - not a bug fix
  • [ ] none - this is a backport
  • [ ] none - issue does not exist in previous branches
  • [ ] none - papercut/not impactful enough to backport
  • [x] v25.3.x
  • [ ] v25.2.x
  • [ ] v25.1.x
  • [ ] v24.3.x

Release Notes

  • none

dotnwat avatar Nov 05 '25 01:11 dotnwat

CI test results

test results on build#75616
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
|---|---|---|---|---|---|---|---|---|
| ShadowLinkingReplicationTests | test_auto_prefix_trimming | {"source_cluster_spec": {"cluster_type": "redpanda"}, "with_failures": true} | integration | https://buildkite.com/redpanda/redpanda/builds/75616#019a51ce-6be0-47e6-aece-772084f8160d | FLAKY | 20/21 | upstream reliability is '92.46298788694482'. current run reliability is '95.23809523809523'. drift is -2.77511 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_auto_prefix_trimming |
| ConsumerGroupBalancingTest | test_coordinator_nodes_balance | null | integration | https://buildkite.com/redpanda/redpanda/builds/75616#019a520a-46e7-43b8-9832-ca8dc38fda11 | FLAKY | 20/21 | upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ConsumerGroupBalancingTest&test_method=test_coordinator_nodes_balance |
| MountUnmountIcebergTest | test_simple_remount | {"cloud_storage_type": 1} | integration | https://buildkite.com/redpanda/redpanda/builds/75616#019a520a-46e7-43b8-9832-ca8dc38fda11 | FLAKY | 17/21 | upstream reliability is '97.68421052631578'. current run reliability is '80.95238095238095'. drift is 16.73183 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=MountUnmountIcebergTest&test_method=test_simple_remount |
| SegmentMsTest | test_segment_rolling_with_retention_consumer | null | integration | https://buildkite.com/redpanda/redpanda/builds/75616#019a520a-46e0-4f42-95ac-5a05a7aad866 | FLAKY | 16/21 | upstream reliability is '94.73039215686273'. current run reliability is '76.19047619047619'. drift is 18.53992 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=SegmentMsTest&test_method=test_segment_rolling_with_retention_consumer |
| ShadowLinkingRandomOpsTest | test_node_operations | {"failures": false} | integration | https://buildkite.com/redpanda/redpanda/builds/75616#019a520a-46e5-469d-b45a-9f86c076b39e | FLAKY | 19/21 | upstream reliability is '99.69183359013869'. current run reliability is '90.47619047619048'. drift is 9.21564 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingRandomOpsTest&test_method=test_node_operations |
| ShadowLinkingRandomOpsTest | test_node_operations | {"failures": true} | integration | https://buildkite.com/redpanda/redpanda/builds/75616#019a520a-46e7-43b8-9832-ca8dc38fda11 | FLAKY | 16/21 | upstream reliability is '100.0'. current run reliability is '76.19047619047619'. drift is 23.80952 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingRandomOpsTest&test_method=test_node_operations |
| TxUpgradeCompactionTest | upgrade_with_compaction_test | null | integration | https://buildkite.com/redpanda/redpanda/builds/75616#019a520a-46e4-43cb-92f6-79e8606ebb10 | FLAKY | 20/21 | upstream reliability is '99.3006993006993'. current run reliability is '95.23809523809523'. drift is 4.0626 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TxUpgradeCompactionTest&test_method=upgrade_with_compaction_test |
| src/v/storage/tests/storage_e2e_fixture_test | src/v/storage/tests/storage_e2e_fixture_test | | unit | https://buildkite.com/redpanda/redpanda/builds/75616#019a51b3-efb8-46e8-beee-bb3fc6525973 | FAIL | 0/1 | | |
test results on build#77292
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
|---|---|---|---|---|---|---|---|---|
| PartitionBalancerTest | test_unavailable_nodes | null | integration | https://buildkite.com/redpanda/redpanda/builds/77292#019ae6ba-f6f3-41e8-aef3-6bcc290b5bbd | FLAKY | 20/21 | upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=PartitionBalancerTest&test_method=test_unavailable_nodes |
test results on build#77379
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
|---|---|---|---|---|---|---|---|---|
| ShadowLinkingReplicationTests | test_topic_delete | {"source_cluster_spec": {"cluster_type": "redpanda"}} | integration | https://buildkite.com/redpanda/redpanda/builds/77379#019aecf0-a435-4228-bd79-047f44788f61 | FLAKY | 14/21 | upstream reliability is '100.0'. current run reliability is '66.66666666666666'. drift is 33.33333 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_topic_delete |
| NodesDecommissioningTest | test_decommissioning_rebalancing_node | {"shutdown_decommissioned": false} | integration | https://buildkite.com/redpanda/redpanda/builds/77379#019aecf0-a42f-4f66-97db-0524ba4827e7 | FLAKY | 19/21 | upstream reliability is '93.16239316239316'. current run reliability is '90.47619047619048'. drift is 2.6862 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=NodesDecommissioningTest&test_method=test_decommissioning_rebalancing_node |
test results on build#77607
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
|---|---|---|---|---|---|---|---|---|
| ReplicatedMetastoreTest | TestBasicRemoveTopics | | unit | https://buildkite.com/redpanda/redpanda/builds/77607#019b04f7-e94d-4b1e-9f15-4fa9daac43bd | FAIL | 0/1 | | |
| ReplicatedMetastoreTest | TestBasicRemoveTopics | | unit | https://buildkite.com/redpanda/redpanda/builds/77607#019b056f-20b7-4045-a77f-c2662d2e9e1c | FAIL | 0/1 | | |
| ScalingUpTest | test_fast_node_addition | null | integration | https://buildkite.com/redpanda/redpanda/builds/77607#019b053b-7691-400f-a3da-8fd35afccb96 | FLAKY | 29/31 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.0292, p0=0.5889, reject_threshold=0.0100. adj_baseline=0.0851, p1=0.2632, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ScalingUpTest&test_method=test_fast_node_addition |
test results on build#77631
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
|---|---|---|---|---|---|---|---|---|
| NodesDecommissioningTest | test_decommissioning_rebalancing_node | {"shutdown_decommissioned": false} | integration | https://buildkite.com/redpanda/redpanda/builds/77631#019b067d-b219-4fcc-afc4-1e79670ba2b4 | FLAKY | 12/21 | Test FAILS after retries. Significant increase in flaky rate(baseline=0.1046, p0=0.0006, reject_threshold=0.0100) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=NodesDecommissioningTest&test_method=test_decommissioning_rebalancing_node |
| WriteCachingFailureInjectionTest | test_unavoidable_data_loss | null | integration | https://buildkite.com/redpanda/redpanda/builds/77631#019b067d-b216-4018-a739-ea109bb608db | FLAKY | 18/21 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.0549, p0=0.3012, reject_threshold=0.0100. adj_baseline=0.1558, p1=0.3771, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionTest&test_method=test_unavoidable_data_loss |

vbotbuildovich avatar Nov 05 '25 05:11 vbotbuildovich

Retry command for Build#77631

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/nodes_decommissioning_test.py::NodesDecommissioningTest.test_decommissioning_rebalancing_node@{"shutdown_decommissioned":false}

vbotbuildovich avatar Dec 10 '25 05:12 vbotbuildovich

/backport v25.3.x

vbotbuildovich avatar Dec 10 '25 21:12 vbotbuildovich

@Lazin

> I think we need to check if the snapshot that we created by looking at topic table is actually consistent. The controller offset of the snapshot is the upper bound for the GC epoch. Do we enforce this invariant somewhere?

The belief is that it is consistent because updates to the topics table are themselves commands in the controller log. But I agree, we should have more constraints. One simple thing to do is sample before and after taking the snapshot and make sure it didn't change. Kinda like a sequence lock?
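
For illustration, the before/after check could look roughly like the sketch below. This is not code from the PR: `TopicTable`, `last_applied_offset`, `take_consistent_snapshot`, and the `copy_fn` hook are all hypothetical names, and the sketch assumes the table exposes the controller offset of the last command it applied. The idea is to read that offset, copy the table, read the offset again, and retry if the two reads differ.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class TopicTable:
    """Hypothetical stand-in for the topics table; not Redpanda's class."""
    last_applied_offset: int = 0
    topics: dict = field(default_factory=dict)

    def apply(self, offset: int, name: str, revision: int) -> None:
        # Each update is modeled as a controller log command at `offset`.
        self.topics[name] = revision
        self.last_applied_offset = offset


def take_consistent_snapshot(table, copy_fn=copy.deepcopy, max_attempts=3):
    """Sequence-lock style read: sample the offset before and after the copy.

    `copy_fn` is an assumption of this sketch; it stands in for a copy that
    may be interleaved with updates (for example across scheduling points).
    """
    for _ in range(max_attempts):
        before = table.last_applied_offset
        snap = copy_fn(table.topics)
        after = table.last_applied_offset
        if before == after:
            # No update landed during the copy, so `after` describes `snap`.
            return after, snap
    raise RuntimeError("topic table kept changing while snapshotting")


table = TopicTable()
table.apply(10, "orders", 1)
offset, snap = take_consistent_snapshot(table)
```

The offset returned alongside the snapshot is the value that would serve as the upper bound for the GC epoch, so a caller could refuse to advance the epoch past it. Retrying a bounded number of times and then failing keeps the check cheap while still surfacing a table that never settles.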

dotnwat avatar Dec 11 '25 01:12 dotnwat