Deflake PartitionReassignmentsTest

Open joe-redpanda opened this issue 6 months ago • 6 comments

Backports Required

  • [x] none - not a bug fix
  • [ ] none - this is a backport
  • [ ] none - issue does not exist in previous branches
  • [ ] none - papercut/not impactful enough to backport
  • [ ] v25.1.x
  • [ ] v24.3.x
  • [ ] v24.2.x

Release Notes

Improvements

Deflakes PartitionReassignmentsTest.test_add_partitions_with_inprogress_reassignments

This test was racing the partition balancer to initiate a reassignment on all test partitions. This test only requires that all partitions be currently reassigning to perform its function.

The fix is to let the test recognize and reuse reassignments that are already underway by tolerating REASSIGNMENT_IN_PROGRESS errors from the alter partitions client call.
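A minimal sketch of the idea, not the actual test code: the `AlterResult` type, the string error codes, and `check_alter_results` are hypothetical stand-ins for whatever the Kafka client in the test returns. It only illustrates treating REASSIGNMENT_IN_PROGRESS as acceptable while still failing on any other error.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the client's error codes (assumption, not rptest API).
NONE = "NONE"
REASSIGNMENT_IN_PROGRESS = "REASSIGNMENT_IN_PROGRESS"


@dataclass
class AlterResult:
    """Hypothetical per-partition result of an alter-reassignments call."""
    partition: int
    error: str


def check_alter_results(results, tolerated=(REASSIGNMENT_IN_PROGRESS,)):
    """Return partitions that were already reassigning; raise on other errors."""
    already_moving = []
    for r in results:
        if r.error == NONE:
            # Our own reassignment request was accepted.
            continue
        if r.error in tolerated:
            # Someone else (e.g. the partition balancer) got there first;
            # the partition is reassigning, which is all the test needs.
            already_moving.append(r.partition)
            continue
        raise RuntimeError(f"partition {r.partition}: unexpected error {r.error}")
    return already_moving
```

With this shape the test no longer cares who started the movement, only that every partition is moving or was accepted for movement.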

joe-redpanda avatar Jun 13 '25 00:06 joe-redpanda

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Jun 13 '25 00:06 CLAassistant

Retry command for Build#67297

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/partition_reassignments_test.py::PartitionReassignmentsTest.test_reassignments

vbotbuildovich avatar Jun 13 '25 04:06 vbotbuildovich

CI test results

test results on build#67297

| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PartitionBalancerTest | test_fuzz_admin_ops | | ducktape | https://buildkite.com/redpanda/redpanda/builds/67297#019766ea-b463-40bd-99e1-ecb473ad10e5 | FLAKY | 20/21 | upstream reliability is '96.05263157894737'. current run reliability is '95.23809523809523'. drift is 0.81454 and the allowed drift is set to 50. The test should PASS |
| PartitionReassignmentsTest | test_reassignments | | ducktape | https://buildkite.com/redpanda/redpanda/builds/67297#019766f8-513f-4fe8-bd0f-698724f8feba | FAIL | 0/21 | The test has failed across all retries |
| TopicDeleteCloudStorageTest | drop_lifecycle_marker_test | {"cloud_storage_type": 2} | ducktape | https://buildkite.com/redpanda/redpanda/builds/67297#019766f8-513f-4fe8-bd0f-698724f8feba | FLAKY | 17/21 | upstream reliability is '100.0'. current run reliability is '80.95238095238095'. drift is 19.04762 and the allowed drift is set to 50. The test should PASS |

test results on build#67351

| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MaintenanceTest | test_maintenance_sticky | {"use_rpk": false} | ducktape | https://buildkite.com/redpanda/redpanda/builds/67351#01976ba2-2a18-4079-b234-f8e66b2b1b83 | FLAKY | 19/21 | upstream reliability is '96.46017699115043'. current run reliability is '90.47619047619048'. drift is 5.98399 and the allowed drift is set to 50. The test should PASS |
| RandomNodeOperationsTest | test_node_operations | {"cloud_storage_type": 1, "compaction_mode": "chunked_sliding_window", "enable_failures": true, "mixed_versions": true, "with_iceberg": false} | ducktape | https://buildkite.com/redpanda/redpanda/builds/67351#01976ba2-2a19-4f56-a7af-cdd478250df4 | FLAKY | 20/21 | upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS |
| src/v/crypto/tests/crypto_bench_rpbench_test | src/v/crypto/tests/crypto_bench_rpbench_test | | unit | https://buildkite.com/redpanda/redpanda/builds/67351#01976b6f-5f13-4e0c-abed-c73022a131dc | FAIL | 0/1 | |

test results on build#67411

| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ConsumerOffsetsRecoveryTest | test_consumer_offsets_partition_recovery | | ducktape | https://buildkite.com/redpanda/redpanda/builds/67411#01977ad6-e278-45f0-ae3f-6fd48df40c32 | FLAKY | 19/21 | upstream reliability is '97.5'. current run reliability is '90.47619047619048'. drift is 7.02381 and the allowed drift is set to 50. The test should PASS |
| RaftAvailabilityTest | test_controller_node_isolation | | ducktape | https://buildkite.com/redpanda/redpanda/builds/67411#01977ad6-e279-43c0-82c9-3176367cc5ab | FLAKY | 20/21 | upstream reliability is '94.82758620689656'. current run reliability is '95.23809523809523'. drift is -0.41051 and the allowed drift is set to 50. The test should PASS |
| RandomNodeOperationsTest | test_node_operations | {"cloud_storage_type": 2, "compaction_mode": "sliding_window", "enable_failures": true, "mixed_versions": true, "with_iceberg": false} | ducktape | https://buildkite.com/redpanda/redpanda/builds/67411#01977ad6-e278-45f0-ae3f-6fd48df40c32 | FLAKY | 20/21 | upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS |
| DisablingPartitionsTest | test_disable | | ducktape | https://buildkite.com/redpanda/redpanda/builds/67411#01977af3-7e92-434b-9f59-cdeae76cd812 | FLAKY | 16/21 | upstream reliability is '94.00428265524626'. current run reliability is '76.19047619047619'. drift is 17.81381 and the allowed drift is set to 50. The test should PASS |

test results on build#67577

| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason |
| --- | --- | --- | --- | --- | --- | --- | --- |
| IcebergUsageTest | test_iceberg_usage | {"catalog_type": "rest_hadoop", "cloud_storage_type": 1, "query_engine": "spark"} | ducktape | https://buildkite.com/redpanda/redpanda/builds/67577#0197847b-1fa1-4045-b033-4c4c19cf1e58 | FLAKY | 16/21 | upstream reliability is '84.5'. current run reliability is '76.19047619047619'. drift is 8.30952 and the allowed drift is set to 50. The test should PASS |
| TopicDeleteCloudStorageTest | drop_lifecycle_marker_test | {"cloud_storage_type": 1} | ducktape | https://buildkite.com/redpanda/redpanda/builds/67577#0197847b-1fa1-4045-b033-4c4c19cf1e58 | FLAKY | 20/21 | upstream reliability is '98.09069212410502'. current run reliability is '95.23809523809523'. drift is 2.8526 and the allowed drift is set to 50. The test should PASS |

test results on build#67665

| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason |
| --- | --- | --- | --- | --- | --- | --- | --- |
| IcebergUsageTest | test_iceberg_usage | {"catalog_type": "rest_hadoop", "cloud_storage_type": 1, "query_engine": "spark"} | ducktape | https://buildkite.com/redpanda/redpanda/builds/67665#01978e57-2e83-4401-b216-122d037bc37e | FLAKY | 19/21 | upstream reliability is '85.3035143769968'. current run reliability is '90.47619047619048'. drift is -5.17268 and the allowed drift is set to 50. The test should PASS |
| TxAtomicProduceConsumeTest | test_basic_tx_consumer_transform_produce | {"with_failures": true} | ducktape | https://buildkite.com/redpanda/redpanda/builds/67665#01978e57-2e83-4401-b216-122d037bc37e | FLAKY | 20/21 | upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS |
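
For readers decoding the "reason" column: the reported numbers are consistent with drift being the upstream reliability minus the current run's pass percentage. The sketch below reproduces that arithmetic; the function names are illustrative, and the pass rule (drift no greater than the allowed drift) is an assumption inferred from the report text, not taken from the CI tooling.

```python
ALLOWED_DRIFT = 50.0  # value quoted in the report


def reliability(passed: int, total: int) -> float:
    """Pass rate of the current run, as a percentage."""
    return 100.0 * passed / total


def drift(upstream_reliability: float, passed: int, total: int) -> float:
    """How far the current run falls below the upstream reliability."""
    return upstream_reliability - reliability(passed, total)


def within_allowed_drift(upstream_reliability: float, passed: int, total: int) -> bool:
    # Assumed rule: a flaky test is still accepted while drift <= allowed drift.
    return drift(upstream_reliability, passed, total) <= ALLOWED_DRIFT


# PartitionBalancerTest.test_fuzz_admin_ops on build#67297: 20/21 passed.
d = drift(96.05263157894737, 20, 21)
print(round(d, 5), within_allowed_drift(96.05263157894737, 20, 21))
```

A negative drift, as in the RaftAvailabilityTest row, simply means the current run did better than upstream.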

vbotbuildovich avatar Jun 13 '25 05:06 vbotbuildovich

Since this change merged 10 days ago, the test retries the partition reassignment if it clashes with the partition balancer. From what I can see in one of the test logs, concurrent reassignments stay in progress for 10 seconds. That sounds a bit too long to me, but to be on the safe side I'd give it maybe 30 seconds, as the partition balancer may need to move a partition multiple times to reach a stable state. @ztlpn might give better advice on how long we expect it to take. If it still reproduces, I'd investigate whether it's actually the partition balancer (did you look in the broker logs?) and why it takes so long. If the partition balancer won't stop moving it, then it's a bug, although not the kind of bug this test is checking.

> This test only requires that all partitions be currently reassigning to perform its function.

That's true. However, the partition balancer's move may be almost done when we attempt the manual reassignment and already complete by the time we add partitions. That would make the test fail. Have you tried running it 100-1000 times with your change to see if it is stable?
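
To make the retry suggestion concrete, here is a rough sketch of retrying a manual reassignment against a deadline. `reassign_with_retry` and the `try_reassign` callable are hypothetical, not the helper the test uses; the 30 second default is the figure floated above. The `clock` and `sleep` parameters exist only so the loop can be exercised without real waiting.

```python
import time


def reassign_with_retry(try_reassign, timeout_sec=30.0, backoff_sec=1.0,
                        clock=time.monotonic, sleep=time.sleep):
    """Call try_reassign() until it returns True or the deadline passes.

    try_reassign is a hypothetical callable that returns False while a
    concurrent (e.g. partition balancer) reassignment blocks ours.
    Returns the number of attempts made.
    """
    deadline = clock() + timeout_sec
    attempts = 0
    while True:
        attempts += 1
        if try_reassign():
            return attempts
        if clock() >= deadline:
            raise TimeoutError(
                f"reassignment still blocked after {timeout_sec}s "
                f"({attempts} attempts)")
        sleep(backoff_sec)
```

Note that a loop like this does not close the race described above: a balancer move that finishes between our check and the add-partitions step still leaves the partition idle.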

bashtanov avatar Jun 13 '25 11:06 bashtanov

Please split into appropriately annotated commits as per https://github.com/redpanda-data/redpanda/blob/dev/CONTRIBUTING.md#commit-history

bashtanov avatar Jun 16 '25 08:06 bashtanov

> Please split into appropriately annotated commits as per https://github.com/redpanda-data/redpanda/blob/dev/CONTRIBUTING.md#commit-history

squash merged, should be fixed

joe-redpanda avatar Jun 16 '25 21:06 joe-redpanda

Please prefix the commit message with the area it relates to; in this case that would be tests: ...

mmaslankaprv avatar Jun 17 '25 06:06 mmaslankaprv

/ci-repeat 1 tests/rptest/tests/partition_reassignments_test.py::PartitionReassignmentsTest.test_reassignments

joe-redpanda avatar Jun 17 '25 21:06 joe-redpanda