Deflake PartitionReassignmentsTest
Backports Required
- [x] none - not a bug fix
- [ ] none - this is a backport
- [ ] none - issue does not exist in previous branches
- [ ] none - papercut/not impactful enough to backport
- [ ] v25.1.x
- [ ] v24.3.x
- [ ] v24.2.x
Release Notes
Improvements
Deflakes PartitionReassignmentsTest.test_add_partitions_with_inprogress_reassignments
This test was racing the partition balancer to initiate a reassignment on all test partitions. This test only requires that all partitions be currently reassigning to perform its function.
The fix is to allow the test to recognize and use prior reassignments by permitting REASSIGNMENT_IN_PROGRESS errors in the alter partitions client call.
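A minimal sketch of that idea, under stated assumptions: `start_reassignments`, `ReassignmentError`, and the `alter` callable are hypothetical stand-ins for illustration, not the actual rptest or Kafka client API. Only the error name `REASSIGNMENT_IN_PROGRESS` comes from the description above.

```python
from typing import Callable, Dict, Iterable

# Error names tolerated when starting a reassignment. The name comes from the
# PR description; everything else in this sketch is hypothetical.
TOLERATED_ERRORS = frozenset({"REASSIGNMENT_IN_PROGRESS"})


class ReassignmentError(Exception):
    """Hypothetical error raised by the illustrative client call."""

    def __init__(self, code: str):
        super().__init__(code)
        self.code = code


def start_reassignments(
    alter: Callable[[int], None],
    partitions: Iterable[int],
) -> Dict[int, str]:
    """Try to start a reassignment on every partition.

    Returns a map of partition -> "started" or "already_in_progress".
    Any error other than the tolerated ones is re-raised.
    """
    outcome: Dict[int, str] = {}
    for p in partitions:
        try:
            alter(p)
            outcome[p] = "started"
        except ReassignmentError as e:
            if e.code not in TOLERATED_ERRORS:
                raise
            # A prior reassignment (for example one started by the partition
            # balancer) is already moving this partition, which is all the
            # test needs.
            outcome[p] = "already_in_progress"
    return outcome
```

With this shape, a partition that the balancer picked up first is counted as reassigning instead of failing the test setup, while any unrelated error still surfaces.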
Retry command for Build#67297
please wait until all jobs are finished before running the slash command
/ci-repeat 1 tests/rptest/tests/partition_reassignments_test.py::PartitionReassignmentsTest.test_reassignments
CI test results
test results on build#67297
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason |
|---|---|---|---|---|---|---|---|
| PartitionBalancerTest | test_fuzz_admin_ops | | ducktape | https://buildkite.com/redpanda/redpanda/builds/67297#019766ea-b463-40bd-99e1-ecb473ad10e5 | FLAKY | 20/21 | upstream reliability is '96.05263157894737'. current run reliability is '95.23809523809523'. drift is 0.81454 and the allowed drift is set to 50. The test should PASS |
| PartitionReassignmentsTest | test_reassignments | | ducktape | https://buildkite.com/redpanda/redpanda/builds/67297#019766f8-513f-4fe8-bd0f-698724f8feba | FAIL | 0/21 | The test has failed across all retries |
| TopicDeleteCloudStorageTest | drop_lifecycle_marker_test | {"cloud_storage_type": 2} | ducktape | https://buildkite.com/redpanda/redpanda/builds/67297#019766f8-513f-4fe8-bd0f-698724f8feba | FLAKY | 17/21 | upstream reliability is '100.0'. current run reliability is '80.95238095238095'. drift is 19.04762 and the allowed drift is set to 50. The test should PASS |
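The reliability and drift figures in the reason column appear to follow simple arithmetic. The sketch below reproduces the numbers reported for test_fuzz_admin_ops. The formulas are inferred from the reported values, not taken from the CI tooling, and the function names are illustrative.

```python
def reliability(passed: int, total: int) -> float:
    """Pass percentage for a set of retries (inferred formula)."""
    return 100.0 * passed / total


def drift(upstream: float, current: float) -> float:
    """Upstream reliability minus current-run reliability (inferred formula)."""
    return upstream - current


current = reliability(20, 21)
print(round(current, 5))                            # 95.2381
print(round(drift(96.05263157894737, current), 5))  # 0.81454
```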
test results on build#67351
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason |
|---|---|---|---|---|---|---|---|
| MaintenanceTest | test_maintenance_sticky | {"use_rpk": false} | ducktape | https://buildkite.com/redpanda/redpanda/builds/67351#01976ba2-2a18-4079-b234-f8e66b2b1b83 | FLAKY | 19/21 | upstream reliability is '96.46017699115043'. current run reliability is '90.47619047619048'. drift is 5.98399 and the allowed drift is set to 50. The test should PASS |
| RandomNodeOperationsTest | test_node_operations | {"cloud_storage_type": 1, "compaction_mode": "chunked_sliding_window", "enable_failures": true, "mixed_versions": true, "with_iceberg": false} | ducktape | https://buildkite.com/redpanda/redpanda/builds/67351#01976ba2-2a19-4f56-a7af-cdd478250df4 | FLAKY | 20/21 | upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS |
| src/v/crypto/tests/crypto_bench_rpbench_test | src/v/crypto/tests/crypto_bench_rpbench_test | | unit | https://buildkite.com/redpanda/redpanda/builds/67351#01976b6f-5f13-4e0c-abed-c73022a131dc | FAIL | 0/1 | |
test results on build#67411
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason |
|---|---|---|---|---|---|---|---|
| ConsumerOffsetsRecoveryTest | test_consumer_offsets_partition_recovery | | ducktape | https://buildkite.com/redpanda/redpanda/builds/67411#01977ad6-e278-45f0-ae3f-6fd48df40c32 | FLAKY | 19/21 | upstream reliability is '97.5'. current run reliability is '90.47619047619048'. drift is 7.02381 and the allowed drift is set to 50. The test should PASS |
| RaftAvailabilityTest | test_controller_node_isolation | | ducktape | https://buildkite.com/redpanda/redpanda/builds/67411#01977ad6-e279-43c0-82c9-3176367cc5ab | FLAKY | 20/21 | upstream reliability is '94.82758620689656'. current run reliability is '95.23809523809523'. drift is -0.41051 and the allowed drift is set to 50. The test should PASS |
| RandomNodeOperationsTest | test_node_operations | {"cloud_storage_type": 2, "compaction_mode": "sliding_window", "enable_failures": true, "mixed_versions": true, "with_iceberg": false} | ducktape | https://buildkite.com/redpanda/redpanda/builds/67411#01977ad6-e278-45f0-ae3f-6fd48df40c32 | FLAKY | 20/21 | upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS |
| DisablingPartitionsTest | test_disable | | ducktape | https://buildkite.com/redpanda/redpanda/builds/67411#01977af3-7e92-434b-9f59-cdeae76cd812 | FLAKY | 16/21 | upstream reliability is '94.00428265524626'. current run reliability is '76.19047619047619'. drift is 17.81381 and the allowed drift is set to 50. The test should PASS |
test results on build#67577
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason |
|---|---|---|---|---|---|---|---|
| IcebergUsageTest | test_iceberg_usage | {"catalog_type": "rest_hadoop", "cloud_storage_type": 1, "query_engine": "spark"} | ducktape | https://buildkite.com/redpanda/redpanda/builds/67577#0197847b-1fa1-4045-b033-4c4c19cf1e58 | FLAKY | 16/21 | upstream reliability is '84.5'. current run reliability is '76.19047619047619'. drift is 8.30952 and the allowed drift is set to 50. The test should PASS |
| TopicDeleteCloudStorageTest | drop_lifecycle_marker_test | {"cloud_storage_type": 1} | ducktape | https://buildkite.com/redpanda/redpanda/builds/67577#0197847b-1fa1-4045-b033-4c4c19cf1e58 | FLAKY | 20/21 | upstream reliability is '98.09069212410502'. current run reliability is '95.23809523809523'. drift is 2.8526 and the allowed drift is set to 50. The test should PASS |
test results on build#67665
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason |
|---|---|---|---|---|---|---|---|
| IcebergUsageTest | test_iceberg_usage | {"catalog_type": "rest_hadoop", "cloud_storage_type": 1, "query_engine": "spark"} | ducktape | https://buildkite.com/redpanda/redpanda/builds/67665#01978e57-2e83-4401-b216-122d037bc37e | FLAKY | 19/21 | upstream reliability is '85.3035143769968'. current run reliability is '90.47619047619048'. drift is -5.17268 and the allowed drift is set to 50. The test should PASS |
| TxAtomicProduceConsumeTest | test_basic_tx_consumer_transform_produce | {"with_failures": true} | ducktape | https://buildkite.com/redpanda/redpanda/builds/67665#01978e57-2e83-4401-b216-122d037bc37e | FLAKY | 20/21 | upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS |
After this change (merged 10 days ago), the test retries the partition reassignment if it clashes with the partition balancer. From what I can see in one of the test logs, concurrent reassignments were in progress for 10 seconds. That sounds a bit too long to me, but to be on the safe side I'd give it maybe 30 seconds, as the partition balancer may need to move a partition multiple times to reach a stable state. @ztlpn might give better advice on how long we expect it to take.
If it still reproduces, I'd investigate whether it is actually the partition balancer (did you look in the broker logs?) and why it takes so long. If the partition balancer won't stop moving partitions, then that is a bug, although not the kind of bug this test is checking for.
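As a hedged sketch of the retry-with-timeout shape described above (not the code that was merged): `retry_reassignment` and the `attempt` callable are hypothetical names, and the 30-second default simply mirrors the suggestion in this comment.

```python
import time
from typing import Callable


def retry_reassignment(
    attempt: Callable[[], bool],
    timeout_sec: float = 30.0,
    backoff_sec: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `attempt` until it returns True or the timeout elapses.

    `attempt` is a hypothetical callable that tries the manual reassignment
    and returns False when it clashes with a concurrent reassignment.
    Returns True on success and False if the deadline passed first.
    """
    deadline = clock() + timeout_sec
    while True:
        if attempt():
            return True
        if clock() >= deadline:
            return False
        sleep(backoff_sec)
```

Injecting `clock` and `sleep` keeps the helper testable without real waiting. A test could then fail with a clear message when the helper returns False, which would point at the balancer not settling within the allowed window.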
> This test only requires that all partitions be currently reassigning to perform its function.
That's true. However, the partition balancer's action may be almost done when we attempt the manual reassignment and already complete by the time we add partitions, which would make the test fail. Have you tried running it 100-1000 times with your change to see if it is stable?
Please split into appropriately annotated commits as per https://github.com/redpanda-data/redpanda/blob/dev/CONTRIBUTING.md#commit-history
squash merged, should be fixed
Please prefix the commit message with the area it relates to; in this case it would be tests: ...
/ci-repeat 1 tests/rptest/tests/partition_reassignments_test.py::PartitionReassignmentsTest.test_reassignments