redpanda
Reduce duration of partition_movement_test from 25min to 8min
Cover letter
- reduces duration of partition_movement_test from 25min to 8min
- introduces a workload generator / consistency checker as an alternative to the verifiable producer / consumers
The new verifier is based on the verifiers we use in chaos tests. The key features of this verifier:
- It's an app using a client instead of being a CLI kafka tool running in a bash loop (KafProducer) which results in more organic usage pattern.
- Low distance between kafka client and the app: app->driver vs app->kgo->driver. The lower distance means it's easier to debug when something goes wrong.
- It uses the most popular client (easier to distinguish client issue vs redpanda issue).
- It detects out of order and duplicated messages
  - It already detected this problem https://github.com/redpanda-data/redpanda/pull/5356
- It treats timeouts as indecisive errors (not subject to #4702)
- It's an online verifier which uses end-to-end validation with a bounded amount of RAM
- It exposes http interface to control the workload and query statistics
  - without bidirectional communication it's impossible to establish causality and to check progress at a given point of time
  - http interface helps with clean shutdowns to avoid #4326
- It has an error log which is pulled after the test is over (to help with debugging the failing tests)
- We don't adopt external code (VerifiableProducer) so no licensing questions
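To illustrate the "detects out of order and duplicated messages" and "bounded amount of RAM" points, here is a minimal sketch of an online check. It is not the verifier's actual code: the class name `SequenceChecker` and the assumption that the producer stamps every message with a per-partition, monotonically increasing sequence number are hypothetical. Memory stays bounded because only one integer is kept per partition.

```python
from typing import Dict


class SequenceChecker:
    """Hypothetical online checker: keeps only the last sequence number per partition."""

    def __init__(self) -> None:
        self.last_seen: Dict[int, int] = {}
        self.duplicates = 0
        self.out_of_order = 0

    def observe(self, partition: int, seq: int) -> str:
        """Classify one consumed message and update the counters."""
        last = self.last_seen.get(partition, -1)
        if seq == last:
            # Same message delivered twice in a row.
            self.duplicates += 1
            return "duplicate"
        if seq < last:
            # An older message arrived after a newer one. A late re-delivery
            # also lands here, because only the last value is remembered.
            self.out_of_order += 1
            return "out_of_order"
        self.last_seen[partition] = seq
        return "ok"
```

A consumer loop would call `observe(partition, seq)` for every record and report the counters through whatever statistics endpoint the verifier exposes.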
Release notes
- none
@rystsov What's the status on this PR? Does it need review or are we waiting for some changes?
@NyaliaLui I've addressed the comments and it waits for review
Most of the changes focused on the test itself are fine, but I worry that we are repeating the same mistake of adding new workloads rather than extending/adapting existing ones.
@jcsp The new verifier follows a different approach to testing so extending the existing verifiers would require more work. Once this PR is merged I'll push for giving up on VerifiableProducer, VerifiableConsumer and KafProducer. I've described the key features of this verifier in the cover letter.
ci-failures:
- https://github.com/redpanda-data/redpanda/issues/4772
- https://github.com/redpanda-data/redpanda/issues/5589
- https://github.com/redpanda-data/redpanda/issues/5591
@rystsov looks like a few merge conflicts
BLOCKER: PartitionMoveInterruption.test_cancelling_partition_move https://github.com/redpanda-data/redpanda/issues/6167
EndToEndShadowIndexingTestWithDisruptions.test_write_with_node_failures - https://github.com/redpanda-data/redpanda/issues/4639
PrefixTruncateRecoveryUpgradeTest.test_recover_during_upgrade https://github.com/redpanda-data/redpanda/issues/5589
BLOCKER: PartitionBalancerTest.test_unavailable_nodes https://github.com/redpanda-data/redpanda/issues/6176
Y'all, the PR was ready to merge a while ago. I fixed the conflicts and it's ready to go. IMHO we may make an exception and let it go in despite the blockers. They aren't caused by the PR, and the PR introduces a verifier to solve the "Consumer failed to consume up to offsets" ci-failure. What do you think?
But it's not clear why "Reduce duration of partition_movement_test from 25min to 8min" is true. Is there some insight into the time savings?
It was true back in summer, but now the benefits should be smaller. The major boost came from optimizing wait_until, which has since been put in a separate PR. The second best boost came from using logic like:
- start continuous workload
- await n updates
- perform partition movement
- await n updates
- stop the workload
instead of:
- start workload & schedule N operations (where N is big enough)
- perform partition movement
- wait until the workload is stopped
I'll change the description to avoid the confusion
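The first pattern above can be sketched roughly as follows. This is a self-contained simulation, not the test's real code: `ContinuousWorkload`, `await_progress` and `run_with_movement` are hypothetical names, and a background thread incrementing a counter stands in for the verifier's producers and consumers. The point is that the test waits only for n fresh operations on each side of the movement instead of waiting for a large fixed N to drain.

```python
import threading
import time


class ContinuousWorkload:
    """Hypothetical stand-in for the verifier: counts operations until stopped."""

    def __init__(self):
        self._ops = 0
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            with self._lock:
                self._ops += 1
            time.sleep(0.001)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def ops(self):
        with self._lock:
            return self._ops

    def await_progress(self, n, timeout_s=5.0):
        """Block until n more operations have completed, or raise TimeoutError."""
        target = self.ops() + n
        deadline = time.monotonic() + timeout_s
        while self.ops() < target:
            if time.monotonic() > deadline:
                raise TimeoutError(f"workload did not reach {target} operations")
            time.sleep(0.005)


def run_with_movement(move_partition, n=10):
    """Start workload, await n updates, move, await n updates, stop."""
    workload = ContinuousWorkload()
    workload.start()
    try:
        workload.await_progress(n)
        before = workload.ops()
        move_partition()  # caller-supplied action, e.g. a partition movement
        workload.await_progress(n)
        after = workload.ops()
    finally:
        workload.stop()
    return before, after
```

With this shape the test duration is driven by how quickly the workload makes progress around the movement, not by a pre-scheduled operation count chosen to be "big enough".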
it was true back in summer but now the benefits should be smaller, the major boost came from optimizing wait_until
ahh thanks. i remember that wait_until pr now. thanks!