redpanda Reduce duration of partition_movement

Reduce duration of partition_movement_test from 25min to 8min

Open rystsov opened this issue 2 years ago • 5 comments

Cover letter

reduces duration of partition_movement_test from 25min to 8min
introduce a workload generator / consistency checker to verified producer / consumers

The new verifier is based on the verifiers we use in chaos tests. The key features of this verifier:

It's an app using a client instead of being a CLI kafka tool running in a bash loop (KafProducer) which results in more organic usage pattern.
Low distance between kafka client and the app: app->driver vs app->kgo->driver. The lower distance means it's easier to debug when something goes wrong.
It uses the most popular client (easier to distinguish client issue vs redpanda issue).
It detects out of order and duplicated messages
It already detected this problem https://github.com/redpanda-data/redpanda/pull/5356
It treats timeouts as indecisive errors (not subject to #4702)
It's an online verifier which uses end-to-end validation with a bounded amount of RAM
It exposes http interface to control the workload and query statistics
- without bidirectional communication it's impossible to establish causality and to check progress at a given point of time
- http interface helps with clean shutdowns to avoid #4326
It has an error log which is pulled after the test is over (to help with debugging the failing tests)
We don't adopt external code (VerifiableProducer) so no licensing questions

Release notes

none

Jun 27 '22 02:06 rystsov

@rystsov What's the status on this PR? Does it need review or are we waiting for some changes?

Jul 14 '22 18:07 NyaliaLui

@NyaliaLui I've addressed the comments and it waits for review

Jul 14 '22 19:07 rystsov

Most of the changes focused on the test itself are fine, but I worry that we are repeating the same mistake of adding new workloads rather than extending/adapting existing ones.

@jcsp The new verifier follows a different approach to testing so extending the existing verifiers would require more work. Once this PR is merged I'll push for giving up on VerifiableProducer, VerifiableConsumer and KafProducer. I've described the key features of this verifier in the cover letter.

Jul 14 '22 20:07 rystsov

ci-failures:

https://github.com/redpanda-data/redpanda/issues/4772
https://github.com/redpanda-data/redpanda/issues/5589
https://github.com/redpanda-data/redpanda/issues/5591

Jul 22 '22 16:07 rystsov

@rystsov looks like a few merge conflicts

Jul 30 '22 20:07 dotnwat

BLOCKER: PartitionMoveInterruption.test_cancelling_partition_move https://github.com/redpanda-data/redpanda/issues/6167

Aug 22 '22 19:08 rystsov

EndToEndShadowIndexingTestWithDisruptions.test_write_with_node_failures - https://github.com/redpanda-data/redpanda/issues/4639

Aug 22 '22 19:08 rystsov

PrefixTruncateRecoveryUpgradeTest.test_recover_during_upgrade https://github.com/redpanda-data/redpanda/issues/5589

Aug 22 '22 19:08 rystsov

BLOCKER: PartitionBalancerTest.test_unavailable_nodes https://github.com/redpanda-data/redpanda/issues/6176

Aug 22 '22 23:08 rystsov

Y'all the PR was ready to merge a while ago, I fixed the conflicts and it's ready to go. IMHO we may do an exception and let it go in despite the blockers. They aren't caused by the PR and the PR introduces a verifier to solve the "Consumer failed to consume up to offsets" ci-failure. What do you think?

Aug 23 '22 05:08 rystsov

but it's not clear why "Reduce duration of partition_movement_test from 25min to 8min" this is true. Is there some insight into the time savings?

it was true back in summer but now the benefits should be smaller, the major boost came from optimizing wait_until which was put in a separate PR since them the second best boost came from using logic like:

start continuous workload
await n updates
perform partition movement
await n updates
stop the workload

instead of:

start workload & schedule N operations (where N is big enough)
perform partition movement
wait until the workload is stopped

I'll change the description to avoid the confusion

Dec 21 '22 01:12 rystsov

it was true back in summer but now the benefits should be smaller, the major boost came from optimizing wait_until

ahh thanks. i remember that wait_until pr now. thanks!

Dec 24 '22 22:12 dotnwat

redpanda redpanda copied to clipboard

Reduce duration of partition_movement_test from 25min to 8min

Cover letter

Release notes

redpanda
redpanda copied to clipboard