redpanda icon indicating copy to clipboard operation
redpanda copied to clipboard

Reduce duration of partition_movement_test from 25min to 8min

Open rystsov opened this issue 2 years ago • 5 comments

Cover letter

  • reduces duration of partition_movement_test from 25min to 8min
  • introduce a workload generator / consistency checker to verified producer / consumers

The new verifier is based on the verifiers we use in chaos tests. The key features of this verifier:

  • It's an app using a client instead of being a CLI kafka tool running in a bash loop (KafProducer) which results in more organic usage pattern.
  • Low distance between kafka client and the app: app->driver vs app->kgo->driver. The lower distance means it's easier to debug when something goes wrong.
  • It uses the most popular client (easier to distinguish client issue vs redpanda issue).
  • It detects out of order and duplicated messages
  • It already detected this problem https://github.com/redpanda-data/redpanda/pull/5356
  • It treats timeouts as indecisive errors (not subject to #4702)
  • It's an online verifier which uses end-to-end validation with a bounded amount of RAM
  • It exposes http interface to control the workload and query statistics
    • without bidirectional communication it's impossible to establish causality and to check progress at a given point of time
    • http interface helps with clean shutdowns to avoid #4326
  • It has an error log which is pulled after the test is over (to help with debugging the failing tests)
  • We don't adopt external code (VerifiableProducer) so no licensing questions

Release notes

  • none

rystsov avatar Jun 27 '22 02:06 rystsov

@rystsov What's the status on this PR? Does it need review or are we waiting for some changes?

NyaliaLui avatar Jul 14 '22 18:07 NyaliaLui

@NyaliaLui I've addressed the comments and it waits for review

rystsov avatar Jul 14 '22 19:07 rystsov

Most of the changes focused on the test itself are fine, but I worry that we are repeating the same mistake of adding new workloads rather than extending/adapting existing ones.

@jcsp The new verifier follows a different approach to testing so extending the existing verifiers would require more work. Once this PR is merged I'll push for giving up on VerifiableProducer, VerifiableConsumer and KafProducer. I've described the key features of this verifier in the cover letter.

rystsov avatar Jul 14 '22 20:07 rystsov

ci-failures:

  • https://github.com/redpanda-data/redpanda/issues/4772
  • https://github.com/redpanda-data/redpanda/issues/5589
  • https://github.com/redpanda-data/redpanda/issues/5591

rystsov avatar Jul 22 '22 16:07 rystsov

@rystsov looks like a few merge conflicts

dotnwat avatar Jul 30 '22 20:07 dotnwat

BLOCKER: PartitionMoveInterruption.test_cancelling_partition_move https://github.com/redpanda-data/redpanda/issues/6167

rystsov avatar Aug 22 '22 19:08 rystsov

EndToEndShadowIndexingTestWithDisruptions.test_write_with_node_failures - https://github.com/redpanda-data/redpanda/issues/4639

rystsov avatar Aug 22 '22 19:08 rystsov

PrefixTruncateRecoveryUpgradeTest.test_recover_during_upgrade https://github.com/redpanda-data/redpanda/issues/5589

rystsov avatar Aug 22 '22 19:08 rystsov

BLOCKER: PartitionBalancerTest.test_unavailable_nodes https://github.com/redpanda-data/redpanda/issues/6176

rystsov avatar Aug 22 '22 23:08 rystsov

Y'all the PR was ready to merge a while ago, I fixed the conflicts and it's ready to go. IMHO we may do an exception and let it go in despite the blockers. They aren't caused by the PR and the PR introduces a verifier to solve the "Consumer failed to consume up to offsets" ci-failure. What do you think?

rystsov avatar Aug 23 '22 05:08 rystsov

but it's not clear why "Reduce duration of partition_movement_test from 25min to 8min" this is true. Is there some insight into the time savings?

it was true back in summer but now the benefits should be smaller, the major boost came from optimizing wait_until which was put in a separate PR since them the second best boost came from using logic like:

  • start continuous workload
  • await n updates
  • perform partition movement
  • await n updates
  • stop the workload

instead of:

  • start workload & schedule N operations (where N is big enough)
  • perform partition movement
  • wait until the workload is stopped

I'll change the description to avoid the confusion

rystsov avatar Dec 21 '22 01:12 rystsov

it was true back in summer but now the benefits should be smaller, the major boost came from optimizing wait_until

ahh thanks. i remember that wait_until pr now. thanks!

dotnwat avatar Dec 24 '22 22:12 dotnwat