redpanda PESDLC-1025 Retry when spikes occur in OMB

Detect latency spikes that are likely caused by the underlying disks going through 300-1000 ms periods of greatly reduced throughput, and retry the test up to R times.

Backports Required

[x] none - not a bug fix
[ ] none - this is a backport
[ ] none - issue does not exist in previous branches
[ ] none - papercut/not impactful enough to backport
[ ] v23.3.x
[ ] v23.2.x

Release Notes

none

Improvements

The issue is that the underlying disks go through periods of 300-1000 ms where they have greatly reduced throughput, and this produces a latency spike. This may eventually be solved by https://github.com/redpanda-data/core-internal/issues/1142, but in the meantime we should implement a workaround so that the tests still give a reasonable signal without failing spuriously because of this issue.

My suggestion is that if a test fails due to a p99/p999 threshold breach, we check whether it looks like it has spikes, using a simple heuristic based on the time-series percentiles (see e.g. https://github.com/redpanda-data/core-internal/issues/1016#issuecomment-1939676486), and if it has, say, 1-3 spikes we just retry, up to R times.

The primary goals:

Avoid the noise from these spikes, which prevents us from keeping the test enabled because it is too flaky.
Still catch regressions that are not of a "spike" nature but where the latency is bad during the whole run or a large portion of it.
Only allow spikes like the ones we are seeing to trigger a retry, i.e., not too many of them, since then we could miss some change in RP which causes frequent spikes for a HW reason.
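
As a rough illustration of the heuristic and goals above, here is a minimal sketch in Python. It is not the code in this PR: the function names, the `max_breach_fraction` parameter and its default are assumptions made for the example. It treats each contiguous run of above-threshold samples in a per-interval percentile series as one spike, and only calls a run "spiky" if there are 1-3 such runs and they cover a small part of the series.

```python
from typing import Sequence


def count_spikes(series_ms: Sequence[float], threshold_ms: float) -> int:
    """Count contiguous runs of samples above threshold_ms (one run = one spike)."""
    spikes = 0
    in_spike = False
    for value in series_ms:
        above = value > threshold_ms
        if above and not in_spike:
            spikes += 1  # a new run of above-threshold samples starts here
        in_spike = above
    return spikes


def looks_like_disk_spikes(
    series_ms: Sequence[float],
    threshold_ms: float,
    max_spikes: int = 3,                # "1-3 spikes" from the description
    max_breach_fraction: float = 0.1,   # assumed cutoff, not from the PR
) -> bool:
    """Return True if the breach looks like a few short spikes, not a sustained regression."""
    if not series_ms:
        return False
    spikes = count_spikes(series_ms, threshold_ms)
    breached = sum(1 for value in series_ms if value > threshold_ms)
    few_spikes = 1 <= spikes <= max_spikes
    mostly_healthy = breached / len(series_ms) <= max_breach_fraction
    return few_spikes and mostly_healthy
```

With this shape, a run that is slow for its whole duration counts as a single long "spike" but fails the breach-fraction check, so it would not be retried, and a run with many short spikes fails the `max_spikes` check.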

More details are at: https://github.com/redpanda-data/core-internal/issues/1180

This is a DRAFT that uses mocked data from previous runs with latency problems. It still needs to use actual data and undergo more testing.

Mar 28 '24 01:03 rpdevmp

Created a new decorator, @with_spike_detection_retry(), so it can be reused with different tests; the number of retries can be specified and defaults to 3.
Added this decorator and the retry logic to "test_max_partitions"; as a next step we will add it to other tests as well.
If we detect a latency threshold breach, we check for spikes and retry.
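
For readers unfamiliar with the pattern, a retry decorator of this kind could look roughly like the sketch below. This is an illustration only, not the implementation in this PR: the `retries` parameter name, the `LatencySpikeError` exception, and the choice to interpret "retries" as extra attempts after the first run are all assumptions.

```python
import functools


class LatencySpikeError(AssertionError):
    """Hypothetical error: a latency threshold was breached and the run looks spiky."""


def with_spike_detection_retry(retries: int = 3):
    """Re-run the wrapped test when it fails with LatencySpikeError.

    Assumption: `retries` counts extra attempts after the first run, so the
    test executes at most retries + 1 times. Any other failure is not retried.
    """
    def decorator(test_fn):
        @functools.wraps(test_fn)
        def wrapper(*args, **kwargs):
            for attempt in range(retries + 1):
                try:
                    return test_fn(*args, **kwargs)
                except LatencySpikeError:
                    if attempt == retries:
                        raise  # retries exhausted: report the failure
        return wrapper
    return decorator
```

In this sketch the test body is responsible for raising `LatencySpikeError` only when the spike check passes; a sustained latency regression would raise an ordinary failure and so fail immediately without a retry.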

Apr 02 '24 05:04 rpdevmp

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/47192#018e9da0-836b-4095-93dc-bb860933c7a6

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/47580#018ec50a-e4aa-49a8-8048-da07d7a7abfb

Apr 02 '24 08:04 vbotbuildovich

new failures in https://buildkite.com/redpanda/redpanda/builds/47192#018e9da7-cb92-44ab-983a-3ab0d226edc1:

"rptest.tests.simple_e2e_test.SimpleEndToEndTest.test_relaxed_acks.write_caching=True"

new failures in https://buildkite.com/redpanda/redpanda/builds/47367#018ea7b7-527e-4966-b96a-c7d924042ba1:

"rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_decommissioning_rebalancing_node.shutdown_decommissioned=True"

new failures in https://buildkite.com/redpanda/redpanda/builds/47367#018ea7c1-1348-4633-9f31-a800461c5f4c:

"rptest.tests.cloud_storage_timing_stress_test.CloudStorageTimingStressTest.test_cloud_storage_with_partition_moves.cleanup_policy=delete"

new failures in https://buildkite.com/redpanda/redpanda/builds/48410#018f290b-6092-4ad3-9b0b-1c068bd16aa0:

"rptest.tests.upgrade_test.UpgradeBackToBackTest.test_upgrade_with_all_workloads.single_upgrade=False"

Apr 02 '24 08:04 vbotbuildovich

/ci-repeat 1

Apr 16 '24 20:04 travisdowns
