redpanda
PESDLC-1025 Retry when spikes occur in OMB
Detect latency spikes that may be caused by the underlying disks going through 300-1000 ms periods of reduced throughput, and retry the test up to R times.
Backports Required
- [x] none - not a bug fix
- [ ] none - this is a backport
- [ ] none - issue does not exist in previous branches
- [ ] none - papercut/not impactful enough to backport
- [ ] v23.3.x
- [ ] v23.2.x
Release Notes
- none
Improvements
- The issue is that the underlying disks go through periods of 300-1000 ms in which they have greatly reduced throughput, and this produces a latency spike. This may eventually be solved by https://github.com/redpanda-data/core-internal/issues/1142, but in the meantime we should implement a workaround so that the tests give a reasonable signal without failing spuriously because of this issue.
My suggestion is that if a test fails due to a p99/p999 threshold breach, we check whether the run looks like it has spikes, using a simple heuristic based on the time-series percentiles (see e.g. https://github.com/redpanda-data/core-internal/issues/1016#issuecomment-1939676486), and if it has, say, 1-3 spikes we just retry, up to R times.
The primary goals:
- Avoid the noise from these spikes, which prevents us from keeping the test enabled because it is too flaky.
- Still catch regressions that are not of a "spike" nature but where the latency is bad during the whole run or a large portion of it.
- Only allow spikes like the ones we are seeing to trigger a retry, i.e. not too many of them, since then we could miss some change in RP which causes frequent spikes for a HW reason.
More details are at: https://github.com/redpanda-data/core-internal/issues/1180
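The heuristic described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: the function names, the `max_fraction_above` cutoff, and the input format (one p99 sample per reporting interval, in ms) are all hypothetical. A run counts as "spiky" when it has 1-3 contiguous bursts above the threshold and only a small share of samples above it, so a sustained regression is not retried.

```python
from typing import Sequence


def count_spikes(p99_ms: Sequence[float], threshold_ms: float) -> int:
    """Count contiguous runs of samples above threshold_ms."""
    spikes = 0
    in_spike = False
    for sample in p99_ms:
        above = sample > threshold_ms
        if above and not in_spike:
            spikes += 1  # a new burst starts here
        in_spike = above
    return spikes


def looks_like_disk_spikes(
    p99_ms: Sequence[float],
    threshold_ms: float,
    max_spikes: int = 3,               # hypothetical default ("say 1-3 spikes")
    max_fraction_above: float = 0.1,   # hypothetical cutoff for "sustained"
) -> bool:
    """Return True if the breach looks like a few short spikes."""
    if not p99_ms:
        return False
    spikes = count_spikes(p99_ms, threshold_ms)
    above = sum(1 for sample in p99_ms if sample > threshold_ms)
    few_spikes = 1 <= spikes <= max_spikes
    mostly_healthy = above / len(p99_ms) <= max_fraction_above
    return few_spikes and mostly_healthy
```

With these assumed defaults, a 40-sample series containing two short bursts is classified as spiky, while a series that is above the threshold for the whole run, or one that alternates above and below it, is not.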
This is a DRAFT that uses mocked data from previous runs with high latency. It still needs to be run against actual data and tested further.
- Created a new decorator, @with_spike_detection_retry(), so the logic can be reused across different tests; the number of retries can be specified and defaults to 3
- Added this decorator and the retry logic to "test_max_partitions"; as a next step we will add it to other tests as well
- If a test fails on a latency threshold breach, we check for spikes and retry
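For readers unfamiliar with the pattern, a retry decorator of this kind might look like the minimal sketch below. Only the decorator name and the default of 3 retries come from this description; the `max_retries` parameter name, the `LatencyThresholdError` exception, and the `is_spiky` helper are hypothetical stand-ins, and the real implementation in the PR may be structured differently.

```python
import functools
from typing import Callable, Sequence


class LatencyThresholdError(Exception):
    """Hypothetical: raised by a test when p99/p999 breaches its threshold."""

    def __init__(self, message: str, p99_ms: Sequence[float], threshold_ms: float):
        super().__init__(message)
        self.p99_ms = list(p99_ms)
        self.threshold_ms = threshold_ms


def is_spiky(p99_ms: Sequence[float], threshold_ms: float,
             max_spikes: int = 3, max_fraction_above: float = 0.1) -> bool:
    """Hypothetical heuristic: a few short bursts, not a sustained breach."""
    spikes, in_spike, above_count = 0, False, 0
    for sample in p99_ms:
        above = sample > threshold_ms
        above_count += above
        if above and not in_spike:
            spikes += 1
        in_spike = above
    if not p99_ms:
        return False
    return 1 <= spikes <= max_spikes and above_count / len(p99_ms) <= max_fraction_above


def with_spike_detection_retry(max_retries: int = 3) -> Callable:
    """Retry the wrapped test up to max_retries times on spike-like failures."""

    def decorator(test_fn: Callable) -> Callable:
        @functools.wraps(test_fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return test_fn(*args, **kwargs)
                except LatencyThresholdError as err:
                    retryable = is_spiky(err.p99_ms, err.threshold_ms)
                    # Re-raise sustained regressions immediately, and give up
                    # once the retry budget is exhausted.
                    if not retryable or attempt == max_retries:
                        raise

        return wrapper

    return decorator
```

In this sketch any other exception propagates untouched, so only latency threshold breaches that look like isolated spikes are ever retried.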
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/47192#018e9da0-836b-4095-93dc-bb860933c7a6
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/47580#018ec50a-e4aa-49a8-8048-da07d7a7abfb
new failures in https://buildkite.com/redpanda/redpanda/builds/47192#018e9da7-cb92-44ab-983a-3ab0d226edc1:
"rptest.tests.simple_e2e_test.SimpleEndToEndTest.test_relaxed_acks.write_caching=True"
new failures in https://buildkite.com/redpanda/redpanda/builds/47367#018ea7b7-527e-4966-b96a-c7d924042ba1:
"rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_decommissioning_rebalancing_node.shutdown_decommissioned=True"
new failures in https://buildkite.com/redpanda/redpanda/builds/47367#018ea7c1-1348-4633-9f31-a800461c5f4c:
"rptest.tests.cloud_storage_timing_stress_test.CloudStorageTimingStressTest.test_cloud_storage_with_partition_moves.cleanup_policy=delete"
new failures in https://buildkite.com/redpanda/redpanda/builds/48410#018f290b-6092-4ad3-9b0b-1c068bd16aa0:
"rptest.tests.upgrade_test.UpgradeBackToBackTest.test_upgrade_with_all_workloads.single_upgrade=False"
/ci-repeat 1