PESDLC-1025 Retry when spikes occur in OMB

Open rpdevmp opened this issue 1 year ago • 4 comments

Detect latency spikes that could be caused by the underlying disks going through 300-1000 ms periods of reduced throughput, and retry the test up to R times.

Backports Required

  • [x] none - not a bug fix
  • [ ] none - this is a backport
  • [ ] none - issue does not exist in previous branches
  • [ ] none - papercut/not impactful enough to backport
  • [ ] v23.3.x
  • [ ] v23.2.x

Release Notes

  • none

Improvements

  • The issue is that the underlying disks go through periods of 300-1000 ms where they have greatly reduced throughput, and this produces a latency spike. This may eventually be solved by https://github.com/redpanda-data/core-internal/issues/1142, but in the meantime we should implement a workaround so that the tests can run and give a reasonable signal without failing spuriously due to this issue.

My suggestion is that if a test fails due to a p99/p999 threshold breach, we check whether it looks like it has spikes, using a simple heuristic based on the time-series percentiles (see e.g. https://github.com/redpanda-data/core-internal/issues/1016#issuecomment-1939676486), and if it has, say, 1-3 spikes, we just retry, up to R times.

The primary goals:

  • Avoid the noise from these spikes, which prevents us from keeping the test enabled because it is too flaky.
  • Still catch regressions that are not of a "spike" nature but where the latency is bad during the whole run or a large portion of it.
  • Only allow spikes like the ones we are seeing to trigger a retry, i.e., not too many of them, since then we could miss some change in RP which causes frequent spikes for a HW reason.
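To make the heuristic and goals above concrete, here is a minimal sketch of one way the check could work. It is not the code in this PR. It assumes the run exposes a list of per-window p99 latencies in milliseconds and that the same threshold applies to each window; the function name and the `max_spikes` and `max_bad_fraction` parameters are hypothetical.

```python
def looks_like_isolated_spikes(window_p99_ms, threshold_ms,
                               max_spikes=3, max_bad_fraction=0.1):
    """Return True if a threshold breach looks like a few short spikes.

    Hypothetical helper, not part of rptest or OMB.
    window_p99_ms: per-window p99 latencies (ms) from one run (assumed input).
    threshold_ms: latency threshold applied to each window (assumption).
    """
    if not window_p99_ms:
        return False

    # Mark every window whose p99 is above the threshold.
    bad = [v > threshold_ms for v in window_p99_ms]

    # A spike is a contiguous run of bad windows; count where runs start.
    spikes = sum(1 for i, b in enumerate(bad) if b and (i == 0 or not bad[i - 1]))

    # Share of the run spent above the threshold.
    bad_fraction = sum(bad) / len(bad)

    # Retry only for a small number of spikes covering a small part of the run.
    # Zero spikes, many spikes, or a long bad stretch all count as real failures.
    return 1 <= spikes <= max_spikes and bad_fraction <= max_bad_fraction
```

With these assumed defaults, a run with two short spikes would qualify for a retry, while a run that is slow for half its duration, or one that spikes every other window, would still fail.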

More details are at: https://github.com/redpanda-data/core-internal/issues/1180

This is a DRAFT that uses mocked data from previous runs with latency. It still needs to be run against actual data and tested further.

rpdevmp avatar Mar 28 '24 01:03 rpdevmp

  1. Created a new decorator, @with_spike_detection_retry(), so it can be reused with different tests; the number of retries can be specified (default is 3).
  2. Added this decorator and the retry logic to "test_max_partitions"; as a next step we will add it to other tests as well.
  3. If we detect a latency failure, we check for spikes and retry.
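For readers following along, a decorator of this shape could be sketched roughly as below. This is an illustration, not the implementation in this PR. The `LatencyThresholdExceeded` exception, its `spike_count` attribute, and the `max_spikes` parameter are hypothetical, and the sketch assumes `retries` means extra attempts after the first run.

```python
import functools


class LatencyThresholdExceeded(AssertionError):
    """Hypothetical error a test raises on a p99/p999 threshold breach."""

    def __init__(self, message, spike_count):
        super().__init__(message)
        # Number of spikes found in the run's latency time series (assumed).
        self.spike_count = spike_count


def with_spike_detection_retry(retries=3, max_spikes=3):
    """Retry a test when its latency failure looks like a few spikes."""

    def decorator(test_fn):
        @functools.wraps(test_fn)
        def wrapper(*args, **kwargs):
            for attempt in range(retries + 1):
                try:
                    return test_fn(*args, **kwargs)
                except LatencyThresholdExceeded as err:
                    spiky = 1 <= err.spike_count <= max_spikes
                    # Re-raise if retries are used up or the failure does not
                    # look like a small number of spikes.
                    if attempt == retries or not spiky:
                        raise

        return wrapper

    return decorator
```

A test method would then be wrapped as `@with_spike_detection_retry()` or `@with_spike_detection_retry(retries=2)`. Failures of any other type pass straight through without a retry.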

rpdevmp avatar Apr 02 '24 05:04 rpdevmp

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/47192#018e9da0-836b-4095-93dc-bb860933c7a6

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/47580#018ec50a-e4aa-49a8-8048-da07d7a7abfb

vbotbuildovich avatar Apr 02 '24 08:04 vbotbuildovich

new failures in https://buildkite.com/redpanda/redpanda/builds/47192#018e9da7-cb92-44ab-983a-3ab0d226edc1:

"rptest.tests.simple_e2e_test.SimpleEndToEndTest.test_relaxed_acks.write_caching=True"

new failures in https://buildkite.com/redpanda/redpanda/builds/47367#018ea7b7-527e-4966-b96a-c7d924042ba1:

"rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_decommissioning_rebalancing_node.shutdown_decommissioned=True"

new failures in https://buildkite.com/redpanda/redpanda/builds/47367#018ea7c1-1348-4633-9f31-a800461c5f4c:

"rptest.tests.cloud_storage_timing_stress_test.CloudStorageTimingStressTest.test_cloud_storage_with_partition_moves.cleanup_policy=delete"

new failures in https://buildkite.com/redpanda/redpanda/builds/48410#018f290b-6092-4ad3-9b0b-1c068bd16aa0:

"rptest.tests.upgrade_test.UpgradeBackToBackTest.test_upgrade_with_all_workloads.single_upgrade=False"

vbotbuildovich avatar Apr 02 '24 08:04 vbotbuildovich

/ci-repeat 1

travisdowns avatar Apr 16 '24 20:04 travisdowns