argo-rollouts

Make analysis succeed faster

Open m8mble opened this issue 3 years ago • 6 comments

Summary

Support succeeding earlier, once sufficiently many successful measurements are available. This aligns the success criteria with the failure / error criteria.

Use Cases

This would support polling for a criterion and immediately exiting the measurement phase once the condition is met. There are two parts for allowing that:

Skipping measurements that can't avoid success

Based on the various error, failure and inconclusive limits, there are points after which the remaining measurement rounds can no longer prevent overall success. This can be used to skip further measurements and succeed early. I've already drafted an initial implementation for this and would value feedback on whether it would be welcome as a PR.
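To make the idea concrete, here is a minimal sketch in Python of the check described above. It is not the drafted implementation and not Argo Rollouts code: the `Limits` class, the `error_limit` field and the `success_is_unavoidable` function are hypothetical names, and the model assumes each limit is a simple total that is violated once more than that many measurements of the given kind have been seen.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Limits:
    """Hypothetical, simplified view of a metric's limits."""
    count: int               # total number of measurements planned
    failure_limit: int       # more failures than this => overall failure
    error_limit: int         # assumption: treated as a plain total here
    inconclusive_limit: int  # more inconclusives than this => inconclusive


def success_is_unavoidable(results: List[str], limits: Limits) -> bool:
    """Return True if no outcome of the remaining rounds can break a limit."""
    remaining = limits.count - len(results)
    if remaining < 0:
        raise ValueError("more results than planned measurements")

    def can_still_exceed(kind: str, limit: int) -> bool:
        # Worst case: every remaining measurement turns out to be `kind`.
        return results.count(kind) + remaining > limit

    return not (
        can_still_exceed("Failure", limits.failure_limit)
        or can_still_exceed("Error", limits.error_limit)
        or can_still_exceed("Inconclusive", limits.inconclusive_limit)
    )
```

With this check, the analysis loop could stop measuring as soon as the function returns True, since the remaining rounds would not change the verdict under the model's assumptions.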

Introduce success limit

The above extension only gets us halfway: assuming a count of 5, we would have to specify error and failure limits of 4 in order to stop upon the first success. But doing so would already conclude success after measuring [Error, Failure], even though the full list of measurements might continue as [Error, Failure, Failure, Error, Error] (i.e. overall error instead of success).

This can be avoided by introducing a successLimit. In analogy to the failureLimit, more than successLimit successful measurements would immediately result in overall success and end the evaluation early. The default for successLimit should be (at least) count - 1 for backwards compatibility. Specifying successLimit = 0 would produce the desired effect in my initial example.
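A rough sketch of how such a successLimit could behave, written in Python purely for illustration. The `evaluate` function and its return values are hypothetical, errors and inconclusive results are ignored to keep the example short, and the assumption that a full run without a limit violation counts as success is part of this simplified model only.

```python
from typing import List, Optional, Tuple


def evaluate(
    results: List[str],
    count: int,
    failure_limit: int,
    success_limit: Optional[int] = None,
) -> Tuple[str, int]:
    """Return (verdict, number of measurements consumed)."""
    if success_limit is None:
        # Proposed backwards-compatible default from the issue text.
        success_limit = count - 1

    successes = failures = measured = 0
    for result in results[:count]:
        measured += 1
        if result == "Success":
            successes += 1
            if successes > success_limit:
                return ("Successful", measured)  # early exit on success
        elif result == "Failure":
            failures += 1
            if failures > failure_limit:
                return ("Failed", measured)      # existing early exit
        # Other results (e.g. "Error") are ignored in this sketch.

    if measured == count:
        # Assumption: no limit violated after all rounds => success.
        return ("Successful", measured)
    return ("Running", measured)
```

With `success_limit=0` the first successful measurement ends the evaluation, while the default of `count - 1` only triggers once every planned measurement has succeeded, which is the same point at which the analysis would have finished anyway.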

I can volunteer to implement this extension if it would be appreciated. Please provide some feedback or comments about requirements or potential implementation details.


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

m8mble avatar Feb 06 '22 10:02 m8mble

This issue is stale because it has been open 60 days with no activity.

github-actions[bot] avatar Nov 12 '22 03:11 github-actions[bot]

This would be an awesome feature and would greatly reduce the time until a new rollout can be marked as healthy and start serving traffic.

tonynguyen-ccl avatar Apr 17 '23 08:04 tonynguyen-ccl

Yes please! We need this, I have exactly the same scenario!

doniyorniazov avatar Sep 13 '23 03:09 doniyorniazov

Hello, here is a workaround. First, enable dryRun:

  dryRun:
  - metricName: my-metric-name

This changes the behavior of Argo Rollouts so that the analysis step doesn't pause when it goes Inconclusive and it doesn't abort the rollout when the step fails. Then set the count to the maximum number of iterations you want. In addition, set the failureLimit to the number of iterations minus one. Finally, set inconclusiveLimit to 0, as below:

  count: 100
  failureLimit: 99
  inconclusiveLimit: 0

Then configure the metric to:

  • never succeed,
  • return failure when it's supposed to continue polling,
  • return inconclusive when it's supposed to stop.

  successCondition: "false"
  failureCondition: "result == 'pending'"

Your analysis step will now poll up to 100 times. As soon as the result no longer matches the failure condition, the measurement is inconclusive, the analysis stops iterating immediately, and the rollout proceeds to the next step (it won't pause, because we're using dry run). If the result still matches the failure condition on the last iteration, the rollout also proceeds to the next step (again because of dry run), even though the result of the step will be a failure.
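The control flow of this workaround can be illustrated with a small Python simulation. This is only a model of the behaviour described in this comment, not Argo Rollouts code: `run_polling_workaround` is a hypothetical function, and it assumes that a measurement matching neither condition counts as inconclusive.

```python
from itertools import islice
from typing import Iterable, Tuple


def run_polling_workaround(
    results: Iterable[str],
    count: int = 100,
    failure_limit: int = 99,
    inconclusive_limit: int = 0,
) -> Tuple[str, int]:
    """Return (verdict, number of iterations performed)."""
    failures = inconclusives = iterations = 0
    for result in islice(results, count):
        iterations += 1
        if result == "pending":
            # failureCondition matched: keep polling until the limit is hit.
            failures += 1
            if failures > failure_limit:
                return ("Failed", iterations)
        else:
            # successCondition is always false, so a non-pending result
            # matches neither condition and is treated as inconclusive.
            inconclusives += 1
            if inconclusives > inconclusive_limit:
                return ("Inconclusive", iterations)
    return ("Successful", iterations)
```

For example, a result stream of two `pending` values followed by any other value stops on the third iteration with an inconclusive verdict, while 100 consecutive `pending` values end in a failure on the last iteration. In both cases the rollout continues because the metric runs in dry-run mode.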

If you want the rollout to actually degrade on failure or pause on inconclusive, you can add another rollout step immediately afterwards, but this time without dryRun, and run the check once more, so that you can return success/failure/inconclusive as desired.

I hope this helps!

akorzy-pl avatar Oct 24 '23 17:10 akorzy-pl