argo-rollouts
Make analysis succeed faster
Summary
Support succeeding earlier, once sufficiently many successful measurements are available. This aligns the success criteria with the failure and error criteria.
Use Cases
This would support polling for a criterion and exiting the measurement phase as soon as the condition is met. Two parts are needed for that:
Skipping measurements that can't avoid success
Based on the various error, failure, and inconclusive limits, there are measurement rounds after which overall success can no longer be avoided. This can be used to skip the remaining measurements and succeed early. I've already drafted an initial implementation for this and would value feedback on whether a PR would be welcome.
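To make the idea concrete, here is a minimal sketch of such a check. It is not the drafted implementation: the names Limits and success_is_unavoidable are hypothetical, and it assumes a simplified model in which every category (including errors) has a total limit and a category only causes a non-successful result once its count exceeds that limit.

```python
from dataclasses import dataclass


@dataclass
class Limits:
    """Hypothetical, simplified per-metric settings (not the real API types)."""
    count: int               # total number of measurements planned
    failure_limit: int       # run fails once failures exceed this
    inconclusive_limit: int  # run is inconclusive once this is exceeded
    error_limit: int         # assumption: total errors, not consecutive ones


def success_is_unavoidable(results: list[str], limits: Limits) -> bool:
    """Return True if no outcome of the remaining measurements could
    push any category past its limit."""
    remaining = limits.count - len(results)
    if remaining < 0:
        raise ValueError("more results than planned measurements")
    # Worst case: every remaining measurement lands in the same category.
    return (
        results.count("Failure") + remaining <= limits.failure_limit
        and results.count("Inconclusive") + remaining <= limits.inconclusive_limit
        and results.count("Error") + remaining <= limits.error_limit
    )


limits = Limits(count=5, failure_limit=4, inconclusive_limit=4, error_limit=4)
print(success_is_unavoidable(["Success"], limits))  # True
print(success_is_unavoidable(["Error"], limits))    # False
```

In this model, the remaining measurements can be skipped as soon as the function returns True.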
Introduce success limit
The above extension only gets us halfway: assuming a count of 5, we would have to specify error and failure limits of 4 in order to stop upon the first success. But doing so would already conclude success after measuring [Error, Failure], even though the full list of measurements might continue as [Error, Failure, Failure, Error, Error] (i.e. overall error instead of success).
This can be avoided by introducing a successLimit. In analogy to the failureLimit, more than successLimit succeeding measurements would immediately result in overall success and end the evaluation early. The default of this successLimit should be (at least) count - 1 for reasons of backwards compatibility. Specifying successLimit = 0 would lead to the desired effect in my initial example.
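The proposed semantics could look roughly like the following sketch. The function assess and its parameters are hypothetical and only model the proposal: a category's phase is reported once its count exceeds its limit (missing limits are assumed to be 0), and overall success is reported once the number of successes exceeds success_limit.

```python
from collections import Counter

# Phase reported when a category exceeds its limit (simplified model).
PHASE_ON_EXCESS = {
    "Failure": "Failed",
    "Error": "Error",
    "Inconclusive": "Inconclusive",
}


def assess(results, count, limits, success_limit=None):
    """Walk through the results in order; return (phase, measurements_taken)."""
    if success_limit is None:
        success_limit = count - 1  # default suggested in the proposal
    seen = Counter()
    for taken, result in enumerate(results[:count], start=1):
        seen[result] += 1
        if result in PHASE_ON_EXCESS and seen[result] > limits.get(result, 0):
            return PHASE_ON_EXCESS[result], taken
        if seen["Success"] > success_limit:
            return "Successful", taken  # early exit on enough successes
    if len(results) < count:
        return "Running", len(results)
    return "Successful", count  # all measurements done, no limit exceeded


results = ["Failure", "Failure", "Success", "Failure", "Failure"]
print(assess(results, count=5, limits={"Failure": 4}, success_limit=0))
# ('Successful', 3)
print(assess(results, count=5, limits={"Failure": 4}))
# ('Successful', 5)
```

With success_limit=0 the evaluation stops at the first success, while the default of count - 1 keeps taking all measurements, as before.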
I can volunteer to implement this extension if it would be appreciated. Please share feedback or comments on requirements or potential implementation details.
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.
This issue is stale because it has been open 60 days with no activity.
This would be an awesome feature and would greatly reduce the time until a new rollout can be marked as healthy and start serving traffic.
Yes please! We need this, I have exactly the same scenario!
Hello, here is a workaround. First, enable dryRun:
dryRun:
- metricName: my-metric-name
This changes the behavior of Argo Rollouts so that the analysis step doesn't pause when it goes Inconclusive and it doesn't abort the rollout when the step fails. Then set the count to the maximum number of iterations you want. In addition, set the failureLimit to the number of iterations minus one. Finally, set inconclusiveLimit to 0, as below:
count: 100
failureLimit: 99
inconclusiveLimit: 0
Then configure the metric to:
- never succeed,
- return failure when it's supposed to continue polling,
- return inconclusive when it's supposed to stop.
successCondition: "false"
failureCondition: "result == 'pending'"
Your analysis step will now poll up to 100 times. As soon as the result no longer matches the failure condition, the measurement is inconclusive, the analysis stops iterating immediately, and the rollout proceeds to the next step (it won't pause, because we're using dry run). If the result still matches the failure condition on the last iteration, the rollout also proceeds to the next step (again because of the dry run), even though the result of the step will be a failure.
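The polling behavior described above can be simulated with a small sketch. It only models the counting logic with the settings from this workaround; classify and poll are hypothetical helper names, not part of Argo Rollouts, and a result that matches neither condition is treated as inconclusive, as described above.

```python
from itertools import islice


def classify(result: str) -> str:
    """One measurement under the workaround's conditions: successCondition
    "false" never matches, failureCondition matches while the result is
    'pending', and anything else is treated as Inconclusive."""
    return "Failure" if result == "pending" else "Inconclusive"


def poll(results, count=100, failure_limit=99, inconclusive_limit=0):
    """Return (phase, measurements_taken) for a stream of metric results."""
    failures = inconclusive = taken = 0
    for taken, result in enumerate(islice(results, count), start=1):
        if classify(result) == "Failure":
            failures += 1
        else:
            inconclusive += 1
        if failures > failure_limit:
            return "Failed", taken
        if inconclusive > inconclusive_limit:
            return "Inconclusive", taken  # condition met: stop polling
    return ("Successful" if taken == count else "Running"), taken


print(poll(["pending", "pending", "ready"]))  # ('Inconclusive', 3)
print(poll(["pending"] * 100))                # ('Failed', 100)
```

The first call stops as soon as the result is no longer pending; the second exhausts all 100 iterations and ends as a failure, which the dry run lets the rollout ignore.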
If you want the rollout to actually degrade on failure or pause on inconclusive, you can add another rollout step immediately afterwards, but this time without dryRun, and run the check once more, so that you can return success/failure/inconclusive as desired.
I hope this helps!