
[FEATURE] Deprovision resources if wait_for_completion_timeout is exceeded

Open yizheliu-amazon opened this issue 6 months ago • 5 comments

Is your feature request related to a problem?

As per current documentation, if wait_for_completion_timeout is exceeded for Provision API, execution continues asynchronously. As a user of the Provision API, I think it would be better to stop provisioning and deprovision the created resources, because a timeout may indicate that the request should fail, which is more aligned with my expectation. Continuing execution asynchronously is effectively the same experience as calling the Provision API without wait_for_completion_timeout.

What solution would you like?

Add one more param, such as fail_on_timeout (boolean). The default is false, which ensures backward compatibility. If it is set to true and wait_for_completion_timeout is exceeded, the request fails and the created resources are deprovisioned.

What alternatives have you considered?

One alternative is to simply change the existing behavior to fail the request and deprovision the created resources when the timeout is exceeded, without adding a fail_on_timeout param.

Do you have any additional context?

N/A

yizheliu-amazon avatar Jun 23 '25 22:06 yizheliu-amazon

> As per current documentation, if wait_for_completion_timeout is exceeded for Provision API, execution continues asynchronously.

~To be clear, only existing in-progress steps (which have already been triggered) continue to completion. The remainder of the workflow is cancelled.~

Actually, I think that's what happens when a node/step timeout occurs, rather than when wait_for_completion_timeout is exceeded.

Still, the node timeouts could be used to enforce desired cancellation behavior.

> As a user of the Provision API, I think it would be better to stop provisioning and deprovision the created resources,

~Agreed, but there's no way to stop a REST call that has already been sent and you're just waiting (asynchronously) for the result. We do "stop" provisioning as best we can.~

An immediate deprovision may miss the most recently provisioned (still in-progress) resource, so we'd need to wait on things, which means more threads monitoring and/or polling for status, or just arbitrarily waiting some amount of time.

This seems to me best handled from the client side.

dbwiddis avatar Jun 24 '25 00:06 dbwiddis

To clear up any confusion: the wait_for_completion_timeout is primarily intended to avoid the need to poll for status, by just delaying the return from the REST call.

Timeouts in the workflow steps, by contrast, end the workflow execution and result in failure.

The two serve different purposes and can be used together to achieve the desired behavior.

dbwiddis avatar Jun 24 '25 00:06 dbwiddis

Revisiting this to update the feature request:

  • automatic deprovisioning under some conditions is probably a very useful feature
  • given wait_for_completion_timeout's design, it's likely not the best option. Its main purpose is to avoid the need to poll, but it has little visibility into the workflow execution.
  • workflows do already fail on node/step timeouts, canceling any pending steps (currently running steps may complete)

So the discussion should really be about the conditions under which we could/should automatically deprovision (and how to convey that failure to the user).

Note there could be some overlap with #537, which applies to "a completed workflow without resources". We could consider having a "failed workflow with no resources" and permit provisioning in this case as well. The key point here is to remove the resources on failure, rather than just calling the deprovision API and resetting the state.

dbwiddis avatar Jul 16 '25 00:07 dbwiddis

Hi @dbwiddis, thanks for the explanation regarding wait_for_completion_timeout.

> Its main purpose is to avoid the need to poll

From a customer perspective, I think I still need to write code to poll the status, because wait_for_completion_timeout may be exceeded. In other words, adding wait_for_completion_timeout may not help me much in avoiding polling, because I always need code to poll the provisioning status in case the timeout is exceeded.

That is why I think automatic deprovisioning after a timeout can simplify the work on the customer side.

Assuming resources are cleaned up when the timeout is exceeded, as a customer I can simply retry with a longer wait_for_completion_timeout; if that fails again, I retry with an even longer one, until the maximum number of retries is reached. And I can be confident that no dangling resources remain if all the retries fail.

Overall, I just need to set wait_for_completion_timeout in my provision request and retry on timeout. I have no need for polling, and no need for manual cleanup if provisioning eventually fails.
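A minimal sketch of that retry loop, assuming the proposed cleanup-on-timeout behavior existed (it is not implemented today). The `provision` callable and the `"DONE"` / `"TIMED_OUT"` return values are hypothetical stand-ins for a real Provision API call, not part of the plugin:

```python
from typing import Callable, Sequence


def provision_with_retries(
    provision: Callable[[float], str],
    timeouts_s: Sequence[float],
) -> bool:
    """Try provisioning once per timeout, in the order given.

    `provision` is a hypothetical callable that sends a provision request
    with the given wait_for_completion_timeout (in seconds) and returns
    "DONE" on success, or "TIMED_OUT" once the server has failed the
    request and deprovisioned everything (the proposed behavior).
    """
    for timeout in timeouts_s:
        if provision(timeout) == "DONE":
            return True
        # Under the proposed behavior nothing is left behind at this point,
        # so the next attempt starts from a clean slate.
    return False


# Stand-in for the real call: pretend provisioning needs 5 seconds.
def fake_provision(timeout_s: float) -> str:
    return "DONE" if timeout_s >= 5 else "TIMED_OUT"
```

With the stand-in above, `provision_with_retries(fake_provision, [2, 4, 8])` succeeds on the third attempt, while `[1, 2]` exhausts its retries.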

Please feel free to let me know your thoughts. Thank you.

yizheliu-amazon avatar Jul 16 '25 17:07 yizheliu-amazon

Hey @yizheliu-amazon I apologize for the slow response to this. Trying to address your comments here.

> Its main purpose is to avoid the need to poll

> From a customer perspective, I think I still need to write code to poll the status, because wait_for_completion_timeout may be exceeded. In other words, adding wait_for_completion_timeout may not help me much in avoiding polling, because I always need code to poll the provisioning status in case the timeout is exceeded.

Prior to adding this feature, all requests needed to poll for results.

The main intent of this feature was to give provisioning requests that were expected to complete relatively quickly (seconds) a "synchronous" API that waits to return until provisioning is complete.

The primary use case we were looking at was expected to complete in less than a second in most cases, and a timeout of 2s or longer would be more than sufficient 99.9% of the time.

In this case, instead of returning immediately with a "hey I'm provisioning, here's the workflow ID, ask me later" response, we wait a few hundred milliseconds and return "It worked! Do the next thing in your workflow!" response. Or, if the provisioning encountered an error, you'd still get a faster 4xx (or 5xx?) response indicating the failure, rather than having to specifically poll status for it.

The return on success gives you the full workflow status, which you can evaluate to determine if provisioning is complete (the "state" will indicate if provisioning is DONE or still IN_PROGRESS).

So in the case that it's DONE, no polling is needed.
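As a rough client-side sketch of this, the response from the provision call can be checked first, with polling used only as a fallback. The `get_status` callable is a hypothetical stand-in for a status request, and the `"DONE"` / `"IN_PROGRESS"` strings follow the wording above; the actual state values returned by the API may differ:

```python
import time
from typing import Callable, Mapping


def wait_until_finished(
    first_response: Mapping[str, str],
    get_status: Callable[[], Mapping[str, str]],
    poll_interval_s: float = 1.0,
    max_polls: int = 30,
) -> str:
    """Return the last observed state, polling only if the first response was not final."""
    # Assumption: a missing "state" field is treated as still in progress.
    state = first_response.get("state", "IN_PROGRESS")
    polls = 0
    while state == "IN_PROGRESS" and polls < max_polls:
        time.sleep(poll_interval_s)
        state = get_status().get("state", "IN_PROGRESS")
        polls += 1
    return state
```

When the first response already reports a final state, `get_status` is never called, which is the case the timeout was designed for.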

> That is why I think automatic deprovisioning after a timeout can simplify the work on the customer side.

The problem here is that we should only deprovision after asynchronous processing is complete. On timeout, there can still be background provisioning. In the ideal use case for this parameter, that doesn't happen; my point is that it's the wrong timeout to use for a failure case.

> Assuming resources are cleaned up when the timeout is exceeded, as a customer I can simply retry with a longer wait_for_completion_timeout; if that fails again, I retry with an even longer one, until the maximum number of retries is reached. And I can be confident that no dangling resources remain if all the retries fail.

There are other timeouts you can use, specifically timeouts on the workflow steps, that will abort the workflow and cancel future steps. These are the ones that seem to apply to your use case.

> Overall, I just need to set wait_for_completion_timeout in my provision request and retry on timeout. I have no need for polling, and no need for manual cleanup if provisioning eventually fails.

The naming of this parameter was intentionally meant to parallel the asynchronous search capability in OpenSearch, where it's returning "partial results" earlier, along with information on getting more complete results later.

> Please feel free to let me know your thoughts. Thank you.

For evaluating a failure case, you should use workflow step timeouts. These will cancel in-progress workflows. And while we haven't yet implemented automatic deprovisioning in this case, these timeouts are the ones that would be best suited to detect failures and respond appropriately. Specifically:

  1. Configure workflow step timeouts for each step
  2. Configure wait_for_completion_timeout to be at least the sum of all the step timeouts
  3. On timeout of any step, the workflow will be cancelled. The in-progress step(s) may continue until completion even if timed out, as the underlying task may not get cancelled.
  4. Wait for completion will return "early" with the workflow failure, allowing immediate client-side deprovisioning without further polling, after a suitable delay.
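The steps above could be sketched on the client side roughly as follows. This is only an illustration under stated assumptions: `provision` and `deprovision` are hypothetical callables wrapping the real API requests, the step names and the `"FAILED"` state string are placeholders, and `settle_delay_s` is an arbitrary value for the "suitable delay":

```python
import time
from typing import Callable, Mapping


def total_wait_timeout_s(step_timeouts_s: Mapping[str, float]) -> float:
    """Step 2: the wait-for-completion timeout is at least the sum of the step timeouts."""
    return sum(step_timeouts_s.values())


def provision_with_cleanup(
    provision: Callable[[float], str],
    deprovision: Callable[[], None],
    step_timeouts_s: Mapping[str, float],
    settle_delay_s: float = 5.0,
) -> str:
    """Provision once; if the workflow reports failure, wait briefly, then deprovision."""
    # `provision` (hypothetical) sends the request with the given
    # wait_for_completion_timeout and returns the workflow state.
    state = provision(total_wait_timeout_s(step_timeouts_s))
    if state == "FAILED":
        # Steps 3-4: a timed-out step may still finish in the background,
        # so give it time to settle before cleaning up.
        time.sleep(settle_delay_s)
        deprovision()
    return state
```

For example, step timeouts of 10 and 20 seconds give a wait-for-completion timeout of at least 30 seconds, so a step timeout always surfaces before the outer wait expires.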

dbwiddis avatar Aug 20 '25 20:08 dbwiddis