Retry policies for isolated (out-of-process) AF Cosmos DB triggers/bindings soon deprecated?

Open HMoen opened this issue 2 years ago • 55 comments

@kshyju's answer below works at the moment, but we recently started to see the following trace in AppInsights for our AF Cosmos DB triggers: "Soon retries will not be supported for function '[Function Name]'. For more information, please visit http://aka.ms/func-retry-policies."

The Retry examples section also states "Retry policies aren't yet supported when running in an isolated process," and the Retries section shows no support for the Cosmos DB trigger/binding.

What's the path forward for AF Cosmos DB triggers running out-of-process?

Yes, retry policies are supported in isolated (out-of-process) function apps. You can enable them by adding the retry section to your host config. Here is an example host.json:

{
  "version": "2.0",
  "retry": {
    "strategy": "fixedDelay",
    "maxRetryCount": 2,
    "delayInterval": "00:00:03"
  }
}

The reason why I'm asking is that the documentation mentions retries require NuGet package Microsoft.Azure.WebJobs >= 3.0.23.

That documentation, which refers to the usage of the ExponentialBackoffRetry attribute, is for in-proc function apps.

Please give it a try and let us know if you run into any problems.

Originally posted by @kshyju in https://github.com/Azure/azure-functions-dotnet-worker/issues/832#issuecomment-1072545934

HMoen avatar Jun 10 '22 23:06 HMoen

Because this is not exclusive to the isolated model, we should move this item to the extension repo. My understanding is that this would be a feature request for the Cosmos DB extension.

mattchenderson avatar Jun 29 '22 18:06 mattchenderson

Will there be any guidance on retries for Cosmos DB triggered functions before the October 2022 deadline when retry policy support will be removed?

Cosmos DB is currently the outlier on the table of retry support for triggers/bindings, with "n/a" and "Not configurable" listed. https://docs.microsoft.com/en-us/azure/azure-functions/functions-bindings-error-pages#retries

MikeNicholls avatar Jul 19 '22 13:07 MikeNicholls

Hi @ealsur Could you please help with this issue?

Ved2806 avatar Jul 26 '22 06:07 Ved2806

Will there be any guidance on retries for Cosmos DB triggered functions before the October 2022 deadline when retry policy support will be removed?

Cosmos DB is currently the outlier on the table of retry support for triggers/bindings, with "n/a" and "Not configurable" listed. https://docs.microsoft.com/en-us/azure/azure-functions/functions-bindings-error-pages#retries

I agree; we built our change feed processing with retries in mind, and we were just about to roll out our solution. While moving to WebJobs with our own change feed processor is a viable option, we would lose other benefits that Azure Functions provides. Any guidance for Cosmos DB change feed retries would be appreciated.

iamarvil avatar Aug 10 '22 01:08 iamarvil

At least at this time, there is no retry policy support in the Cosmos DB extension, either built in or using any of those configurations.

ealsur avatar Aug 15 '22 21:08 ealsur

@ealsur, any plans to implement this before the October 2022 deadline mentioned above? If not, let us know so we can plan our own retry mechanism (queue, etc.). Thanks.

HMoen avatar Aug 16 '22 00:08 HMoen

If I understand the notice correctly, the retry policy is being removed from all triggers that currently support it:

The retry policy support in the runtime for triggers other than Timer and Event Hubs is being removed after this feature becomes generally available (GA). Preview retry policy support for all triggers other than Timer and Event Hubs will be removed in October 2022.

So adding Retry Policy support would not make sense as it does not align with the notice.

We currently have no planned work to add custom retry configuration, but that can change if this is a feature that is widely requested (as an opt-in configuration). Technically, the Trigger could allow infinite retries or no retries (nothing intermediate) because we internally use the Change Feed Processor and those would be the only available options.

ealsur avatar Aug 16 '22 14:08 ealsur

@ealsur, thanks for the feedback. Infinite or none won't work in our environment, unfortunately. We piggybacked off dead-letter queues before and removed that functionality when we found out about the built-in retry mechanism.

Link added to upvote idea.

HMoen avatar Aug 16 '22 16:08 HMoen

The removal of this has a very large impact on many of our production apps that use the change feed (Cosmos DB trigger) with exponential infinite retries. We obviously don't want to lose any messages and rely heavily on this. @ealsur Is there any way to get this added before the deadline, or possibly a sample workaround? I would much rather not bring in more infra (queues) and build something specifically for this in all our apps.

mpaul31 avatar Aug 17 '22 13:08 mpaul31

@mpaul31 I don't understand the request here. Right now the Change Feed Trigger has no retry policy, so it never retries; how are you creating apps that have exponential infinite retries? The deadline in the documentation is the point after which the Retry Policies will be removed; for Cosmos DB they were never there in the first place, so there is no point in adding them if they are going to be removed anyway.

This issue requires input from the Azure Functions team for clarification @Ved2806

ealsur avatar Aug 18 '22 15:08 ealsur

@ealsur We use the following attribute on our change feed trigger method:

[ExponentialBackoffRetry(-1, "00:00:01", "00:00:10")]

It would be awesome if we could configure similar behavior on the CosmosDBTrigger.
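
For context, the full shape of that in-proc function looks roughly like this. This is only a minimal sketch; the database, container, and connection names mirror the example later in the thread and are illustrative:

[FunctionName("ProductsChangeFeed")]
[ExponentialBackoffRetry(-1, "00:00:01", "00:00:10")] // -1 = retry indefinitely
public static void Run(
    [CosmosDBTrigger(
        databaseName: "items-db",
        collectionName: "products-container",
        ConnectionStringSetting = "MyCosmosDbConnStr",
        LeaseCollectionName = "leases")] IReadOnlyList<Document> input,
    ILogger log)
{
    // If this body throws, the in-proc host re-invokes the function with the
    // same batch, backing off between 1 and 10 seconds per the attribute above.
    log.LogInformation($"Documents modified: {input.Count}");
}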

mpaul31 avatar Aug 18 '22 15:08 mpaul31

There is no code in the Cosmos DB Trigger or extension that reacts to [ExponentialBackoffRetry(-1, "00:00:01", "00:00:10")].

The Trigger code in fact has no error retry logic. Here is where your Function code is called: https://github.com/Azure/azure-webjobs-sdk-extensions/blob/dev/src/WebJobs.Extensions.CosmosDB/Trigger/CosmosDBTriggerListener.cs#L225. If there is an execution error, it is simply logged and the next batch of changes is processed. This is mentioned in the Cosmos DB documentation (https://docs.microsoft.com/en-us/azure/cosmos-db/sql/troubleshoot-changefeed-functions#some-changes-are-missing-in-my-trigger).

There isn't (and never was) any logic that would pick up that ExponentialBackoffRetry decorator information in the Cosmos DB Trigger. Whether the Functions runtime is doing something with it I don't know, but the Trigger code always moves forward. My understanding is that the expectation that these RetryPolicy decorators had any impact on the CosmosDBTrigger is incorrect, so basically they are not doing anything; hence, when they get deprecated, nothing will change.

ealsur avatar Aug 19 '22 18:08 ealsur

@ealsur, on this thread https://github.com/Azure/azure-functions-dotnet-worker/issues/832#issuecomment-1072545934, @kshyju mentioned the ExponentialBackoffRetry only works (worked) for in-proc function apps.

HMoen avatar Aug 19 '22 19:08 HMoen

I don't know how that works; from the Trigger's perspective, there is nothing blocking the checkpointing and moving forward if the execution of the code fails.

Once the ProcessChangesAsync method in the Trigger code completes without errors, the state is saved and the next batch is obtained; the execution of the user code can never make the method fail because we are using TryExecuteAsync:

FunctionResult result = await this._executor.TryExecuteAsync(new TriggeredFunctionData() { TriggerValue = docs }, cancellationToken);

For any logic to correctly affect the Trigger and make it retry the current batch (rather than checkpoint), the ProcessChangesAsync method would have to fail. That is how the Change Feed Processor would perform the actual retry.
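
To make that concrete, here is a heavily simplified paraphrase of the flow described above. It is not the actual listener source, and the commented-out throw marks the hypothetical change (not implemented) that would be needed for a batch to be retried instead of checkpointed:

// Simplified paraphrase only, not the real CosmosDBTriggerListener code.
FunctionResult result = await this._executor.TryExecuteAsync(
    new TriggeredFunctionData() { TriggerValue = docs }, cancellationToken);

if (!result.Succeeded)
{
    // Today: the failure is only observed here, ProcessChangesAsync still completes,
    // so the Change Feed Processor checkpoints and moves on to the next batch.

    // Hypothetical: rethrowing would make ProcessChangesAsync fail, which is what
    // would prevent checkpointing and cause the same batch to be retried.
    // throw result.Exception;
}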

There is also no in-proc specific code in the CosmosDBTriggerListener either.

The linked comment is not saying that the configuration applies to CosmosDBTrigger either.

ealsur avatar Aug 19 '22 19:08 ealsur

The issue https://github.com/Azure/azure-functions-dotnet-worker/issues/832, was opened by someone looking specifically for retry logic for CosmosDBTrigger and @kshyju offered the linked comment as the solution. That issue should be updated to reflect your comments.

HMoen avatar Aug 19 '22 20:08 HMoen

I will let @kshyju comment then, as author of the CosmosDBTrigger, I am not aware of any behavior affected by these policies in the Trigger source code.

ealsur avatar Aug 19 '22 21:08 ealsur

The issue Azure/azure-functions-dotnet-worker#832, was opened by someone looking specifically for retry logic for CosmosDBTrigger and @kshyju offered the linked comment as the solution. That issue should be updated to reflect your comments.

The retry option I mentioned is a Functions runtime feature which will retry the function execution when it fails.

For example, if you have a function like this.

[Function("Function1")]
public void Run([CosmosDBTrigger(
    databaseName: "items-db",
    collectionName: "products-container",
    ConnectionStringSetting = "MyCosmosDbConnStr",
    LeaseCollectionName = "leases")] IReadOnlyList<MyDocument> input)
{
    if (input != null && input.Count > 0)
    {
        _logger.LogInformation($"Documents modified: {input.Count}");
        _logger.LogInformation($"First document Id: {input[0].Id}");

        // example to simulate failed function execution
        if (input[0].Name.Contains("foo"))
        {
            throw new ArgumentNullException("Bad name for document");
        }
    }
}

and configure your host.json to enable retry behavior

{
  "version": "2.0",
  "retry": {
    "strategy": "fixedDelay",
    "maxRetryCount": 2,
    "delayInterval": "00:00:03"
  }
}

Let's say you edit some document in your Cosmos DB collection. The Function1 function will be triggered, and if the document name contains "foo", the function code simply throws, which creates a failed function execution. The function runtime (host) will then retry the same trigger after 3 seconds, and this happens 2 times.

So in short, the retry behavior is something the Functions runtime/host does based on what is configured in host.json. Hope this helps.

kshyju avatar Aug 19 '22 22:08 kshyju

@kshyju, thanks for your response. Is the retry behavior you describe above going to remain permanently in the runtime for out-of-proc functions?

I opened this issue because, as stated above, in early June we started to see the following trace in AppInsights for our AF Cosmos DB triggers: "Soon retries will not be supported for function '[Function Name]'. For more information, please visit http://aka.ms/func-retry-policies."

HMoen avatar Aug 19 '22 22:08 HMoen

The function runtime (host) will then retry the same trigger after 3 seconds, and this happens 2 times.

Will the execution be with the same input (basically the same IReadOnlyList<MyDocument> input)? I'm intrigued to understand how this works; I have not added any logic in the Trigger that would stop the checkpointing, unless this decorator is basically stopping the instance?

ealsur avatar Aug 19 '22 22:08 ealsur

@ealsur Yes, the same list. The logic is somewhere in the Functions host, I believe.

kshyju avatar Aug 19 '22 22:08 kshyju

@kshyju, thanks for your response. Is the retry behavior you describe above going to remain permanently in the runtime for out-of-proc functions?

I opened this issue because, as stated above, in early June we started to see the following trace in AppInsights for our AF Cosmos DB triggers: "Soon retries will not be supported for function '[Function Name]'. For more information, please visit http://aka.ms/func-retry-policies."

This is the current behavior as of today. Let me circle back with the team to understand what the migration path is and update you.

kshyju avatar Aug 19 '22 22:08 kshyju

@kshyju Any updates on the migration path or postponing the removal?

mpaul31 avatar Aug 29 '22 14:08 mpaul31

The retry behavior is currently a preview feature. As the deprecation notice mentions, at GA this feature will only be available for the Timer and Event Hubs triggers. Preview features are considered experimental and meant to collect feedback, and not all of them make it to GA.

This will be applicable for both in-proc and isolated.

kshyju avatar Sep 01 '22 00:09 kshyju

I get that, but just like staging slots, which were in preview for a very long time (1+ years), some adopt things sooner out of necessity. What about guidance, then? Will the host just ignore the attribute or stop the app from starting? I'm sure many were excited to see this as it relates to the Cosmos trigger and are now going to face the same fate.

mpaul31 avatar Sep 01 '22 13:09 mpaul31

The fundamental issue is not that CosmosDBTrigger cannot be configured with a retry policy, but that CosmosDBTrigger advances the lease when a Function fails to execute.

I believe this could be solved if CosmosDBTrigger, instead of a retry policy, used exponential backoff to extend the Change Feed polling interval without advancing the lease on a Function failure.

shibayan avatar Sep 03 '22 16:09 shibayan

Just to be clear, the retry policy/attribute does indeed work today, i.e. with lease checkpointing working correctly. It appears to use an in-memory retry policy similar to Polly.

mpaul31 avatar Sep 03 '22 17:09 mpaul31

I get that, but just like staging slots, which were in preview for a very long time (1+ years), some adopt things sooner out of necessity. What about guidance, then? Will the host just ignore the attribute or stop the app from starting? I'm sure many were excited to see this as it relates to the Cosmos trigger and are now going to face the same fate.

Agreed. I imagine this feature will have had a lot of adoption out of necessity over its 1-2 years of availability. With the loss of ExponentialBackoffRetry, transient failures are going to become a lot more expensive again, with knock-on effects on compute, source/target ops, logging, etc. Customers need guidance on what will happen in October and how to mitigate without the out-of-the-box in-memory retry option. Removing preview support for retry patterns seems a strange decision given the alignment towards Azure cloud best practices.
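
For what it's worth, one mitigation that keeps working after the preview policies are removed is retrying inside the function body, for example with Polly. A rough sketch, where ProcessBatchAsync is a placeholder for your own processing logic:

// Illustrative only: in-process retry of the whole batch using Polly.
var retryPolicy = Policy
    .Handle<Exception>()
    .WaitAndRetryAsync(
        retryCount: 5,
        sleepDurationProvider: attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

// If every attempt fails and the exception escapes, the trigger still checkpoints
// and moves on (per the discussion above), so pair this with dead-lettering if
// losing the batch is unacceptable.
await retryPolicy.ExecuteAsync(() => ProcessBatchAsync(input));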

cheekp avatar Sep 05 '22 16:09 cheekp

This is certainly required for Cosmos.

Before I started using this attribute, I encountered issues where "in flight" batches of changes were lost when performing a release.

I was advised (funnily enough by @ealsur!) to use the retry policy to mitigate this here

https://github.com/Azure/azure-webjobs-sdk-extensions/issues/716#issuecomment-824920182

In general, too, the change feed processing may well be writing to things with 99.999% availability, and there needs to be a way to handle the 0.001% without just skipping that batch and moving on.

asos-martinsmith avatar Sep 07 '22 11:09 asos-martinsmith

I was advised (funnily enough by @ealsur!) to use the retry policy to mitigate this here

My response does not advise you to use the retry policies; it is about what happens if your function crashes before checkpointing. I mentioned the documentation about the retry policies that were in preview because my understanding back then was that Functions was building a generic retry layer for all extensions; I did not explicitly say to use it. I don't work on the Functions team and have no insight into the roadmap or plans for their features. I wasn't even aware these retry policies affected the Cosmos DB Trigger (as you can see on this thread).

In general, too, the change feed processing may well be writing to things with 99.999% availability, and there needs to be a way to handle the 0.001% without just skipping that batch and moving on.

There is: a try/catch approach in the Function lets you handle any individual item failure. This pattern is commonly used (because retrying the entire batch is not something you would want to do in some cases): the individual items that failed processing are sent to a queue or another store for later processing or analysis (why did they fail? Is the failure transient?).
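
A minimal sketch of that pattern, where ProcessDocumentAsync and the queue client are placeholders rather than anything provided by the extension:

foreach (MyDocument doc in input)
{
    try
    {
        await ProcessDocumentAsync(doc); // your per-item processing
    }
    catch (Exception ex)
    {
        // Don't fail (or retry) the whole batch: capture the failed item
        // for later analysis or replay, e.g. on an Azure Storage queue.
        await deadLetterQueue.SendMessageAsync(
            JsonSerializer.Serialize(new { doc.Id, Error = ex.Message }));
    }
}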

ealsur avatar Sep 07 '22 14:09 ealsur

There are two separate issues here.

The first is that I have certainly observed that deploying changes to the function app means batches end up being aborted (maybe due to a thread abort) while the lease is still updated, so the net effect is that the batch is just missed. After using this preview feature I did not observe any instances of this. When it is removed I will no longer have confidence in any method of deployment that avoids this (the issue was never got to the bottom of in the linked thread).

The second is that I don't want to provision a whole load of largely redundant resources for the 0.001% case (and write code to then merge from the fallback storage into the main storage). I just want the checkpoint not to be written and the batch to be retried later (as happens with the exponential retry). Is there anything that can be put in the catch block that will achieve the suppression of checkpointing?

(My typical use case is to use the change feed to upsert into a different Cosmos DB container, e.g. to maintain a materialized view of a collection, so pretty much all errors will be transient, such as throttling, and will succeed on retry.)
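
For that throttling scenario, a rough sketch of the upsert path with an in-function back-off for 429s. targetContainer is a Cosmos SDK Container, the attempt limit is arbitrary, and, as discussed above, exhausting the retries still cannot suppress checkpointing:

// Illustrative only: retry transient 429s when upserting into the target container.
foreach (MyDocument doc in input)
{
    for (int attempt = 0; attempt < 5; attempt++)
    {
        try
        {
            await targetContainer.UpsertItemAsync(doc); // maintain the materialized view
            break;
        }
        catch (CosmosException ex) when ((int)ex.StatusCode == 429)
        {
            // Honor the back-off requested by the service, or fall back to a short delay.
            await Task.Delay(ex.RetryAfter ?? TimeSpan.FromSeconds(1));
        }
    }
}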

asos-martinsmith avatar Sep 07 '22 14:09 asos-martinsmith