azure-webjobs-sdk

EventHubTrigger does not retry messages when stopping function with cancellationToken.

Open MaximeJonckheere opened this issue 1 year ago • 6 comments

Our environment uses functions with an EventHubTrigger to process messages from EventHubs. We also use the ExponentialBackoffRetry attribute to retry unhandled exceptions. We use a consumption plan to scale our functions.

We have recurring cases of messages not being processed. Further investigation shows that these are always messages handled by a function instance that is being stopped (when scaling down). In these cases the cancellationToken has been cancelled, which causes an OperationCanceledException. Because we rely on the ExponentialBackoffRetry attribute, we do not catch these exceptions in a try/catch block.
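To illustrate how the cancellation surfaces, here is a minimal sketch that uses only the base class library. It is not our actual function: `HandleEventAsync` and `WasCancelledAsync` are hypothetical names, and a `CancellationTokenSource` stands in for the host stopping the instance.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class CancellationSketch
{
    // Hypothetical handler standing in for the function body.
    public static async Task HandleEventAsync(string body, CancellationToken token)
    {
        // Simulated work that observes the token, as an SDK call typically would.
        await Task.Delay(TimeSpan.FromSeconds(30), token);
        Console.WriteLine($"processed {body}");
    }

    // Returns true when the handler ended with an OperationCanceledException.
    public static async Task<bool> WasCancelledAsync()
    {
        using var cts = new CancellationTokenSource();
        Task work = HandleEventAsync("event-1", cts.Token);

        cts.Cancel(); // assumption: models the host stopping the instance mid-execution

        try
        {
            await work;
            return false;
        }
        catch (OperationCanceledException)
        {
            // Task.Delay throws TaskCanceledException, which derives from this type.
            return true;
        }
    }

    public static async Task Main() =>
        Console.WriteLine($"cancelled: {await WasCancelledAsync()}");
}
```

In the real function nothing catches this exception, so it leaves the function the same way any other unhandled exception would.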

When the function resumes, it processes new messages and the previous messages are lost. It seems that in this case the messages are marked as processed and the checkpoint is updated.

We expected that when a function gets stopped (scaling down, maintenance, …), the checkpoint would not be updated and the messages would be processed in the next function execution.

Is this an incorrect understanding of how it works or is there something missing in how we use the EventHubTrigger?

MaximeJonckheere avatar Jan 10 '23 07:01 MaximeJonckheere

I've just run into the same issue.

The combination seems to be:

  1. When using a retry policy (ExponentialBackoffRetry)
  2. And an OperationCanceledException occurs in an Azure Function execution
    • Note: the OperationCanceledException is caused by scaling down, via the CancellationToken

==> Then the retry policy is ignored and the messages are skipped

Since this results in messages being lost, I would consider this a very high priority issue!

ThomasVandenbon avatar Jan 12 '23 11:01 ThomasVandenbon

Same here. I also tried a fixed-delay retry policy. When the OperationCanceledException happens, the checkpoint still seems to advance and the message (which had not finished processing) is lost. Yes, this seems to be a critical issue.

jack0fshad0ws avatar Jan 14 '23 22:01 jack0fshad0ws

This issue is causing us to lose messages on a daily basis. It causes us to have to extinguish random fires constantly.

It's been 2 weeks, could someone at least confirm that this is being looked at? If not @soninaren, then maybe @mathewc, @brettsam or @fabiocav?

We need to know if in the future we'll be able to rely on the combination of Azure Functions, Event Hubs and Retry attributes or not.

ThomasVandenbon avatar Jan 27 '23 15:01 ThomasVandenbon

Not sure if this should live here or in EventHubs -- @alrod or @sidkri do you know what behavior needs fixing for this?

brettsam avatar Jan 27 '23 16:01 brettsam

> Not sure if this should live here or in EventHubs -- @alrod or @sidkri do you know what behavior needs fixing for this?

@alrod @sidkri Any thoughts?

kshyju avatar Mar 23 '23 18:03 kshyju

Hey, for future readers: the solution is to update Microsoft.Azure.WebJobs.Extensions.EventHubs to 6.0.1 (the newest version at the time of writing this comment).

The condition below was added there: https://github.com/Azure/azure-sdk-for-net/blame/2818ef19e820f502591bce7c79d1c10a50bb01be/sdk/eventhub/Microsoft.Azure.WebJobs.Extensions.EventHubs/src/Listeners/EventHubListener.PartitionProcessor.cs#L228C20-L228C20
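As I understand it, the idea behind that condition is "do not checkpoint if cancellation was requested while the batch was running". Below is a rough sketch of that idea using only the base class library. It is not the extension's actual code: `CheckpointSketch`, `ProcessBatchAsync` and both token parameters are hypothetical, and only the `linkedCts.IsCancellationRequested` check mirrors the referenced line.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for a partition processor; not the extension's real code.
public static class CheckpointSketch
{
    // Returns true when the batch would have been checkpointed.
    public static async Task<bool> ProcessBatchAsync(
        Func<CancellationToken, Task> handler,
        CancellationToken hostShutdownToken,
        CancellationToken partitionToken)
    {
        using var linkedCts = CancellationTokenSource.CreateLinkedTokenSource(
            hostShutdownToken, partitionToken);

        try
        {
            await handler(linkedCts.Token);
        }
        catch (OperationCanceledException)
        {
            // Swallowed only so that this sketch can reach the guard below.
        }

        // The guard: never advance the checkpoint if shutdown interrupted the batch.
        if (linkedCts.IsCancellationRequested)
        {
            return false; // the batch stays un-checkpointed and is read again after restart
        }

        // A real processor would persist the checkpoint here.
        return true;
    }

    public static async Task Main()
    {
        using var shutdown = new CancellationTokenSource();

        bool normal = await ProcessBatchAsync(
            _ => Task.CompletedTask, shutdown.Token, CancellationToken.None);
        Console.WriteLine($"normal run checkpointed: {normal}");

        bool interrupted = await ProcessBatchAsync(
            async token =>
            {
                await Task.CompletedTask;
                shutdown.Cancel(); // simulate scale-down in the middle of the execution
                token.ThrowIfCancellationRequested();
            },
            shutdown.Token, CancellationToken.None);
        Console.WriteLine($"interrupted run checkpointed: {interrupted}");
    }
}
```

With a guard like this, an interrupted batch is redelivered on the next start, so the function code has to be idempotent.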

Long story to describe my context:

My case: I'm using the single-dispatch approach in my function (the function is called with a single EventData object). My goal is to implement exactly-once processing. For that, I have implemented the transactional Inbox pattern on my side, my code is idempotent, and I have added the [FixedDelayRetry(-1, "00:00:03")] attribute to my function. I intentionally want event processing to stall (retrying forever) if an exception is thrown from my function. I'm using a Dedicated hosting plan (App Service Plan), so this is not a problem from a cost perspective. I have added proper monitoring and alerting for such cases. In my case, the worst thing that can happen is to skip an event or to consume events in the wrong order (within a single tenant of my application).

I have checked this locally (by debugging Microsoft.Azure.WebJobs.Extensions.EventHubs built into my function), and it works as expected:

When the function was consuming a single EventData and I stopped the function, the checkpoint was not set. This means that when I started the Azure Function again, it consumed the event that had not been completely processed in the previous run. More precisely, the last batch of events that had not been checkpointed was consumed again. As expected.

One issue I found with that:

In my original function implementation, I registered a callback on the CancellationToken to force a telemetry flush.

    [FunctionName(nameof(MyInboxFunction))]
    [FixedDelayRetry(-1, "00:00:03")]
    public async Task Run(
        [EventHubTrigger(MyHubName, Connection = MyHubConnection, ConsumerGroup = ConsumerGroupName)]
        EventData eventData,
        CancellationToken cancellationToken
    )
    {
        _ = eventData ?? throw new ArgumentNullException(nameof(eventData));

        // The callback runs when the host signals shutdown: track the stop and flush telemetry.
        using (cancellationToken.Register(() =>
               {
                   _telemetryClient.TrackEvent($"{nameof(PayProxyInboxFunction)} stopped");
                   _telemetryClient.Flush();
               }))
        {
            // ... (the rest of the function body is not shown here)

With that registration in place, linkedCts.IsCancellationRequested in the referenced code did not have the expected state, and the checkpoint was still written:

[Screenshot: MicrosoftTeams-image (1)]

I cannot explain that. I spent a few hours debugging why linkedCts.IsCancellationRequested was not true, so I'm sharing my finding for anyone who has updated Microsoft.Azure.WebJobs.Extensions.EventHubs to 6.0.1 and is still facing the issue. After I removed that registration, it started working.

pbukowski-pairsoft avatar Nov 14 '23 08:11 pbukowski-pairsoft