azure-functions-host

Function App restart seems to not wait for shutdown

Open jsquire opened this issue 4 months ago • 2 comments

Issue Transfer

This issue has been transferred from the Azure SDK for .NET repository, #52057.

Please be aware that @JoostLambregts is the author of the original issue; please include them on any questions or replies.

Azure SDK Notes

The original issue raises two concerns, followed by a note on the proposed mitigation:

  • It does not appear that all scenarios for restarting a function application (for example, restarts from the portal) wait for normal shutdown to fully complete. For the Event Hubs listener, this causes partitions to be orphaned rather than properly relinquished, leading to delayed processing.

  • The host does not assign a stable identifier to the Event Hubs listener, which causes the identity of the underlying processor to change during restarts, node migrations, deployment slot swaps, and other scenarios. This results in orphaned partitions, leading to delayed processing.

    An important note here is that the host is currently unable to do so because the property is not exposed. Fixing this requires a feature addition tracked by #52057.

  • The proposed solution in the original issue, adjusting the load balancing properties, is not recommended: in a Functions environment where nodes and scale are dynamic, shortening these intervals increases the risk of ownership drift and instability, which in turn raises the risk of processing rewinds and duplication. (The properties in question are sketched below.)
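
For context, a minimal sketch of the properties in question, as exposed by the standalone Azure.Messaging.EventHubs.Processor SDK (the Functions host configures its processor internally; the interval values below are purely illustrative, not recommendations):

```csharp
using System;
using Azure.Messaging.EventHubs;

// The load-balancing knobs the original issue proposes tuning. Shortening
// the ownership interval makes orphaned partitions recover faster, but it
// also makes healthy owners easier to steal from during transient stalls --
// the instability the note above warns about. Values are illustrative only.
var options = new EventProcessorClientOptions
{
    PartitionOwnershipExpirationInterval = TimeSpan.FromSeconds(30),
    LoadBalancingUpdateInterval = TimeSpan.FromSeconds(10)
};
```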

Details

We have an Azure Function on a dedicated App Service Plan with three nodes, which consumes an Event Hub with three partitions. Whenever one of the function nodes restarts, it takes two minutes before its partitions are consumed again. When the entire function restarts (so all nodes), no events are consumed for a full two minutes.

What I think is happening is that the partition lease doesn't get released when the function instance shuts down, and when the function starts up again the OwnerId used for the lease has changed. The newly started function instances then wait until PartitionOwnershipExpirationInterval has expired before stealing the leases back.
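
One way to see this in action (a diagnostic sketch; it assumes the default blob layout used by the SDK's checkpoint store and the Functions host's default azure-webjobs-eventhub container, with placeholder namespace, hub, and consumer group):

```csharp
using System;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Models;

var container = new BlobContainerClient(
    "<storage-connection-string>",   // placeholder
    "azure-webjobs-eventhub");       // assumed default checkpoint container

// Ownership blobs are assumed to live under:
//   <fully-qualified-namespace>/<event-hub>/<consumer-group>/ownership/<partition-id>
var prefix = "mynamespace.servicebus.windows.net/myhub/$Default/ownership/";

foreach (var blob in container.GetBlobs(traits: BlobTraits.Metadata, prefix: prefix))
{
    // The "ownerid" metadata entry names the processor instance holding the
    // partition; after a restart it keeps pointing at the old instance until
    // PartitionOwnershipExpirationInterval elapses.
    blob.Metadata.TryGetValue("ownerid", out var owner);
    Console.WriteLine($"{blob.Name} -> owner: {owner ?? "(unowned)"}");
}
```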

Part of the issue seems to be that the partition leases don't get released in some situations where they really should be (a function restart from the Azure Portal, for instance). Even if this were fixed, however, there might still be situations where the function crashes before the partition lease can be released. We would therefore really like to be able to set PartitionOwnershipExpirationInterval to a lower value.

Another option could be setting the OwnerId to a value that is unique per host instance / deployment slot combination and is preserved over restarts. No idea if this is possible or whether it has unwanted side effects, though.
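
For what it's worth, the standalone SDK does have such a knob: EventProcessorClientOptions.Identifier. A sketch of the idea, assuming App Service's WEBSITE_INSTANCE_ID and WEBSITE_SLOT_NAME environment variables (the instance ID survives process restarts on the same worker but changes on node migration); connection strings and names are placeholders. Per the notes above, the Functions host can't set this today, which is what #52057 tracks:

```csharp
using System;
using Azure.Messaging.EventHubs;
using Azure.Storage.Blobs;

// Derive a stable identity per host instance / deployment slot. Both
// environment variables are App Service conventions (assumption); fall back
// to the machine name when running elsewhere.
var instanceId = Environment.GetEnvironmentVariable("WEBSITE_INSTANCE_ID")
                 ?? Environment.MachineName;
var slotName = Environment.GetEnvironmentVariable("WEBSITE_SLOT_NAME") ?? "production";

var processor = new EventProcessorClient(
    new BlobContainerClient("<storage-connection-string>", "<checkpoint-container>"),
    "$Default",
    "<event-hubs-connection-string>",
    "<event-hub-name>",
    new EventProcessorClientOptions
    {
        // Reusing the same Identifier after a restart lets the new process
        // treat the old ownership records as its own and resume immediately,
        // instead of waiting for the previous owner's leases to expire.
        Identifier = $"{instanceId}-{slotName}"
    });
```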

jsquire avatar Aug 18 '25 18:08 jsquire

We found a workaround: when we use the Kafka trigger instead of the Event Hubs trigger to consume messages from Event Hubs, we see far less downtime during both function restarts and deployments. It isn't pretty that we need to use the Kafka interface instead of the dedicated Event Hubs trigger, but at least it makes this issue less of a priority for us.
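
For anyone who wants to try the same workaround, this is roughly what the trigger looks like (a sketch assuming the in-process model with Microsoft.Azure.WebJobs.Extensions.Kafka and the Event Hubs Kafka endpoint on port 9093; BrokerList and EventHubConnectionString are app setting names of our choosing):

```csharp
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Kafka;
using Microsoft.Extensions.Logging;

public static class EventHubViaKafka
{
    // Event Hubs exposes a Kafka-compatible endpoint at
    // <namespace>.servicebus.windows.net:9093. The username is the literal
    // string "$ConnectionString"; the password is the connection string,
    // resolved here from the EventHubConnectionString app setting.
    [FunctionName("EventHubViaKafka")]
    public static void Run(
        [KafkaTrigger(
            "BrokerList",              // app setting holding <namespace>.servicebus.windows.net:9093
            "my-event-hub",            // topic name = event hub name (placeholder)
            Username = "$ConnectionString",
            Password = "%EventHubConnectionString%",
            Protocol = BrokerProtocol.SaslSsl,
            AuthenticationMode = BrokerAuthenticationMode.Plain,
            ConsumerGroup = "$Default")]
        KafkaEventData<string> kafkaEvent,
        ILogger logger)
    {
        logger.LogInformation("Received: {Value}", kafkaEvent.Value);
    }
}
```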

JoostLambregts avatar Sep 22 '25 14:09 JoostLambregts

Some additional information: we have only encountered this issue on Linux App Service plans.

JoostLambregts avatar Oct 26 '25 09:10 JoostLambregts