azure-functions-host
Improve host retry behaviour
As part of the ongoing flex work, we recently changed the host to record an "AppFailure" count when it hits a "permanent" app failure. This typically happens when an exception is thrown during host startup:
https://github.com/Azure/azure-functions-host/blob/726c20c29b3433097226f3742eb9b3297f171e6e/src/WebJobs.Script.WebHost/WebJobsScriptHostService.cs#L399-L423
When this kind of failure occurs, however, we still retry starting the host. The host has an indefinite exponential retry (backing off up to a max of 2 minutes). This was required because the platform couldn't react efficiently to unhealthy instances. Today, however, we have a different platform that is aware of permanent app failures and can manage instances better than before. As Fabio mentioned, "we should have behavior consistent with the signals we emit. If we state we're in a permanent failure state, our logic should probably match that."
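To make the current behaviour concrete, here is a minimal Python sketch of an unbounded exponential backoff whose delay is capped at 2 minutes. It is an illustration only, not the host's actual code: the 1-second base delay, the doubling factor, and the name `backoff_delays` are assumptions, and it reads "up to a max of 2 minutes" as a cap on each individual delay.

```python
from itertools import islice
from typing import Iterator


def backoff_delays(base_seconds: float = 1.0,
                   max_seconds: float = 120.0) -> Iterator[float]:
    """Yield an endless sequence of retry delays that doubles up to a cap.

    base_seconds and the doubling factor are assumed values for illustration;
    only the 2-minute cap comes from the issue text.
    """
    delay = base_seconds
    while True:
        yield min(delay, max_seconds)
        # Double the delay for the next attempt, never exceeding the cap.
        delay = min(delay * 2, max_seconds)


# The generator never terminates on its own, which mirrors "indefinite" retry.
first_ten = list(islice(backoff_delays(), 10))
print(first_ten)
```

Under these assumptions the delay reaches the cap on the eighth attempt, and every later retry waits the full 2 minutes with no upper bound on the number of attempts.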
Goal:
We should look into what the platform is capable of today in terms of instance management, and make changes to how the host retries:
- Should we cap retries at a specific number?
  - e.g. If we retry 5 times and still cannot restart the host, emit a permanent failure count and then stop retrying
- Should we make the retry threshold configurable?
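The two options above could be combined roughly as follows. This is a hedged Python sketch rather than a proposed implementation for the host: `RetryOptions`, `StartResult`, and `start_with_capped_retries` are hypothetical names, the default of 5 attempts comes from the example in the list, and the base delay is an assumption.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class RetryOptions:
    """Hypothetical configurable retry settings."""
    max_attempts: int = 5              # threshold from the example above
    base_delay_seconds: float = 1.0    # assumed starting delay
    max_delay_seconds: float = 120.0   # the existing 2-minute cap


@dataclass
class StartResult:
    started: bool
    attempts: int
    permanent_failures: int  # stands in for the emitted failure count


def start_with_capped_retries(start_host: Callable[[], None],
                              options: RetryOptions,
                              sleep: Callable[[float], None] = time.sleep) -> StartResult:
    """Try to start the host, backing off between attempts, then give up."""
    delay = options.base_delay_seconds
    attempts = 0
    for attempt in range(1, options.max_attempts + 1):
        attempts = attempt
        try:
            start_host()
            return StartResult(started=True, attempts=attempts, permanent_failures=0)
        except Exception:
            if attempt == options.max_attempts:
                break  # retry budget exhausted; do not sleep again
            sleep(min(delay, options.max_delay_seconds))
            delay = min(delay * 2, options.max_delay_seconds)
    # Record a single permanent failure and stop retrying, so the behaviour
    # matches the signal being emitted.
    return StartResult(started=False, attempts=attempts, permanent_failures=1)
```

Passing `sleep` as a parameter keeps the sketch testable without real waits. Making `max_attempts` part of an options object is one way the threshold could be made configurable; where such a setting would actually live is an open design question.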
To help support any design decisions made here, we should look at logs to see how often we see the host coming back after 3 attempts vs how often it retries indefinitely.