Do not shut down Host on hard failures that need user action
- For host ID collisions, the Functions host is shut down here:
https://github.com/Azure/azure-functions-host/blob/a14c989a5f1d736478f3c4d4be9b2b5ea4dda6ef/src/WebJobs.Script/Host/HostIdValidator.cs#L129
- For App Content Initialization failures on the platform, where the host reports "Shutting down host due to presence of C:\home\site\wwwroot\FAILED TO INITIALIZE RUN FROM PACKAGE.txt", the Functions host is shut down here:
https://github.com/Azure/azure-functions-host/blob/a14c989a5f1d736478f3c4d4be9b2b5ea4dda6ef/src/WebJobs.Script.WebHost/WebJobsScriptHostService.cs#L204
Both of these are fatal errors, i.e. the user needs to take action to fix the app content or the host ID.
Ask: Instead of shutting down the host, leave it running in an Error state. This would:
- Allow platform components such as Scale Controller to ping the Functions host status endpoint and ensure scaling requests are not made beyond one worker
- Avoid wasted restart cycles in DWAS, which as of now keeps restarting the Functions host on a hard failure
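A rough sketch of the behavior being asked for, written in Python for brevity (the host itself is C#). Every name here (`HostState`, `FatalUserError`, `Host.handle_failure`, `Host.status`) is hypothetical and not taken from the host codebase. The only idea carried over from the ask is that a failure needing user action leaves the host running in an Error state that a status endpoint can report.

```python
from enum import Enum


class HostState(Enum):
    RUNNING = "Running"
    ERROR = "Error"
    STOPPED = "Stopped"


class FatalUserError(Exception):
    """Hypothetical marker for errors only the user can fix,
    e.g. a host ID collision or failed app content initialization."""


class Host:
    def __init__(self):
        self.state = HostState.RUNNING
        self.errors = []

    def handle_failure(self, exc):
        if isinstance(exc, FatalUserError):
            # Proposed: stay up in an Error state so the platform can see it.
            self.state = HostState.ERROR
            self.errors.append(str(exc))
        else:
            # Possibly transient: keep the existing shutdown/restart path.
            self.state = HostState.STOPPED

    def status(self):
        # Stand-in for what a status endpoint could return (assumed shape).
        return {"state": self.state.value, "errors": list(self.errors)}
```

With something like this, a platform component polling the status could stop scaling out or restarting once it sees the Error state, instead of inferring failure from repeated crashes.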
Tagging @fabiocav / @mathewc for input.
cc @glennamanns @balag0 @chiangvincent fyi
There are several other situations as well in addition to those you list above where the host determines there's an unrecoverable error and calls StopApplication. Now some of these as you point out are configuration issues the user will need to fix, while others may be environmental/temporary and we're restarting in an attempt to recover. For issues in the latter category, I think the possibility of the host undergoing constant recycles will continue, so it seems we'll need a general solution for this that is outside of the host. E.g. it may not even be the host itself shutting things down explicitly, but it may be crashing. How do we deal with those cases and avoid scale out?
Yes, we were concerned about those other scenarios too. I think we see enough CRIs that maybe we should do a pointed change targeting just this common unrecoverable scenario and not try to solve for a general case as it seems more complicated?
Thanks for the feedback @mathewc. As @balag0 mentioned, scoping the work to the scenarios pointed out above is a good starting point. Over time we can expand the checks: for "recoverable" errors, host restarts as built today will continue to work, but for any fatal errors that require user action, the status API should reflect the error. This will help the platform and the Cx take action based on the status endpoint result.
Assigning this to @brettsam as part of a collection of similar issues he is currently investigating.
@balag0 / @pragnagopa / @mathewc -- can you tell me where the scale controller looks to tell if there's an error state? Is it /admin/host/ping?
@kashimiz and I investigated issue #9059 and realized that /admin/host/ping doesn't seem to consult the ScriptHostState when getting scale controller pings. Instead, it checks if the site is under load and returns. It feels like we need to adjust that behavior as part of this.
It seems that if the perf check comes back okay, we should then proceed to check the host status and return that. I was going to start working on this (and other related things) but wanted to consult with you folks first.
https://github.com/Azure/azure-functions-host/blob/016d6b03fc115da5e2ebea6caa7e4ad37140a7f7/src/WebJobs.Script.WebHost/Controllers/HostController.cs#L216
https://github.com/Azure/azure-functions-host/blob/016d6b03fc115da5e2ebea6caa7e4ad37140a7f7/src/WebJobs.Script/Scale/HostPerformanceManager.cs#L66
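To make the suggested ordering concrete, here is a minimal sketch in Python (not the host's C#). The function `handle_ping` and its parameters are hypothetical, and the 429 and 503 status codes are assumptions chosen for illustration, not confirmed host behavior. It only shows the idea: run the load check first and, if that passes, report the host state rather than returning success unconditionally.

```python
def handle_ping(is_under_high_load: bool, host_state: str) -> int:
    """Return an HTTP status code for a scale controller ping.

    is_under_high_load: result of the existing perf check (assumed boolean).
    host_state: current host state name, e.g. "Running" or "Error" (assumed).
    """
    if is_under_high_load:
        # Perf check failed: signal load, as the ping does today (code assumed).
        return 429
    if host_state == "Error":
        # Proposed addition: surface an unrecoverable host error to the caller.
        return 503
    return 200
```

The load check keeps priority here so existing scale-out signals are unchanged; the host state only affects the response when the site is not under load.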
@balag0 / @pragnagopa can you please look at @brettsam's question above?
Apologies for the delayed response!
As of now SC only pings /admin/host/ping and can only detect when the worker ping times out. Please reach out offline to discuss more details as needed.
@brettsam will schedule an offline discussion on this