
Do not shutdown Host on hard failures that need user actions

Open pragnagopa opened this issue 3 years ago • 4 comments

  1. For issues with host ID collisions:

    The Functions host is shut down here:

    https://github.com/Azure/azure-functions-host/blob/a14c989a5f1d736478f3c4d4be9b2b5ea4dda6ef/src/WebJobs.Script/Host/HostIdValidator.cs#L129

  2. For issues with App Content Initialization on the platform ("Shutting down host due to presence of C:\home\site\wwwroot\FAILED TO INITIALIZE RUN FROM PACKAGE.txt"):

    The Functions host is shut down here:

    https://github.com/Azure/azure-functions-host/blob/a14c989a5f1d736478f3c4d4be9b2b5ea4dda6ef/src/WebJobs.Script.WebHost/WebJobsScriptHostService.cs#L204

Both of these are fatal errors, i.e. the user needs to take action to fix the app content or the host ID.

Ask: Instead of shutting down the host, leave the host running in an Error state. This will:

  1. Allow platform components such as Scale Controller to ping the Functions host status endpoint and ensure scaling requests are not made beyond one worker.
  2. Avoid DWAS restart cycles on hard failures. As of now, DWAS continues to restart the Functions host.
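
To make the first point concrete, here is a minimal sketch of how a scaler could cap the worker count once the host reports an Error state. It is not Scale Controller code: `HostState`, its values, and `cap_worker_count` are hypothetical names used only for illustration.

```python
from enum import Enum


class HostState(Enum):
    # Hypothetical subset of host states, for illustration only.
    RUNNING = "Running"
    ERROR = "Error"


def cap_worker_count(reported_state: HostState, requested_workers: int) -> int:
    """Return how many workers a scaler should allow for the reported host state."""
    if requested_workers < 1:
        raise ValueError("requested_workers must be at least 1")
    if reported_state is HostState.ERROR:
        # A host in Error state needs user action; additional workers
        # would hit the same failure, so never go beyond one worker.
        return 1
    return requested_workers
```

For example, `cap_worker_count(HostState.ERROR, 5)` returns 1, while a running host keeps the requested count.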

pragnagopa avatar Sep 15 '22 00:09 pragnagopa

Tagging @fabiocav / @mathewc for input.

cc @glennamanns @balag0 @chiangvincent fyi

pragnagopa avatar Sep 15 '22 00:09 pragnagopa

There are several other situations as well in addition to those you list above where the host determines there's an unrecoverable error and calls StopApplication. Now some of these as you point out are configuration issues the user will need to fix, while others may be environmental/temporary and we're restarting in an attempt to recover. For issues in the latter category, I think the possibility of the host undergoing constant recycles will continue, so it seems we'll need a general solution for this that is outside of the host. E.g. it may not even be the host itself shutting things down explicitly, but it may be crashing. How do we deal with those cases and avoid scale out?

mathewc avatar Sep 15 '22 21:09 mathewc

Yes, we were concerned about those other scenarios too. I think we see enough CRIs that maybe we should do a pointed change targeting just this common unrecoverable scenario and not try to solve for a general case as it seems more complicated?

balag0 avatar Sep 30 '22 00:09 balag0

Thanks for the feedback @mathewc. As @balag0 mentioned, scoping the work to the scenarios pointed out above is a good starting point. Over time we can expand the checks: for "recoverable" errors, host restarts as built today will continue to work, but for any fatal errors that require user action, the status API should reflect the error. This will help the platform and the Cx take action based on the status endpoint result.
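
A rough sketch of that split, assuming the host can tag each failure with an error code. The error codes, the `Action` enum, the `HostStatus` shape, and the state strings are all hypothetical and do not come from the host codebase.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Action(Enum):
    RESTART_HOST = "restart_host"
    STAY_IN_ERROR = "stay_in_error"


# Hypothetical codes for the two scenarios scoped in this issue.
USER_ACTION_REQUIRED = {"host_id_collision", "run_from_package_init_failed"}


@dataclass
class HostStatus:
    state: str
    error: Optional[str] = None


def handle_failure(error_code: str) -> Tuple[Action, HostStatus]:
    """Decide what to do with a failure and what the status API should report."""
    if error_code in USER_ACTION_REQUIRED:
        # Fatal, user-actionable: keep the host up and surface the error.
        return Action.STAY_IN_ERROR, HostStatus(state="Error", error=error_code)
    # Anything else is treated as potentially recoverable: restart as today.
    return Action.RESTART_HOST, HostStatus(state="Restarting")
```

Keeping the user-actionable codes in one explicit set means the scope can be widened later without touching the restart path.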

pragnagopa avatar Oct 03 '22 18:10 pragnagopa

Assigning this to @brettsam as part of a collection of similar issues he is currently investigating.

fabiocav avatar Feb 08 '23 21:02 fabiocav

@balag0 / @pragnagopa / @mathewc -- can you tell me where the scale controller looks to tell if there's an error state? Is it /admin/host/ping?

@kashimiz and I investigated issue #9059 and realized that /admin/host/ping doesn't seem to consult the ScriptHostState when getting scale controller pings. Instead, it checks if the site is under load and returns. It feels like we need to adjust that behavior as part of this.

It seems that if the perf check comes back okay, we should then proceed to check the host status and return that. I was going to start working on this (and other related things) but wanted to consult with you folks first.

https://github.com/Azure/azure-functions-host/blob/016d6b03fc115da5e2ebea6caa7e4ad37140a7f7/src/WebJobs.Script.WebHost/Controllers/HostController.cs#L216

https://github.com/Azure/azure-functions-host/blob/016d6b03fc115da5e2ebea6caa7e4ad37140a7f7/src/WebJobs.Script/Scale/HostPerformanceManager.cs#L66
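
The proposed ordering (perf check first, then host state) could look roughly like the sketch below. This is not the actual `HostController` logic: `handle_ping`, the two callbacks, the state strings, and the specific status codes are placeholders chosen for illustration.

```python
from typing import Callable, Tuple


def handle_ping(
    is_under_high_load: Callable[[], bool],
    get_host_state: Callable[[], str],
) -> Tuple[int, str]:
    """Return a (status_code, detail) pair for a scale controller ping."""
    # Step 1: keep the existing behavior and report load first.
    if is_under_high_load():
        return 429, "Overloaded"
    # Step 2 (the proposed addition): consult the host state
    # only after the perf check passes.
    state = get_host_state()
    if state == "Error":
        return 503, state
    return 200, state
```

With this ordering, `handle_ping(lambda: False, lambda: "Error")` returns `(503, "Error")`, so a caller can tell a failed host apart from a healthy, idle one.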

brettsam avatar Feb 10 '23 16:02 brettsam

@balag0 / @pragnagopa can you please look at @brettsam 's question above?

fabiocav avatar Mar 15 '23 20:03 fabiocav

Apologies for the delayed response!

As of now, SC (Scale Controller) only pings /admin/host/ping and can only detect when the worker ping is timing out. Please reach out offline to discuss more details as needed.

pragnagopa avatar Apr 26 '23 21:04 pragnagopa

@brettsam will schedule an offline discussion on this

fabiocav avatar May 31 '23 20:05 fabiocav