Fix GrpcWorkerChannel.StartWorkerProcessAsync timeout
Issue describing the changes in this PR
resolves #issue_for_this_pr
Pull request checklist
IMPORTANT: Currently, changes must be backported to the in-proc branch to be included in Core Tools and non-Flex deployments.
- [x] Backporting to the
in-procbranch is not required- Otherwise: Link to backporting PR
- [x] My changes do not require documentation changes
- [ ] Otherwise: Documentation issue linked to PR
- [ ] My changes should not be added to the release notes for the next release
- [ ] Otherwise: I've added my notes to
release_notes.md-- TODO
- [ ] Otherwise: I've added my notes to
- [x] My changes do not need to be backported to a previous version
- [ ] Otherwise: Backport tracked by issue/PR #issue_or_pr
- [x] My changes do not require diagnostic events changes
- Otherwise: I have added/updated all related diagnostic events and their documentation (Documentation issue linked to PR)
- [x] I have added all required tests (Unit tests, E2E tests) -- TODO
Additional information
This PR improves the ScriptHost startup experience with a bad worker. Today, if a worker crashes or exits immediately after startup, then the GrpcWorkerChannel.StartWorkerProcessAsync will block on _workerInitTask.Task until it eventually times out. This tends to fault the entire host (at least during debugging).
To address this, a WorkerProcess.WaitForExitAsync is added and GrpcWorkerChannel.StartWorkerProcessAsync will also wait on that, improving the responsiveness to a worker exiting before connecting gRPC events.
/azp run host.integration-tests
Azure Pipelines successfully started running 1 pipeline(s).
/azp run host.public
Azure Pipelines successfully started running 1 pipeline(s).
@jviau @mattchenderson I was just investigating this yesterday! Turns out this is the same reason that we are stuck for 60s when trying to exit the CLI during startup (https://github.com/Azure/azure-functions-core-tools/issues/4355). Glad to see there is a already a fix before I opened one :) I'm going to test your changes out with core tools and see if it addresses the issue like I expect.
Looks like it works?
2025-07-18T16:13:56.776Z] Reading functions metadata (Worker)
[2025-07-18T16:13:57.920Z] {
[2025-07-18T16:13:57.920Z] "ProcessId": 26145,
[2025-07-18T16:14:00.842Z] "RuntimeIdentifier": "osx-arm64",
[2025-07-18T16:14:00.842Z] "WorkerVersion": "2.0.0.0",
[2025-07-18T16:14:00.842Z] "ProductVersion": "2.0.0\u002Bd8b5fe998a8c92819b8ee41d2569d2525413e9c5",
[2025-07-18T16:14:00.842Z] "FrameworkDescription": ".NET 9.0.7",
[2025-07-18T16:14:00.842Z] "OSDescription": "Darwin 24.5.0 Darwin Kernel Version 24.5.0: Tue Apr 22 19:54:29 PDT 2025; root:xnu-11417.121.6~2/RELEASE_ARM64_T6030",
[2025-07-18T16:14:00.842Z] "OSArchitecture": "Arm64",
[2025-07-18T16:14:00.842Z] "CommandLine": "/Users/likasem/source/functions/azure-functions-core-tools/_dniapp/bin/output/_dniapp.dll --host 127.0.0.1 --port 50779 --workerId 3246a75a-129e-438f-9c2a-1602804bdd04 --requestId 08ba6582-1eea-4e25-b3ac-cc654b982ecd --grpcMaxMessageLength 2147483647 --functions-uri http://127.0.0.1:50779/ --functions-worker-id 3246a75a-129e-438f-9c2a-1602804bdd04 --functions-request-id 08ba6582-1eea-4e25-b3ac-cc654b982ecd --functions-grpc-max-message-length 2147483647"
[2025-07-18T16:14:00.842Z] }
^C <I hit ctrl c here>
[2025-07-18T16:14:09.119Z] Failed to start language worker process for runtime: dotnet-isolated. workerId:3246a75a-129e-438f-9c2a-1602804bdd04
[2025-07-18T16:14:45.437Z] Removing errored webhost language worker channel for runtime: dotnet-isolated workerId:3246a75a-129e-438f-9c2a-1602804bdd04
[2025-07-18T16:14:45.437Z] System.AggregateException: One or more errors occurred. (Worker process exited before initializing.)
[2025-07-18T16:14:45.437Z] ---> Microsoft.Azure.WebJobs.Script.Workers.WorkerProcessExitException: Worker process exited before initializing.
[2025-07-18T16:14:45.437Z] at Microsoft.Azure.WebJobs.Script.Grpc.GrpcWorkerChannel.StartWorkerProcessAsync(CancellationToken cancellationToken) in /Users/likasem/source/functions/azure-functions-host/src/WebJobs.Script.Grpc/Channel/GrpcWorkerChannel.cs:line 397
[2025-07-18T16:14:45.437Z] --- End of inner exception stack trace ---
[2025-07-18T16:14:45.438Z] at Microsoft.Azure.WebJobs.Script.WorkerFunctionMetadataProvider.GetFunctionMetadataAsync(IEnumerable`1 workerConfigs, Boolean forceRefresh) in /Users/likasem/source/functions/azure-functions-host/src/WebJobs.Script/Host/WorkerFunctionMetadataProvider.cs:line 117
[2025-07-18T16:15:18.940Z] A host error has occurred during startup operation '393e57d4-33aa-48c8-8b45-6234887f2d7c'.
[2025-07-18T16:15:18.940Z] System.AggregateException: One or more errors occurred. (Worker process exited before initializing.)
[2025-07-18T16:15:18.940Z] ---> Microsoft.Azure.WebJobs.Script.Workers.WorkerProcessExitException: Worker process exited before initializing.
[2025-07-18T16:15:18.940Z] at Microsoft.Azure.WebJobs.Script.Grpc.GrpcWorkerChannel.StartWorkerProcessAsync(CancellationToken cancellationToken) in /Users/likasem/source/functions/azure-functions-host/src/WebJobs.Script.Grpc/Channel/GrpcWorkerChannel.cs:line 397
[2025-07-18T16:15:18.940Z] --- End of inner exception stack trace ---
[2025-07-18T16:15:18.940Z] at Microsoft.Azure.WebJobs.Script.WorkerFunctionMetadataProvider.GetFunctionMetadataAsync(IEnumerable`1 workerConfigs, Boolean forceRefresh) in /Users/likasem/source/functions/azure-functions-host/src/WebJobs.Script/Host/WorkerFunctionMetadataProvider.cs:line 117
[2025-07-18T16:15:18.940Z] at System.ExceptionExtensions.ThrowIfErrorsPresent(IList`1 exceptions, String message) in /Users/likasem/source/functions/azure-functions-host/src/WebJobs.Script/Extensions/ExceptionExtensions.cs:line 47
[2025-07-18T16:15:18.940Z] at Microsoft.Azure.WebJobs.Script.WorkerFunctionMetadataProvider.GetFunctionMetadataAsync(IEnumerable`1 workerConfigs, Boolean forceRefresh) in /Users/likasem/source/functions/azure-functions-host/src/WebJobs.Script/Host/WorkerFunctionMetadataProvider.cs:line 149
[2025-07-18T16:15:18.940Z] at Microsoft.Azure.WebJobs.Script.WebHost.FunctionMetadataProvider.GetFunctionMetadataAsync(IEnumerable`1 workerConfigs, Boolean forceRefresh) in /Users/likasem/source/functions/azure-functions-host/src/WebJobs.Script/Host/FunctionMetadataProvider.cs:line 45
[2025-07-18T16:15:18.940Z] at Microsoft.Azure.WebJobs.Script.FunctionMetadataManager.LoadFunctionMetadata(Boolean forceRefresh, Boolean includeCustomProviders) in /Users/likasem/source/functions/azure-functions-host/src/WebJobs.Script/Host/FunctionMetadataManager.cs:line 151
[2025-07-18T16:15:18.940Z] at Microsoft.Azure.WebJobs.Script.FunctionMetadataManager.GetFunctionMetadata(Boolean forceRefresh, Boolean applyAllowList, Boolean includeCustomProviders) in /Users/likasem/source/functions/azure-functions-host/src/WebJobs.Script/Host/FunctionMetadataManager.cs:line 101
[2025-07-18T16:15:18.940Z] at Microsoft.Azure.WebJobs.Script.DependencyInjection.ExternalConfigurationStartupValidator.Validate(IConfigurationRoot originalConfig) in /Users/likasem/source/functions/azure-functions-host/src/WebJobs.Script/DependencyInjection/ExternalConfigurationStartupValidator.cs:line 42
[2025-07-18T16:15:18.940Z] at Microsoft.Azure.WebJobs.Script.DependencyInjection.ExternalConfigurationStartupValidatorService.StartAsync(CancellationToken cancellationToken) in /Users/likasem/source/functions/azure-functions-host/src/WebJobs.Script/DependencyInjection/ExternalConfigurationStartupValidatorService.cs:line 33
[2025-07-18T16:15:18.940Z] at Microsoft.Extensions.Hosting.Internal.Host.<StartAsync>b__15_1(IHostedService service, CancellationToken token)
[2025-07-18T16:15:18.940Z] at Microsoft.Extensions.Hosting.Internal.Host.ForeachService[T](IEnumerable`1 services, CancellationToken token, Boolean concurrent, Boolean abortOnFirstException, List`1 exceptions, Func`3 operation)
[2025-07-18T16:15:18.940Z] at Microsoft.Extensions.Hosting.Internal.Host.StartAsync(CancellationToken cancellationToken)
[2025-07-18T16:15:18.940Z] at Microsoft.Azure.WebJobs.Script.WebHost.WebJobsScriptHostService.UnsynchronizedStartHostAsync(ScriptHostStartupOperation activeOperation, Int32 attemptCount, JobHostStartupMode startupMode) in /Users/likasem/source/functions/azure-functions-host/src/WebJobs.Script.WebHost/WebJobsScriptHostService.cs:line 417
[2025-07-18T16:15:18.945Z] Stopping JobHost
[2025-07-18T16:15:18.965Z] Failed to stop host instance '668de38c-aac1-4cab-98de-2bd581e33a16'.
[2025-07-18T16:15:18.965Z] System.AggregateException: One or more hosted services failed to stop. (The host has not yet started.) (Object reference not set to an instance of an object.)
[2025-07-18T16:15:18.965Z] ---> System.InvalidOperationException: The host has not yet started.
[2025-07-18T16:15:18.965Z] at Microsoft.Azure.WebJobs.JobHost.StopAsync() in D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\JobHost.cs:line 119
[2025-07-18T16:15:18.965Z] at Microsoft.Azure.WebJobs.Hosting.JobHostService.StopAsync(CancellationToken cancellationToken) in D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Hosting\JobHostService.cs:line 32
[2025-07-18T16:15:18.965Z] at Microsoft.Extensions.Hosting.Internal.Host.ForeachService[T](IEnumerable`1 services, CancellationToken token, Boolean concurrent, Boolean abortOnFirstException, List`1 exceptions, Func`3 operation)
[2025-07-18T16:15:18.965Z] --- End of inner exception stack trace ---
[2025-07-18T16:15:18.965Z] at Microsoft.Extensions.Hosting.Internal.Host.StopAsync(CancellationToken cancellationToken)
[2025-07-18T16:15:18.965Z] at Microsoft.Azure.WebJobs.Script.WebHost.WebJobsScriptHostService.Orphan(IHost instance, CancellationToken cancellationToken) in /Users/likasem/source/functions/azure-functions-host/src/WebJobs.Script.WebHost/WebJobsScriptHostService.cs:line 897
[2025-07-18T16:15:18.965Z] ---> (Inner Exception #1) System.NullReferenceException: Object reference not set to an instance of an object.
[2025-07-18T16:15:18.965Z] at Microsoft.Azure.WebJobs.Hosting.OptionsLoggingService.StopAsync(CancellationToken cancellationToken) in D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Hosting\OptionsFormatter\OptionsLoggingService.cs:line 38
[2025-07-18T16:15:18.965Z] at Microsoft.Extensions.Hosting.Internal.Host.ForeachService[T](IEnumerable`1 services, CancellationToken token, Boolean concurrent, Boolean abortOnFirstException, List`1 exceptions, Func`3 operation)<---
[2025-07-18T16:15:18.965Z]
[2025-07-18T16:15:19.018Z] Host startup operation has been canceled
[2025-07-18T16:15:19.018Z] Initialization cancellation requested by runtime.
[2025-07-18T16:15:19.030Z] Stopping host...
[2025-07-18T16:15:19.031Z] Host shutdown completed.
[2025-07-18T16:15:19.032Z] Shutting down language worker channels for runtime:dotnet-isolated
[2025-07-18T16:15:19.063Z] Stopping the flush logs timer.
Before this change, it would hang for 60 seconds waiting on the timeout from PendingItem.
https://github.com/Azure/azure-functions-host/blob/a20b3070a3246578dd3d2a55f2f23266d484cdf5/src/WebJobs.Script.Grpc/Channel/GrpcWorkerChannel.cs#L1748
For my version of the fix, I was considering passing a CT to PendingItem and registering an OnCancelled event so that we do not have to wait the entire timeout period if there is a cancellation event. But I couldn't find the right CT propagate, and CT for lifetime.ApplicationStopping was not being canceled until after the timeout occured.
edit: more logs - this is outside of my debug session - it takes 5 seconds to shutdown after we cancel:
[2025-07-18T16:23:10.832Z] Reading functions metadata (Worker)
^CPress Ctrl+C again to force exit.
[2025-07-18T16:23:11.360Z] Failed to start language worker process for runtime: dotnet-isolated. workerId:15471ba1-bb7d-4a7f-967c-46edb899566c
[2025-07-18T16:23:11.360Z] Language Worker Process exited. Pid=31405.
[2025-07-18T16:23:11.360Z] Microsoft.Azure.WebJobs.Script: dotnet exited with code 134 (0x86). Unhandled exception. System.Threading.Tasks.TaskCanceledException: A task was canceled.
[2025-07-18T16:23:11.360Z] Removing errored webhost language worker channel for runtime: dotnet-isolated workerId:15471ba1-bb7d-4a7f-967c-46edb899566c
[2025-07-18T16:23:11.360Z] Microsoft.Azure.WebJobs.Script: dotnet exited with code 134 (0x86). Unhandled exception. System.Threading.Tasks.TaskCanceledException: A task was canceled.
[2025-07-18T16:23:11.652Z] A host error has occurred during startup operation '790aff1d-8f28-4cb7-8c1d-3b1000394db4'.
[2025-07-18T16:23:11.652Z] Microsoft.Azure.WebJobs.Script: dotnet exited with code 134 (0x86). Unhandled exception. System.Threading.Tasks.TaskCanceledException: A task was canceled.
[2025-07-18T16:23:11.657Z] Stopping JobHost
[2025-07-18T16:23:11.658Z] Failed to stop host instance '45b772e9-bc50-43b5-bd5b-00d6d2faff90'.
[2025-07-18T16:23:11.658Z] Microsoft.Azure.WebJobs.Host: The host has not yet started.
[2025-07-18T16:23:11.671Z] Host startup operation has been canceled
[2025-07-18T16:23:11.671Z] Initialization cancellation requested by runtime.
[2025-07-18T16:23:11.672Z] Stopping host...
[2025-07-18T16:23:11.672Z] Host shutdown completed.
[2025-07-18T16:23:11.673Z] Shutting down language worker channels for runtime:dotnet-isolated
[2025-07-18T16:23:16.576Z] Stopping the flush logs timer.
[2025-07-18T16:23:16.576Z] We couldn’t reach the Table service endpoint specified in the 'AzureWebJobsStorage' setting. We are unable to record diagnostic events, so the diagnostic logging service is being stopped. Please confirm network connectivity and endpoint accessibility.
[2025-07-18T16:23:16.576Z] Azure.Core: Connection refused (127.0.0.1:10002). System.Net.Http: Connection refused (127.0.0.1:10002). System.Net.Sockets: Connection refused.
For core tools at least, I would love to reduce this time but I can probably just inject options in the webhost to reduce the flush logs timer to a 1 second or something like that. Anywho not related to this PR
[2025-07-18T16:23:11.673Z] Shutting down language worker channels for runtime:dotnet-isolated
<this 5 second gap>
[2025-07-18T16:23:16.576Z] Stopping the flush logs timer.
[2025-07-18T16:23:16.576Z] We couldn’t reach the Table service endpoint specified in the 'AzureWebJobsStorage' setting. We are unable to record diagnostic events, so the diagnostic logging service is being stopped. Please confirm network connectivity and endpoint accessibility.
[2025-07-18T16:23:16.576Z] Azure.Core: Connection refused (127.0.0.1:10002). System.Net.Http: Connection refused (127.0.0.1:10002). System.Net.Sockets: Connection refused.