azure-functions-host
azure-functions-host copied to clipboard
remove RX as the primary pipe for gRPC communications
Issue describing the changes in this PR
resolves #8280; performance locally is a 5x+ speedup in throughput
the main crux of the new flow is described in src/WebJobs.Script.Grpc/Eventing/GrpcEventExtensions.cs, but in short: we establish a per-worker Channel<InboundGrpcEvent>
and Channel<OutboundGrpcEvent>
to handle the backlog; instead of publishing to RX, we publish to the writer of the relevant queue. Separately, we have an async worker deque data from the queues, and process accordingly. This is much more direct, avoids a lot of RX machinery, and creates isolation between workers.
Pull request checklist
- [ ] My changes do not require documentation changes
- [ ] Otherwise: Documentation issue linked to PR
- [ ] My changes should not be added to the release notes for the next release
- [ ] Otherwise: I've added my notes to
release_notes.md
- [ ] Otherwise: I've added my notes to
- [ ] My changes do not need to be backported to a previous version
- [ ] Otherwise: Backport tracked by issue/PR #issue_or_pr
- [ ] I have added all required tests (Unit tests, E2E tests)
Additional information
Additional PR information
@mgravell apologies for the delay on this. We'll get eyes on this ASAP. Would it be possible to have this rebased to deal with the conflicts?
Thanks!
Will do
On Fri, 29 Apr 2022, 22:21 Fabio Cavalcante, @.***> wrote:
@mgravell https://github.com/mgravell apologies for the delay on this. We'll get eyes on this ASAP. Would it be possible to have this rebased to deal with the conflicts?
Thanks!
— Reply to this email directly, view it on GitHub https://github.com/Azure/azure-functions-host/pull/8281#issuecomment-1113745563, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAEHMENKREBPYEY25NFYVTVHRHEHANCNFSM5SPTV72Q . You are receiving this because you were mentioned.Message ID: @.***>
Tagging @safihamid @paulbatum - fyi
@fabiocav / @brettsam - would be good to get test E2E on private stamp before merging this PR
@fabiocav I believe that's merged; it looks like some cheese moved around the topic of "function load collections", which made the tests merge a bit messy, but: should work
@mgravell anything we can do to help you get this PR merged?
@mgravell indeed, this is high priority on our side, so whatever we can do to help, please let us know.
I looked at this earlier on a call - it looks like the main pending item is to test regressions in cold start and not code tidy here as the blocker, or am I missing a thread where that's already in play? Happy to help on the code side.
I looked at this earlier on a call - it looks like the main pending item is to test regressions in cold start and not code tidy here as the blocker, or am I missing a thread where that's already in play? Happy to help on the code side.
I think it's a mix of both, would be great to get PR feedback addressed but not a huge blocker. @safihamid and @pragnagopa can you help with cold start testing?
I looked at this earlier on a call - it looks like the main pending item is to test regressions in cold start and not code tidy here as the blocker, or am I missing a thread where that's already in play? Happy to help on the code side.
I think it's a mix of both, would be great to get PR feedback addressed but not a huge blocker. @safihamid and @pragnagopa can you help with cold start testing?
Ok I will get cold start comparison numbers for this by end of next week. I hope most PR feedbacks are addressed by then.
Merge is in progress; fighting SendInvocationRequest_ValidateTraceContext
and SendInvocationRequest_ValidateTraceContext_SessionId
currently
No noticeable cold start changes with this PR in my private stamp as long as we are using the right language worker during placeholder mode. i.e. use Node placeholder for Node app, ...
I ran a few throughput tests in production using an App Service Plan running custom site extensions with and without the changes in this PR.
TL;DR - while the gains for single core (P1V2) are marginal, the gains for multicore setups (P3V2 is 4-core) are significant, can be as high as 50% 🚀 🥇
Configuration | Throughput (dev) | Throughput (remove-rx) |
---|---|---|
P1V2, 100 connections | 401 RPS | 416 RPS |
P3V2, 400 connections | 1042 RPS | 1581 RPS |
These tests were run with the following configuration:
- 30 second test runs
- workload is NET isolated anonymous HTTP triggered function that just returns a 200 response
- host.json is modified to reduce the amount of logging
- load is generated using bombadier e.g.
d:\tools\bombardier-windows-amd64.exe --insecure -c 400 -r 2400 -d 30s -m POST --format=pt -p i,r -l https://hostname.azurewebsites.net/api/function1
dev branch latency distribution -
Bombarding https://pbatum-rx-test-isolated.azurewebsites.net:443/api/function1 for 30s using 400 connection(s)
Statistics Avg Stdev Max
Reqs/sec 1042.60 200.06 1799.95
Latency 380.22ms 36.21ms 1.28s
Latency Distribution
50% 376.40ms
75% 388.20ms
90% 408.00ms
95% 429.02ms
99% 506.01ms
HTTP codes:
1xx - 0, 2xx - 31666, 3xx - 0, 4xx - 0, 5xx - 0
others - 0
Throughput: 585.32KB/s
no-rx latency distribution -
Bombarding https://pbatum-rx-test-isolated.azurewebsites.net:443/api/function1 for 30s using 400 connection(s)
Statistics Avg Stdev Max
Reqs/sec 1581.55 940.41 8996.40
Latency 248.55ms 53.08ms 0.85s
Latency Distribution
50% 236.14ms
75% 277.61ms
90% 325.00ms
95% 365.00ms
99% 474.09ms
HTTP codes:
1xx - 0, 2xx - 47739, 3xx - 0, 4xx - 0, 5xx - 0
others - 0
Throughput: 845.26KB/s
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Most failures here are due to more merges since the PR - 1 left I couldn't get through tonight:
[xUnit.net 00:03:48.32] Microsoft.Azure.WebJobs.Script.Tests.WorkerConcurrencyManagerEndToEndTests.WorkerStatus_NewWorkerAdded [FAIL]
Failed Microsoft.Azure.WebJobs.Script.Tests.WorkerConcurrencyManagerEndToEndTests.WorkerStatus_NewWorkerAdded [2 m]
Error Message:
System.ApplicationException : Condition not reached within timeout.
Stack Trace:
at Microsoft.Azure.WebJobs.Script.Tests.TestHelpers.Await(Func`1 condition, Int32 timeout, Int32 pollingInterval, Boolean throwWhenDebugging, Func`1 userMessageCallback) in /_/test/WebJobs.Script.Tests.Shared/TestHelpers.cs:line 115
at Microsoft.Azure.WebJobs.Script.Tests.WorkerConcurrencyManagerEndToEndTests.WorkerStatus_NewWorkerAdded() in /_/test/WebJobs.Script.Tests.Integration/Rpc/WorkerConcurrencyManagerEndToEndTests.cs:line 28
--- End of stack trace from previous location ---
fail: Host.General[515]
A host error has occurred during startup operation 'fe3a9a1d-e32b-4243-9d9f-2e2d3bcbbfe3'.
System.ArgumentNullException: Value cannot be null. (Parameter 'provider')
at Microsoft.Extensions.DependencyInjection.ServiceProviderServiceExtensions.GetService[T](IServiceProvider provider)
at Microsoft.Azure.WebJobs.Script.ScriptHostBuilderExtensions.<>c__DisplayClass6_0.<AddScriptHost>b__2(HostBuilderContext context, IConfigurationBuilder configBuilder) in /_/src/WebJobs.Script/ScriptHostBuilderExtensions.cs:line 114
at Microsoft.Extensions.Hosting.HostBuilder.BuildAppConfiguration()
at Microsoft.Extensions.Hosting.HostBuilder.Build()
at Microsoft.Azure.WebJobs.Script.WebHost.DefaultScriptHostBuilder.BuildHost(Boolean skipHostStartup, Boolean skipHostConfigurationParsing) in /_/src/WebJobs.Script.WebHost/DefaultScriptHostBuilder.cs:line 59
at Microsoft.Azure.WebJobs.Script.WebHost.WebJobsScriptHostService.BuildHost(Boolean skipHostStartup, Boolean skipHostJsonConfiguration) in /_/src/WebJobs.Script.WebHost/WebJobsScriptHostService.cs:line 581
at Microsoft.Azure.WebJobs.Script.WebHost.WebJobsScriptHostService.UnsynchronizedStartHostAsync(ScriptHostStartupOperation activeOperation, Int32 attemptCount, JobHostStartupMode startupMode) in /_/src/WebJobs.Script.WebHost/WebJobsScriptHostService.cs:line 276
I'm not setup for end-to-end on laptop so can't debug readily but something is amiss there with services - will poke more tomorrow.
Gentle reminder we should try and get this in before more conflicts pile up against the old script manager approach.
Let's get it in -- no objections from me. Although I do see conflicts again...
@brettsam merged once more from dev and addressed one test race (existing issue but hey, improvements!)
Thanks, @NickCraver.
Good to have this merged once conflicts are resolved.
@fabiocav I can't see merge conflicts local or PR UX, any idea what it's conflicting with?
@NickCraver odd, I don't see the usual list, just the conflict warning in GH.
If you can merge locally, go for it!
I beat the GitHub UI!