azure-functions-host remove RX as the primary pipe for gRPC communications

remove RX as the primary pipe for gRPC communications

Open mgravell opened this issue 2 years ago • 15 comments

Issue describing the changes in this PR

resolves #8280; performance locally is a 5x+ speedup in throughput

the main crux of the new flow is described in src/WebJobs.Script.Grpc/Eventing/GrpcEventExtensions.cs, but in short: we establish a per-worker Channel<InboundGrpcEvent> and Channel<OutboundGrpcEvent> to handle the backlog; instead of publishing to RX, we publish to the writer of the relevant queue. Separately, we have an async worker deque data from the queues, and process accordingly. This is much more direct, avoids a lot of RX machinery, and creates isolation between workers.

Pull request checklist

[ ] My changes do not require documentation changes
- [ ] Otherwise: Documentation issue linked to PR
[ ] My changes should not be added to the release notes for the next release
- [ ] Otherwise: I've added my notes to release_notes.md
[ ] My changes do not need to be backported to a previous version
- [ ] Otherwise: Backport tracked by issue/PR #issue_or_pr
[ ] I have added all required tests (Unit tests, E2E tests)

Additional information

Additional PR information

Apr 04 '22 13:04 mgravell

@mgravell apologies for the delay on this. We'll get eyes on this ASAP. Would it be possible to have this rebased to deal with the conflicts?

Thanks!

Apr 29 '22 21:04 fabiocav

Will do

On Fri, 29 Apr 2022, 22:21 Fabio Cavalcante, @.***> wrote:

@mgravell https://github.com/mgravell apologies for the delay on this. We'll get eyes on this ASAP. Would it be possible to have this rebased to deal with the conflicts?

Thanks!

— Reply to this email directly, view it on GitHub https://github.com/Azure/azure-functions-host/pull/8281#issuecomment-1113745563, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAEHMENKREBPYEY25NFYVTVHRHEHANCNFSM5SPTV72Q . You are receiving this because you were mentioned.Message ID: @.***>

Apr 30 '22 10:04 mgravell

Tagging @safihamid @paulbatum - fyi

@fabiocav / @brettsam - would be good to get test E2E on private stamp before merging this PR

Apr 30 '22 23:04 pragnagopa

@fabiocav I believe that's merged; it looks like some cheese moved around the topic of "function load collections", which made the tests merge a bit messy, but: should work

May 06 '22 13:05 mgravell

@mgravell anything we can do to help you get this PR merged?

Jun 15 '22 00:06 liliankasem

@mgravell indeed, this is high priority on our side, so whatever we can do to help, please let us know.

Jun 16 '22 20:06 fabiocav

I looked at this earlier on a call - it looks like the main pending item is to test regressions in cold start and not code tidy here as the blocker, or am I missing a thread where that's already in play? Happy to help on the code side.

Jun 16 '22 20:06 NickCraver

I looked at this earlier on a call - it looks like the main pending item is to test regressions in cold start and not code tidy here as the blocker, or am I missing a thread where that's already in play? Happy to help on the code side.

I think it's a mix of both, would be great to get PR feedback addressed but not a huge blocker. @safihamid and @pragnagopa can you help with cold start testing?

Jun 16 '22 20:06 liliankasem

I looked at this earlier on a call - it looks like the main pending item is to test regressions in cold start and not code tidy here as the blocker, or am I missing a thread where that's already in play? Happy to help on the code side.

I think it's a mix of both, would be great to get PR feedback addressed but not a huge blocker. @safihamid and @pragnagopa can you help with cold start testing?

Ok I will get cold start comparison numbers for this by end of next week. I hope most PR feedbacks are addressed by then.

Jun 16 '22 22:06 safihamid

Merge is in progress; fighting SendInvocationRequest_ValidateTraceContext and SendInvocationRequest_ValidateTraceContext_SessionId currently

Jun 21 '22 15:06 mgravell

No noticeable cold start changes with this PR in my private stamp as long as we are using the right language worker during placeholder mode. i.e. use Node placeholder for Node app, ...

Jun 22 '22 20:06 safihamid

I ran a few throughput tests in production using an App Service Plan running custom site extensions with and without the changes in this PR.

TL;DR - while the gains for single core (P1V2) are marginal, the gains for multicore setups (P3V2 is 4-core) are significant, can be as high as 50% 🚀 🥇

Configuration	Throughput (dev)	Throughput (remove-rx)
P1V2, 100 connections	401 RPS	416 RPS
P3V2, 400 connections	1042 RPS	1581 RPS

These tests were run with the following configuration:

30 second test runs
workload is NET isolated anonymous HTTP triggered function that just returns a 200 response
host.json is modified to reduce the amount of logging
load is generated using bombadier e.g. d:\tools\bombardier-windows-amd64.exe --insecure -c 400 -r 2400 -d 30s -m POST --format=pt -p i,r -l https://hostname.azurewebsites.net/api/function1

dev branch latency distribution -

Bombarding https://pbatum-rx-test-isolated.azurewebsites.net:443/api/function1 for 30s using 400 connection(s)
Statistics        Avg      Stdev        Max
  Reqs/sec      1042.60     200.06    1799.95
  Latency      380.22ms    36.21ms      1.28s
  Latency Distribution
     50%   376.40ms
     75%   388.20ms
     90%   408.00ms
     95%   429.02ms
     99%   506.01ms
  HTTP codes:
    1xx - 0, 2xx - 31666, 3xx - 0, 4xx - 0, 5xx - 0
    others - 0
  Throughput:   585.32KB/s

no-rx latency distribution -

Bombarding https://pbatum-rx-test-isolated.azurewebsites.net:443/api/function1 for 30s using 400 connection(s)
Statistics        Avg      Stdev        Max
  Reqs/sec      1581.55     940.41    8996.40
  Latency      248.55ms    53.08ms      0.85s
  Latency Distribution
     50%   236.14ms
     75%   277.61ms
     90%   325.00ms
     95%   365.00ms
     99%   474.09ms
  HTTP codes:
    1xx - 0, 2xx - 47739, 3xx - 0, 4xx - 0, 5xx - 0
    others - 0
  Throughput:   845.26KB/s

Jul 06 '22 01:07 paulbatum

/azp run

Jul 26 '22 22:07 NickCraver

Azure Pipelines successfully started running 1 pipeline(s).

Jul 26 '22 22:07 azure-pipelines[bot]

Most failures here are due to more merges since the PR - 1 left I couldn't get through tonight:

[xUnit.net 00:03:48.32]     Microsoft.Azure.WebJobs.Script.Tests.WorkerConcurrencyManagerEndToEndTests.WorkerStatus_NewWorkerAdded [FAIL]
  Failed Microsoft.Azure.WebJobs.Script.Tests.WorkerConcurrencyManagerEndToEndTests.WorkerStatus_NewWorkerAdded [2 m]
  Error Message:
   System.ApplicationException : Condition not reached within timeout.
  Stack Trace:
     at Microsoft.Azure.WebJobs.Script.Tests.TestHelpers.Await(Func`1 condition, Int32 timeout, Int32 pollingInterval, Boolean throwWhenDebugging, Func`1 userMessageCallback) in /_/test/WebJobs.Script.Tests.Shared/TestHelpers.cs:line 115
   at Microsoft.Azure.WebJobs.Script.Tests.WorkerConcurrencyManagerEndToEndTests.WorkerStatus_NewWorkerAdded() in /_/test/WebJobs.Script.Tests.Integration/Rpc/WorkerConcurrencyManagerEndToEndTests.cs:line 28
--- End of stack trace from previous location ---
fail: Host.General[515]
      A host error has occurred during startup operation 'fe3a9a1d-e32b-4243-9d9f-2e2d3bcbbfe3'.
      System.ArgumentNullException: Value cannot be null. (Parameter 'provider')
         at Microsoft.Extensions.DependencyInjection.ServiceProviderServiceExtensions.GetService[T](IServiceProvider provider)
         at Microsoft.Azure.WebJobs.Script.ScriptHostBuilderExtensions.<>c__DisplayClass6_0.<AddScriptHost>b__2(HostBuilderContext context, IConfigurationBuilder configBuilder) in /_/src/WebJobs.Script/ScriptHostBuilderExtensions.cs:line 114
         at Microsoft.Extensions.Hosting.HostBuilder.BuildAppConfiguration()
         at Microsoft.Extensions.Hosting.HostBuilder.Build()
         at Microsoft.Azure.WebJobs.Script.WebHost.DefaultScriptHostBuilder.BuildHost(Boolean skipHostStartup, Boolean skipHostConfigurationParsing) in /_/src/WebJobs.Script.WebHost/DefaultScriptHostBuilder.cs:line 59
         at Microsoft.Azure.WebJobs.Script.WebHost.WebJobsScriptHostService.BuildHost(Boolean skipHostStartup, Boolean skipHostJsonConfiguration) in /_/src/WebJobs.Script.WebHost/WebJobsScriptHostService.cs:line 581
         at Microsoft.Azure.WebJobs.Script.WebHost.WebJobsScriptHostService.UnsynchronizedStartHostAsync(ScriptHostStartupOperation activeOperation, Int32 attemptCount, JobHostStartupMode startupMode) in /_/src/WebJobs.Script.WebHost/WebJobsScriptHostService.cs:line 276

I'm not setup for end-to-end on laptop so can't debug readily but something is amiss there with services - will poke more tomorrow.

Jul 27 '22 02:07 NickCraver

Gentle reminder we should try and get this in before more conflicts pile up against the old script manager approach.

Sep 02 '22 11:09 NickCraver

Let's get it in -- no objections from me. Although I do see conflicts again...

Sep 02 '22 15:09 brettsam

@brettsam merged once more from dev and addressed one test race (existing issue but hey, improvements!)

Sep 02 '22 15:09 NickCraver

Thanks, @NickCraver.

Good to have this merged once conflicts are resolved.

Sep 02 '22 18:09 fabiocav

@fabiocav I can't see merge conflicts local or PR UX, any idea what it's conflicting with?

Sep 02 '22 18:09 NickCraver

@NickCraver odd, I don't see the usual list, just the conflict warning in GH.

If you can merge locally, go for it!

Sep 02 '22 19:09 fabiocav

I beat the GitHub UI!

Sep 02 '22 19:09 NickCraver

azure-functions-host azure-functions-host copied to clipboard

remove RX as the primary pipe for gRPC communications

Issue describing the changes in this PR

Pull request checklist

Additional information

azure-functions-host
azure-functions-host copied to clipboard