aspnetcore icon indicating copy to clipboard operation
aspnetcore copied to clipboard

Remove some `GCHandle`s from `SendHeaders()`

Open AaronRobinsonMSFT opened this issue 3 years ago • 3 comments
trafficstars

Description

This introduces a new IBufferWriter<T> implementation that manages unmanaged block allocations. The intent here was to remove the allocation of handles and temporary managed arrays allocations and and replace them with native memory.

Instead of using an unmanaged memory allocator like NativeMemory, the GC.AllocateUninitializedArray() could be used to create pinned memory. I've rejected this because it puts the onus on the GC to collect the memory whereas with the unmanaged memory it can be freed precisely when it is no longer needed. Since the pinned heap is presently considered only on a gen2 collection it leaves this unused memory around far longer than needed and creates unnecessary pressure.

AaronRobinsonMSFT avatar Oct 14 '22 23:10 AaronRobinsonMSFT

/benchmark list

AaronRobinsonMSFT avatar Oct 15 '22 04:10 AaronRobinsonMSFT

Crank Pull Request Bot

/benchmark <benchmarks[,...]> <profiles[,...]> <components,[...]>

Benchmarks:

  • plaintext: TechEmpower Plaintext Scenario - ASP.NET Platform implementation
  • json: TechEmpower JSON Scenario - ASP.NET Platform implementation
  • fortunes: TechEmpower Fortunes Scenario - ASP.NET Platform implementation
  • yarp: YARP - http-http with 10 bytes
  • mvcjsoninput2k: Sends 2Kb Json Body to an MVC controller

Profiles:

  • aspnet-perf-lin: INTEL/Linux 12 Cores
  • aspnet-perf-win: INTEL/Windows 12 Cores
  • aspnet-citrine-lin: INTEL/Linux 28 Cores
  • aspnet-citrine-win: INTEL/Windows 28 Cores
  • aspnet-citrine-arm: ARM/Linux 32 Cores
  • aspnet-citrine-amd: AMD/Linux 48 Cores

Components:

  • kestrel
  • mvc

pr-benchmarks[bot] avatar Oct 15 '22 04:10 pr-benchmarks[bot]

Http.sys isn't in this list. I recently did a perf change to http.sys and had to build locally and run a benchmark. We should add the JSON one.

cc @sebastienros @adityamandaleeka @Tratcher

davidfowl avatar Oct 15 '22 05:10 davidfowl

BTW is there an easy way to track alloc/free using counters? Would it even be useful? Or would you switch to native profilers at that point?

davidfowl avatar Oct 17 '22 02:10 davidfowl

You would switch to native profilers to profile unmanaged memory allocators.

jkotas avatar Oct 17 '22 02:10 jkotas

BTW is there an easy way to track alloc/free using counters? Would it even be useful? Or would you switch to native profilers at that point?

The VS profiler has an unmanaged memory allocation tracker. /cc @karpinsn

AaronRobinsonMSFT avatar Oct 17 '22 14:10 AaronRobinsonMSFT

Fixes/helps https://github.com/dotnet/aspnetcore/issues/40215?

BrennanConroy avatar Oct 17 '22 15:10 BrennanConroy

Fixes/helps #40215?

Oh wow. Yes! So glad there is an issue filed for this. I'll mark this PR as fixing that.

AaronRobinsonMSFT avatar Oct 17 '22 15:10 AaronRobinsonMSFT

I think this is a good first pass to measure performance. Some potential changes once we get through this:

  • Managed pointers
  • Stackalloc for known headers when under a certain size
  • Stackalloc for data chunks when under a certain size

davidfowl avatar Oct 17 '22 16:10 davidfowl

Benchmark for current changes.

| application                             | baseline                 | Current                  |         |
| --------------------------------------- | ------------------------ | ------------------------ | ------- |
| CPU Usage (%)                           |                       64 |                       67 |  +4.69% |
| Cores usage (%)                         |                    1,788 |                    1,880 |  +5.15% |
| Working Set (MB)                        |                      484 |                      486 |  +0.41% |
| Private Memory (MB)                     |                      514 |                      530 |  +3.11% |
| Build Time (ms)                         |                    6,666 |                    3,886 | -41.70% |
| Start Time (ms)                         |                      328 |                      344 |  +4.88% |
| Published Size (KB)                     |                  118,659 |                  118,659 |   0.00% |
| .NET Core SDK Version                   | 8.0.100-alpha.1.22514.14 | 8.0.100-alpha.1.22514.14 |         |
| Max CPU Usage (%)                       |                       67 |                       65 |  -2.51% |
| Max Working Set (MB)                    |                      501 |                      508 |  +1.40% |
| Max GC Heap Size (MB)                   |                      255 |                      257 |  +0.73% |
| Size of committed memory by the GC (MB) |                      452 |                      456 |  +0.77% |
| Max Number of Gen 0 GCs / sec           |                     8.00 |                     7.00 | -12.50% |
| Max Number of Gen 1 GCs / sec           |                     2.00 |                     2.00 |   0.00% |
| Max Number of Gen 2 GCs / sec           |                     1.00 |                     1.00 |   0.00% |
| Max Time in GC (%)                      |                     1.00 |                     1.00 |   0.00% |
| Max Gen 0 Size (B)                      |               36,651,360 |               41,094,992 | +12.12% |
| Max Gen 1 Size (B)                      |                6,313,496 |                8,741,000 | +38.45% |
| Max Gen 2 Size (B)                      |                1,511,712 |                1,463,744 |  -3.17% |
| Max LOH Size (B)                        |                  113,184 |                  113,184 |   0.00% |
| Max POH Size (B)                        |                1,773,536 |                  825,936 | -53.43% |
| Max Allocation Rate (B/sec)             |            1,687,780,712 |            1,760,190,512 |  +4.29% |
| Max GC Heap Fragmentation               |                       90 |                       91 |  +1.19% |
| # of Assemblies Loaded                  |                       88 |                       88 |   0.00% |
| Max Exceptions (#/s)                    |                        0 |                        0 |         |
| Max Lock Contention (#/s)               |                    1,078 |                    1,400 | +29.87% |
| Max ThreadPool Threads Count            |                       48 |                       48 |   0.00% |
| Max ThreadPool Queue Length             |                      178 |                      134 | -24.72% |
| Max ThreadPool Items (#/s)              |                1,319,545 |                1,130,717 | -14.31% |
| Max Active Timers                       |                        0 |                        0 |         |
| IL Jitted (B)                           |                  117,381 |                  156,942 | +33.70% |
| Methods Jitted                          |                    1,315 |                    2,083 | +58.40% |


| load                   | baseline  | Current   |         |
| ---------------------- | --------- | --------- | ------- |
| CPU Usage (%)          |        40 |        41 |  +2.50% |
| Cores usage (%)        |     1,108 |     1,149 |  +3.70% |
| Working Set (MB)       |        49 |        48 |  -2.04% |
| Private Memory (MB)    |       376 |       376 |   0.00% |
| Start Time (ms)        |         0 |         0 |         |
| First Request (ms)     |       215 |       243 | +13.02% |
| Requests/sec           |   401,248 |   423,907 |  +5.65% |
| Requests               | 6,058,808 | 6,400,799 |  +5.64% |
| Mean latency (ms)      |      0.69 |      0.62 | -10.17% |
| Max latency (ms)       |     24.05 |     22.32 |  -7.19% |
| Bad responses          |         0 |         0 |         |
| Socket errors          |         0 |         0 |         |
| Read throughput (MB/s) |     63.52 |     67.11 |  +5.65% |
| Latency 50th (ms)      |      0.58 |      0.56 |  -3.76% |
| Latency 75th (ms)      |      0.78 |      0.74 |  -4.77% |
| Latency 90th (ms)      |      0.99 |      0.96 |  -3.03% |
| Latency 99th (ms)      |      2.20 |      1.72 | -21.82% |

AaronRobinsonMSFT avatar Oct 17 '22 20:10 AaronRobinsonMSFT

/cc @davidfowl

Benchmarks against .NET 8

| application                             | baseline                 | Current                  |         |
| --------------------------------------- | ------------------------ | ------------------------ | ------- |
| CPU Usage (%)                           |                       66 |                       67 |  +1.52% |
| Cores usage (%)                         |                    1,852 |                    1,883 |  +1.67% |
| Working Set (MB)                        |                      479 |                      473 |  -1.25% |
| Private Memory (MB)                     |                      505 |                      499 |  -1.19% |
| Build Time (ms)                         |                    6,445 |                    4,194 | -34.93% |
| Start Time (ms)                         |                      334 |                      355 |  +6.29% |
| Published Size (KB)                     |                  115,430 |                  115,430 |   0.00% |
| .NET Core SDK Version                   | 8.0.100-alpha.1.22514.14 | 8.0.100-alpha.1.22514.14 |         |
| Max CPU Usage (%)                       |                       68 |                       68 |  +0.68% |
| Max Working Set (MB)                    |                      502 |                      494 |  -1.57% |
| Max GC Heap Size (MB)                   |                      312 |                      332 |  +6.19% |
| Size of committed memory by the GC (MB) |                      451 |                      450 |  -0.25% |
| Max Number of Gen 0 GCs / sec           |                     6.00 |                     6.00 |   0.00% |
| Max Number of Gen 1 GCs / sec           |                     1.00 |                     1.00 |   0.00% |
| Max Number of Gen 2 GCs / sec           |                     0.00 |                     0.00 |         |
| Max Time in GC (%)                      |                     1.00 |                     1.00 |   0.00% |
| Max Gen 0 Size (B)                      |               46,443,032 |               25,880,352 | -44.28% |
| Max Gen 1 Size (B)                      |                6,664,304 |                6,506,072 |  -2.37% |
| Max Gen 2 Size (B)                      |                1,357,992 |                1,348,240 |  -0.72% |
| Max LOH Size (B)                        |                  113,184 |                  113,184 |   0.00% |
| Max POH Size (B)                        |                  839,320 |                  822,840 |  -1.96% |
| Max Allocation Rate (B/sec)             |            1,858,940,728 |            1,778,665,328 |  -4.32% |
| Max GC Heap Fragmentation               |                       92 |                       91 |  -1.51% |
| # of Assemblies Loaded                  |                       88 |                       88 |   0.00% |
| Max Exceptions (#/s)                    |                        0 |                        0 |         |
| Max Lock Contention (#/s)               |                    1,390 |                    1,306 |  -6.04% |
| Max ThreadPool Threads Count            |                       48 |                       48 |   0.00% |
| Max ThreadPool Queue Length             |                      138 |                      133 |  -3.62% |
| Max ThreadPool Items (#/s)              |                1,190,411 |                1,173,617 |  -1.41% |
| Max Active Timers                       |                        0 |                        0 |         |
| IL Jitted (B)                           |                  116,702 |                  154,697 | +32.56% |
| Methods Jitted                          |                    1,316 |                    2,064 | +56.84% |


| load                   | baseline  | Current   |         |
| ---------------------- | --------- | --------- | ------- |
| CPU Usage (%)          |        42 |        42 |   0.00% |
| Cores usage (%)        |     1,186 |     1,180 |  -0.51% |
| Working Set (MB)       |        49 |        49 |   0.00% |
| Private Memory (MB)    |       376 |       376 |   0.00% |
| Start Time (ms)        |         0 |         0 |         |
| First Request (ms)     |       240 |       239 |  -0.42% |
| Requests/sec           |   431,565 |   433,502 |  +0.45% |
| Requests               | 6,516,506 | 6,545,631 |  +0.45% |
| Mean latency (ms)      |      0.75 |      0.66 | -11.59% |
| Max latency (ms)       |     23.46 |     32.78 | +39.73% |
| Bad responses          |         0 |         0 |         |
| Socket errors          |         0 |         0 |         |
| Read throughput (MB/s) |     68.32 |     68.63 |  +0.45% |
| Latency 50th (ms)      |      0.55 |      0.55 |  +0.37% |
| Latency 75th (ms)      |      0.73 |      0.73 |  -0.82% |
| Latency 90th (ms)      |      0.97 |      0.95 |  -2.06% |
| Latency 99th (ms)      |      8.44 |      2.81 | -66.71% |

AaronRobinsonMSFT avatar Oct 18 '22 18:10 AaronRobinsonMSFT

/cc @davidfowl

Benchmarks against .NET 8

| application                             | baseline                 | Current                  |         |
| --------------------------------------- | ------------------------ | ------------------------ | ------- |
| CPU Usage (%)                           |                       65 |                       72 | +10.77% |
| Cores usage (%)                         |                    1,816 |                    2,014 | +10.90% |
| Working Set (MB)                        |                      469 |                      472 |  +0.64% |
| Private Memory (MB)                     |                      495 |                      498 |  +0.61% |
| Build Time (ms)                         |                    3,746 |                    3,667 |  -2.11% |
| Start Time (ms)                         |                      382 |                      360 |  -5.76% |
| Published Size (KB)                     |                  115,434 |                  115,434 |   0.00% |
| .NET Core SDK Version                   | 8.0.100-alpha.1.22514.14 | 8.0.100-alpha.1.22514.14 |         |
| Max CPU Usage (%)                       |                       64 |                       75 | +16.34% |
| Max Working Set (MB)                    |                      489 |                      494 |  +1.01% |
| Max GC Heap Size (MB)                   |                      333 |                      334 |  +0.29% |
| Size of committed memory by the GC (MB) |                      441 |                      448 |  +1.60% |
| Max Number of Gen 0 GCs / sec           |                     6.00 |                     6.00 |   0.00% |
| Max Number of Gen 1 GCs / sec           |                     1.00 |                     1.00 |   0.00% |
| Max Number of Gen 2 GCs / sec           |                     0.00 |                     0.00 |         |
| Max Time in GC (%)                      |                     1.00 |                     0.00 |         |
| Max Gen 0 Size (B)                      |               46,476,784 |               33,895,144 | -27.07% |
| Max Gen 1 Size (B)                      |                6,111,888 |                6,507,688 |  +6.48% |
| Max Gen 2 Size (B)                      |                1,354,112 |                1,341,880 |  -0.90% |
| Max LOH Size (B)                        |                  113,184 |                  113,184 |   0.00% |
| Max POH Size (B)                        |                  835,200 |                  826,960 |  -0.99% |
| Max Allocation Rate (B/sec)             |            1,837,387,536 |            1,743,068,656 |  -5.13% |
| Max GC Heap Fragmentation               |                       94 |                       91 |  -3.98% |
| # of Assemblies Loaded                  |                       88 |                       88 |   0.00% |
| Max Exceptions (#/s)                    |                        0 |                        0 |         |
| Max Lock Contention (#/s)               |                    1,357 |                    1,285 |  -5.31% |
| Max ThreadPool Threads Count            |                       48 |                       48 |   0.00% |
| Max ThreadPool Queue Length             |                      137 |                      134 |  -2.19% |
| Max ThreadPool Items (#/s)              |                1,154,303 |                1,122,875 |  -2.72% |
| Max Active Timers                       |                        0 |                        0 |         |
| IL Jitted (B)                           |                  157,134 |                  154,667 |  -1.57% |
| Methods Jitted                          |                    2,082 |                    2,064 |  -0.86% |


| load                   | baseline  | Current   |         |
| ---------------------- | --------- | --------- | ------- |
| CPU Usage (%)          |        42 |        42 |   0.00% |
| Cores usage (%)        |     1,185 |     1,165 |  -1.69% |
| Working Set (MB)       |        49 |        49 |   0.00% |
| Private Memory (MB)    |       376 |       376 |   0.00% |
| Start Time (ms)        |         0 |         0 |         |
| First Request (ms)     |       235 |       237 |  +0.85% |
| Requests/sec           |   429,490 |   423,980 |  -1.28% |
| Requests               | 6,484,821 | 6,401,831 |  -1.28% |
| Mean latency (ms)      |      0.70 |      0.67 |  -4.55% |
| Max latency (ms)       |     22.15 |     37.66 | +70.02% |
| Bad responses          |         0 |         0 |         |
| Socket errors          |         0 |         0 |         |
| Read throughput (MB/s) |     67.99 |     67.12 |  -1.28% |
| Latency 50th (ms)      |      0.55 |      0.56 |  +1.64% |
| Latency 75th (ms)      |      0.73 |      0.75 |  +1.63% |
| Latency 90th (ms)      |      0.97 |      0.99 |  +2.06% |
| Latency 99th (ms)      |      5.50 |      2.83 | -48.55% |

AaronRobinsonMSFT avatar Oct 18 '22 20:10 AaronRobinsonMSFT

LGTM!

jkotas avatar Oct 18 '22 22:10 jkotas

@davidfowl @Tratcher @JamesNK Please take another look. The CI is all green.

AaronRobinsonMSFT avatar Oct 19 '22 01:10 AaronRobinsonMSFT

So, it's a bit slower or a wash and the GC numbers look about the same. 🤔

davidfowl avatar Oct 19 '22 03:10 davidfowl

Sharing offline comments I made based on collection of a nettrace for GCHandle usage.

I'm convinced on this change. The logging itself is a win. The number of GCHandle events is staggering. I did a quick experiment locally. Instead of using NativeMemory.Alloc and NativeMemory.Free, I replaced them with ArrayPool<byte>.Shared.Rent and a pinned GCHandle. The trace resulted in precisely the same experience before my larger changes. This is a win and the performance characteristics that we've improved aren't reflected in the crank base statistics. They are however reflected in nettrace.

The nettrace impact is prior to this change the nettrace was 1GB, but after this change it is approximately 40MB. The win here is reduction of the impact of GCHandle events.

AaronRobinsonMSFT avatar Oct 20 '22 17:10 AaronRobinsonMSFT

@sebastienros I want to assign this to you. Once we have benchmarks running on .NET 8 we should merge this.

davidfowl avatar Oct 20 '22 18:10 davidfowl

@davidfowl we are tracking net8.0 now (check the charts) so I assume that's what you wanted before we merge the PR.

sebastienros avatar Oct 24 '22 04:10 sebastienros

Yep

davidfowl avatar Oct 24 '22 05:10 davidfowl

The nettrace impact is prior to this change the nettrace was 1GB, but after this change it is approximately 40MB. The win here is reduction of the impact of GCHandle events.

Could you share what nettrace args are needed to show this? I'm thinking ahead to next year when we write a ASP.NET Core 8 perf improvements blog post and it's useful to have info on the PR so we don't need to figure out how to repro things a year later.

BrennanConroy avatar Oct 24 '22 18:10 BrennanConroy

@BrennanConroy This is easy to see by adding the following arguments to the crank command: --application.dotnetTrace true --application.dotnetTraceProviders gc-verbose. Running scenarios where a lot of GCHandles instance are created/destroyed will saturate the pipeline and result in dropped events or at best, very large nettrace files.

AaronRobinsonMSFT avatar Oct 24 '22 18:10 AaronRobinsonMSFT

Hi @AaronRobinsonMSFT. It looks like you just commented on a closed PR. The team will most probably miss it. If you'd like to bring something important up to their attention, consider filing a new issue and add enough details to build context.

ghost avatar Oct 24 '22 18:10 ghost