
Reimplement the default object pool for improved performance

geeknoid opened this pull request Nov 23 '22


Rewrite the default pool so that it scales better.

Description

The existing object pool performs very poorly at scale; this new one is considerably better. A benchmark is attached, and here are the results:

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.22621
Intel Core i7-9700K CPU 3.60GHz (Coffee Lake), 1 CPU, 8 logical and 8 physical cores
.NET SDK=7.0.100
  [Host] : .NET 7.0.0 (7.0.22.51805), X64 RyuJIT

Job=MediumRun  Toolchain=InProcessEmitToolchain  IterationCount=15
LaunchCount=2  WarmupCount=10

|              Method | MaxRetained | OverflowCount |             Mean |             Error |         StdDev |   Gen 0 |  Gen 1 | Allocated |
|-------------------- |------------ |-------------- |-----------------:|------------------:|---------------:|--------:|-------:|----------:|
|         DefaultPool |           2 |             0 |         44.68 ns |          1.764 ns |       0.097 ns |  0.0114 |      - |      72 B |
|          ScaledPool |           2 |             0 |         68.11 ns |         18.306 ns |       1.003 ns |  0.0114 |      - |      72 B |
| DefaultPoolThreaded |           2 |             0 |      1,168.21 ns |        352.497 ns |      19.322 ns |  0.0877 |      - |     552 B |
|  ScaledPoolThreaded |           2 |             0 |      1,264.33 ns |         97.744 ns |       5.358 ns |  0.0877 |      - |     552 B |
|         DefaultPool |           2 |             8 |        248.33 ns |         10.358 ns |       0.568 ns |  0.0520 |      - |     328 B |
|          ScaledPool |           2 |             8 |        320.39 ns |         26.485 ns |       1.452 ns |  0.0520 |      - |     328 B |
| DefaultPoolThreaded |           2 |             8 |      1,542.80 ns |         46.447 ns |       2.546 ns |  0.1640 |      - |   1,026 B |
|  ScaledPoolThreaded |           2 |             8 |      1,975.74 ns |        462.439 ns |      25.348 ns |  0.1602 |      - |   1,017 B |
|         DefaultPool |           2 |            16 |        483.25 ns |        151.602 ns |       8.310 ns |  0.0925 |      - |     584 B |
|          ScaledPool |           2 |            16 |        531.66 ns |         43.929 ns |       2.408 ns |  0.0925 |      - |     584 B |
| DefaultPoolThreaded |           2 |            16 |      2,214.69 ns |        671.443 ns |      36.804 ns |  0.2251 |      - |   1,421 B |
|  ScaledPoolThreaded |           2 |            16 |      2,858.96 ns |        468.512 ns |      25.681 ns |  0.2327 |      - |   1,460 B |
|         DefaultPool |           8 |             0 |        274.24 ns |        130.700 ns |       7.164 ns |  0.0191 |      - |     120 B |
|          ScaledPool |           8 |             0 |        247.56 ns |         34.002 ns |       1.864 ns |  0.0191 |      - |     120 B |
| DefaultPoolThreaded |           8 |             0 |      1,523.70 ns |         46.087 ns |       2.526 ns |  0.1030 |      - |     649 B |
|  ScaledPoolThreaded |           8 |             0 |      1,783.37 ns |        142.507 ns |       7.811 ns |  0.1049 |      - |     662 B |
|         DefaultPool |           8 |             8 |        748.95 ns |         54.202 ns |       2.971 ns |  0.0591 |      - |     376 B |
|          ScaledPool |           8 |             8 |        507.71 ns |         45.306 ns |       2.483 ns |  0.0591 |      - |     376 B |
| DefaultPoolThreaded |           8 |             8 |      2,802.51 ns |        781.559 ns |      42.840 ns |  0.1640 |      - |   1,030 B |
|  ScaledPoolThreaded |           8 |             8 |      3,191.83 ns |        269.748 ns |      14.786 ns |  0.1717 |      - |   1,089 B |
|         DefaultPool |           8 |            16 |      1,211.42 ns |         64.968 ns |       3.561 ns |  0.0992 |      - |     632 B |
|          ScaledPool |           8 |            16 |        794.87 ns |        188.223 ns |      10.317 ns |  0.1001 |      - |     632 B |
| DefaultPoolThreaded |           8 |            16 |      4,380.27 ns |        616.885 ns |      33.814 ns |  0.2365 |      - |   1,494 B |
|  ScaledPoolThreaded |           8 |            16 |      4,191.86 ns |        807.913 ns |      44.284 ns |  0.2441 |      - |   1,561 B |
|         DefaultPool |          16 |             0 |        940.04 ns |         84.770 ns |       4.647 ns |  0.0286 |      - |     184 B |
|          ScaledPool |          16 |             0 |        491.76 ns |        104.593 ns |       5.733 ns |  0.0286 |      - |     184 B |
| DefaultPoolThreaded |          16 |             0 |      2,936.22 ns |        219.266 ns |      12.019 ns |  0.1259 |      - |     804 B |
|  ScaledPoolThreaded |          16 |             0 |      3,318.14 ns |        300.985 ns |      16.498 ns |  0.1411 |      - |     885 B |
|         DefaultPool |          16 |             8 |      1,797.72 ns |        105.442 ns |       5.780 ns |  0.0687 |      - |     440 B |
|          ScaledPool |          16 |             8 |        737.92 ns |         52.615 ns |       2.884 ns |  0.0696 |      - |     440 B |
| DefaultPoolThreaded |          16 |             8 |      5,991.02 ns |     10,513.578 ns |     576.285 ns |  0.2060 |      - |   1,282 B |
|  ScaledPoolThreaded |          16 |             8 |      4,697.17 ns |      1,173.322 ns |      64.314 ns |  0.2136 |      - |   1,355 B |
|         DefaultPool |          16 |            16 |      2,712.82 ns |         72.291 ns |       3.963 ns |  0.1106 |      - |     696 B |
|          ScaledPool |          16 |            16 |        967.34 ns |         33.708 ns |       1.848 ns |  0.1106 |      - |     696 B |
| DefaultPoolThreaded |          16 |            16 |      6,359.90 ns |      2,140.569 ns |     117.332 ns |  0.2441 |      - |   1,529 B |
|  ScaledPoolThreaded |          16 |            16 |      5,489.48 ns |        996.884 ns |      54.643 ns |  0.2975 |      - |   1,854 B |
|         DefaultPool |          32 |             0 |      3,688.01 ns |        340.371 ns |      18.657 ns |  0.0496 |      - |     312 B |
|          ScaledPool |          32 |             0 |        959.45 ns |         96.638 ns |       5.297 ns |  0.0496 |      - |     312 B |
| DefaultPoolThreaded |          32 |             0 |      7,078.16 ns |      1,508.652 ns |      82.694 ns |  0.1602 |      - |   1,042 B |
|  ScaledPoolThreaded |          32 |             0 |      6,605.99 ns |        736.397 ns |      40.364 ns |  0.2289 |      - |   1,457 B |
|         DefaultPool |          32 |             8 |      5,329.30 ns |        530.654 ns |      29.087 ns |  0.0839 |      - |     568 B |
|          ScaledPool |          32 |             8 |      1,219.61 ns |        343.039 ns |      18.803 ns |  0.0896 |      - |     568 B |
| DefaultPoolThreaded |          32 |             8 |     15,415.34 ns |      3,619.665 ns |     198.406 ns |  0.2136 |      - |   1,442 B |
|  ScaledPoolThreaded |          32 |             8 |      9,148.38 ns |      2,839.667 ns |     155.652 ns |  0.2899 |      - |   1,828 B |
|         DefaultPool |          32 |            16 |      7,009.57 ns |        730.620 ns |      40.048 ns |  0.1297 |      - |     824 B |
|          ScaledPool |          32 |            16 |      1,460.06 ns |        407.207 ns |      22.320 ns |  0.1297 |      - |     824 B |
| DefaultPoolThreaded |          32 |            16 |     20,256.24 ns |        235.299 ns |      12.898 ns |  0.3052 |      - |   1,934 B |
|  ScaledPoolThreaded |          32 |            16 |     11,299.54 ns |        132.306 ns |       7.252 ns |  0.3662 |      - |   2,289 B |
|         DefaultPool |         256 |             0 |    211,333.36 ns |      3,244.642 ns |     177.850 ns |  0.2441 |      - |   2,104 B |
|          ScaledPool |         256 |             0 |      7,690.17 ns |      1,184.234 ns |      64.912 ns |  0.3204 |      - |   2,104 B |
| DefaultPoolThreaded |         256 |             0 |    313,654.62 ns |      9,027.318 ns |     494.818 ns |  0.4883 |      - |   4,691 B |
|  ScaledPoolThreaded |         256 |             0 |     54,328.24 ns |      2,422.108 ns |     132.764 ns |  1.4038 | 0.0610 |   8,643 B |
|         DefaultPool |         256 |             8 |    228,629.05 ns |     23,122.629 ns |   1,267.429 ns |  0.2441 |      - |   2,360 B |
|          ScaledPool |         256 |             8 |      8,030.68 ns |      1,678.930 ns |      92.028 ns |  0.3662 |      - |   2,360 B |
| DefaultPoolThreaded |         256 |             8 |    328,962.43 ns |     54,308.771 ns |   2,976.847 ns |  0.4883 |      - |   5,000 B |
|  ScaledPoolThreaded |         256 |             8 |     56,798.18 ns |     10,027.466 ns |     549.639 ns |  1.4648 | 0.0610 |   9,089 B |
|         DefaultPool |         256 |            16 |    238,517.85 ns |     15,151.359 ns |     830.497 ns |  0.2441 |      - |   2,616 B |
|          ScaledPool |         256 |            16 |      8,239.36 ns |        592.850 ns |      32.496 ns |  0.4120 |      - |   2,616 B |
| DefaultPoolThreaded |         256 |            16 |    343,755.09 ns |     33,664.928 ns |   1,845.288 ns |  0.4883 |      - |   5,345 B |
|  ScaledPoolThreaded |         256 |            16 |     59,618.83 ns |      3,281.243 ns |     179.856 ns |  1.5869 | 0.0610 |   9,427 B |
|         DefaultPool |        1024 |             0 |  3,398,009.77 ns |    238,064.426 ns |  13,049.114 ns |       - |      - |   8,250 B |
|          ScaledPool |        1024 |             0 |     31,462.74 ns |      6,005.615 ns |     329.188 ns |  1.2817 |      - |   8,248 B |
| DefaultPoolThreaded |        1024 |             0 |  8,614,992.19 ns |  8,226,028.008 ns | 450,896.353 ns |       - |      - |  38,970 B |
|  ScaledPoolThreaded |        1024 |             0 |    199,449.64 ns |    162,597.330 ns |   8,912.508 ns |  6.5918 | 0.9766 |  35,416 B |
|         DefaultPool |        1024 |             8 |  3,382,532.81 ns |    274,655.386 ns |  15,054.788 ns |       - |      - |   8,507 B |
|          ScaledPool |        1024 |             8 |     31,712.71 ns |      6,495.328 ns |     356.031 ns |  1.3428 |      - |   8,504 B |
| DefaultPoolThreaded |        1024 |             8 |  9,150,567.71 ns |  2,476,182.558 ns | 135,727.922 ns |       - |      - |  39,117 B |
|  ScaledPoolThreaded |        1024 |             8 |    221,463.66 ns |     48,602.706 ns |   2,664.078 ns |  6.8359 | 0.9766 |  37,705 B |
|         DefaultPool |        1024 |            16 |  3,479,460.29 ns |    249,832.734 ns |  13,694.175 ns |       - |      - |   8,763 B |
|          ScaledPool |        1024 |            16 |     31,992.68 ns |     11,835.673 ns |     648.753 ns |  1.3428 |      - |   8,760 B |
| DefaultPoolThreaded |        1024 |            16 |  9,574,946.35 ns |  2,604,403.907 ns | 142,756.166 ns |       - |      - |  39,992 B |
|  ScaledPoolThreaded |        1024 |            16 |    206,739.87 ns |     38,262.191 ns |   2,097.280 ns |  6.8359 | 0.9766 |  38,010 B |
|         DefaultPool |        2048 |             0 | 13,320,174.48 ns |  1,762,278.720 ns |  96,596.443 ns |       - |      - |  16,450 B |
|          ScaledPool |        2048 |             0 |     62,331.40 ns |     12,716.250 ns |     697.021 ns |  2.5635 | 0.1221 |  16,440 B |
| DefaultPoolThreaded |        2048 |             0 | 30,269,091.67 ns |    693,908.844 ns |  38,035.485 ns |       - |      - |  80,080 B |
|  ScaledPoolThreaded |        2048 |             0 |    418,387.11 ns |    394,976.144 ns |  21,649.975 ns | 12.2070 | 2.9297 |  72,913 B |
|         DefaultPool |        2048 |             8 | 13,402,975.00 ns |  3,371,275.902 ns | 184,791.008 ns |       - |      - |  16,709 B |
|          ScaledPool |        2048 |             8 |     62,205.85 ns |      3,719.147 ns |     203.859 ns |  2.5635 | 0.1221 |  16,697 B |
| DefaultPoolThreaded |        2048 |             8 | 31,661,502.08 ns |  7,142,611.831 ns | 391,510.656 ns |       - |      - |  80,196 B |
|  ScaledPoolThreaded |        2048 |             8 |    398,977.90 ns |    253,608.316 ns |  13,901.128 ns | 10.2539 | 2.4414 |  60,371 B |
|         DefaultPool |        2048 |            16 | 13,505,204.17 ns |  1,375,577.311 ns |  75,400.034 ns |       - |      - |  16,965 B |
|          ScaledPool |        2048 |            16 |     62,527.04 ns |      4,092.726 ns |     224.336 ns |  2.6855 | 0.1221 |  16,953 B |
| DefaultPoolThreaded |        2048 |            16 | 30,531,625.00 ns | 12,412,926.068 ns | 680,394.363 ns |       - |      - |  80,959 B |
|  ScaledPoolThreaded |        2048 |            16 |    403,439.65 ns |    131,650.477 ns |   7,216.207 ns | 11.2305 | 2.4414 |  66,841 B |
[Pools.zip](https://github.com/dotnet/aspnetcore/files/10078581/Pools.zip)

geeknoid commented Nov 23 '22

Thanks for your PR, @geeknoid. Someone from the team will get assigned to your PR shortly and we'll get it reviewed.

ghost commented Nov 23 '22

/benchmark plaintext aspnet-citrine-win kestrel

BrennanConroy commented Nov 23 '22

Benchmark started for plaintext on aspnet-citrine-win with kestrel. Logs: link

pr-benchmarks[bot] commented Nov 23 '22

Oh wait, this is ObjectPool; that wouldn't affect Kestrel-specific benchmarks.

BrennanConroy commented Nov 23 '22

Benchmark code from the zip:

// © Microsoft Corporation. All rights reserved.

using System.Collections.Generic;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;
using Microsoft.Extensions.ObjectPool;

namespace Microsoft.R9.Extensions.Pools.Bench;

#pragma warning disable R9A038 // Use 'Microsoft.R9.Extensions.Pools.ScaledObjectPool' instead of 'Microsoft.Extensions.ObjectPool.DefaultObjectPool' for improved performance
#pragma warning disable S109 // Magic numbers should not be used
#pragma warning disable R9A041 // Use concrete types when possible for improved performance

[MemoryDiagnoser]
public class Pools
{
    private ObjectPool<Foo> _defaultPool = null!;
    private ObjectPool<Foo> _scaledPool = null!;

    [Params(2, 8, 16, 32, 256, 1024, 2048)]
    public int MaxRetained { get; set; }

    [Params(0, 8, 16)]
    public int OverflowCount { get; set; }

    [GlobalSetup]
    public void GlobalSetup()
    {
        _defaultPool = new DefaultObjectPool<Foo>(new DefaultPooledObjectPolicy<Foo>(), MaxRetained);
        _scaledPool = PoolFactory.CreatePool<Foo>(new DefaultPooledObjectPolicy<Foo>(), MaxRetained);

        // Pre-fill both pools so each benchmark starts from a fully populated pool.
        for (int i = 0; i < MaxRetained; i++)
        {
            _defaultPool.Return(new Foo());
            _scaledPool.Return(new Foo());
        }
    }

    [Benchmark]
    public void DefaultPool()
    {
        UsePool(_defaultPool);
    }

    [Benchmark]
    public void ScaledPool()
    {
        UsePool(_scaledPool);
    }

    [Benchmark]
    public void DefaultPoolThreaded()
    {
        // Two tasks hammer the same pool concurrently to add cross-thread contention.
        var t0 = Task.Run(() => UsePool(_defaultPool));
        var t1 = Task.Run(() => UsePool(_defaultPool));

#pragma warning disable VSTHRD002 // Avoid problematic synchronous waits
        Task.WaitAll(t0, t1);
#pragma warning restore VSTHRD002 // Avoid problematic synchronous waits
    }

    [Benchmark]
    public void ScaledPoolThreaded()
    {
        var t0 = Task.Run(() => UsePool(_scaledPool));
        var t1 = Task.Run(() => UsePool(_scaledPool));

#pragma warning disable VSTHRD002 // Avoid problematic synchronous waits
        Task.WaitAll(t0, t1);
#pragma warning restore VSTHRD002 // Avoid problematic synchronous waits
    }

    private void UsePool(ObjectPool<Foo> pool)
    {
        var list = new List<Foo>(MaxRetained + OverflowCount);

        // Rent MaxRetained items plus OverflowCount extra Gets that miss the pool,
        // then return everything, exercising both the reuse and the allocation paths.
        for (int i = 0; i < MaxRetained + OverflowCount; i++)
        {
            var foo = pool.Get();
            list.Add(foo);
        }

        foreach (var f in list)
        {
            pool.Return(f);
        }
    }

    private sealed class Foo
    {
    }
}

adityamandaleeka commented Nov 29 '22

@sebastienros PTAL

adityamandaleeka commented Nov 29 '22

It would be good to check the microbenchmark into a perf/Microbenchmarks directory like the middleware one here as part of this PR. That will make it easier to comment on too.

I think it would probably make sense to test with more threads concurrently getting and then immediately returning pooled objects. As it stands, I don't think this test is creating as much contention as a high-scale web application with objects being both added and removed simultaneously. I think in this more realistic scenario, using the most-recently returned item (i.e. _firstItem in the old code) is a lot more important for maintaining cache locality. I also wonder if the list allocations are having an undue impact on the result.
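
For illustration, a higher-contention variant could have several tasks each renting a single object and returning it immediately, so gets and returns interleave across threads. This is only a sketch reusing the fields from the benchmark class above, and the thread and iteration counts are arbitrary:

    [Benchmark]
    public void DefaultPoolGetReturnContended()
    {
        const int ThreadCount = 8;          // arbitrary; roughly one task per core
        const int IterationsPerThread = 1_000;

        var tasks = new Task[ThreadCount];
        for (int t = 0; t < ThreadCount; t++)
        {
            tasks[t] = Task.Run(() =>
            {
                for (int i = 0; i < IterationsPerThread; i++)
                {
                    var foo = _defaultPool.Get();   // or _scaledPool for the comparison run
                    _defaultPool.Return(foo);       // return immediately: a short-lived rental
                }
            });
        }

        Task.WaitAll(tasks);
    }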

We should figure out what scale we want to focus on. DefaultObjectPoolProvider.MaximumRetained currently defaults to Environment.ProcessorCount * 2. I think a lot of VMs are running on 8 or fewer vCPUs, which would mean retaining 16 or fewer objects in the pool. Maybe @sebastienros has a better idea of what VM sizes are most popular for ASP.NET Core apps on Azure though. Of course, even if we focus on fewer retained objects, I don't think that means we should accept orders-of-magnitude worse performance with more retained objects. We might have to look at switching strategies based on the retained size.
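
For reference, here is roughly what that looks like from the consuming side, and how an app can raise the limit; the pooled type is hypothetical:

using Microsoft.Extensions.ObjectPool;

// Hypothetical pooled type, for illustration only.
public sealed class PooledBuffer
{
    public byte[] Data { get; } = new byte[4096];
}

public static class PoolSetup
{
    public static ObjectPool<PooledBuffer> Create()
    {
        // MaximumRetained defaults to Environment.ProcessorCount * 2 (16 on an 8-vCPU VM);
        // an app that wants to absorb spiky traffic can raise it explicitly.
        var provider = new DefaultObjectPoolProvider { MaximumRetained = 1024 };
        return provider.Create<PooledBuffer>();
    }
}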

We've discussed moving a similar type to the runtime (see https://github.com/dotnet/runtime/issues/49680), but I don't think that should stop us from improving the existing type.

halter73 commented Nov 29 '22

@halter73 We usually run our pools with 1024 or 2048 entries so that we can recover from spiky traffic efficiently (imagine traffic spiking to 1K RPS and sometimes dropping to under 100 RPS). In our environment, the pool size isn't bound to the number of processors in the machine; it's bound to the number of concurrent operations that may need one of the pooled resources. That is, we use these pools not so much to avoid repeated allocations of the same object while processing a single request, but more to avoid allocations across many concurrent incoming requests.

Using a stack would likely be a net improvement over using a bag for this case, but alas ConcurrentStack allocates on every push and is thus unsuitable here. The stack mainly helps when gets and returns happen in quick succession; it shouldn't matter nearly as much for the broader cross-request cases described above.
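
The node-per-push allocation is easy to observe on modern .NET with a quick sketch like this (exact byte counts vary by runtime; on .NET Framework the queue would show segment allocations as well):

using System;
using System.Collections.Concurrent;

// ConcurrentStack<T>.Push allocates a node per call; ConcurrentQueue<T> reuses its
// ring-buffer segments on modern .NET, so steady get/return traffic typically adds nothing.
var stack = new ConcurrentStack<object>();
var queue = new ConcurrentQueue<object>();
var item = new object();

long before = GC.GetAllocatedBytesForCurrentThread();
for (int i = 0; i < 1_000; i++) { stack.Push(item); stack.TryPop(out _); }
Console.WriteLine($"ConcurrentStack: {GC.GetAllocatedBytesForCurrentThread() - before} bytes");

before = GC.GetAllocatedBytesForCurrentThread();
for (int i = 0; i < 1_000; i++) { queue.Enqueue(item); queue.TryDequeue(out _); }
Console.WriteLine($"ConcurrentQueue: {GC.GetAllocatedBytesForCurrentThread() - before} bytes");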

Let me run the test with more contention, although I suspect the results will be very similar. The existing pool implementation is not designed for anything but very small retained sets.

geeknoid commented Nov 29 '22

What changed between this run and the previous one? The original numbers showed worse perf at lower scale, e.g.

|         DefaultPool |           2 |             0 |         44.68 ns |
|          ScaledPool |           2 |             0 |         68.11 ns |

BrennanConroy commented Nov 30 '22

@BrennanConroy Seems like my benchmark wasn't right. I'm updating it now and will report back once done.

geeknoid commented Nov 30 '22

I've updated the implementation a bit and have a brand new more detailed benchmark. I'm attaching the benchmark as a zip file, and the README file in there shows the benchmark results.

A key observation is that there tend to be two modes of use for pooled objects: short-lived rentals vs. long-lived rentals. Short-lived is something like "I need a StringBuilder and will return it very quickly", as opposed to "I need a FooBar for the duration of this request's lifetime". To satisfy the first case, you only need a few items in the pool. To satisfy the second case, you need as many items as the potential variation in the number of concurrent requests being processed.
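
To make the two modes concrete, here is a small sketch; the helper names are made up, and the pool is assumed to use a policy that resets the StringBuilder on return:

using System;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Extensions.ObjectPool;

public static class RentalPatterns
{
    // Short-lived rental: rented and returned within one call, so only a handful of
    // instances are ever out of the pool at the same time.
    public static string FormatKey(ObjectPool<StringBuilder> pool, int id, string name)
    {
        var sb = pool.Get();
        try
        {
            return sb.Append(id).Append(':').Append(name).ToString();
        }
        finally
        {
            pool.Return(sb); // assumes a StringBuilderPooledObjectPolicy-style reset on return
        }
    }

    // Long-lived rental: held for the duration of an operation, so the pool needs roughly
    // as many items as there are concurrent operations, not as many as there are cores.
    public static async Task HandleRequestAsync(ObjectPool<StringBuilder> pool, Func<StringBuilder, Task> work)
    {
        var scratch = pool.Get();
        try
        {
            await work(scratch);
        }
        finally
        {
            pool.Return(scratch);
        }
    }
}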

Anyway, the implementation update deals with this duality and delivers better perf all around than the existing implementation.

Pools.Bench.zip
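
As a rough sketch of the shape such a design can take (this is not the PR's actual code), a pool can pair a single "fast" slot for the rent-and-return-immediately case with a bounded shared queue for the longer-lived, cross-request case:

using System.Collections.Concurrent;
using System.Threading;
using Microsoft.Extensions.ObjectPool;

public sealed class TieredPool<T> : ObjectPool<T> where T : class, new()
{
    private readonly ConcurrentQueue<T> _items = new();
    private readonly int _maxRetained;
    private T? _fastItem;
    private int _count;

    public TieredPool(int maxRetained) => _maxRetained = maxRetained;

    public override T Get()
    {
        // Hot path: grab the most recently returned item if nobody else beat us to it.
        var item = _fastItem;
        if (item != null && Interlocked.CompareExchange(ref _fastItem, null, item) == item)
        {
            return item;
        }

        if (_items.TryDequeue(out item))
        {
            Interlocked.Decrement(ref _count);
            return item;
        }

        return new T();
    }

    public override void Return(T obj)
    {
        // Hot path: refill the fast slot first so a quick Get sees a cache-warm object.
        if (_fastItem == null && Interlocked.CompareExchange(ref _fastItem, obj, null) == null)
        {
            return;
        }

        // Otherwise keep it in the shared queue, up to the retained limit.
        if (Interlocked.Increment(ref _count) <= _maxRetained)
        {
            _items.Enqueue(obj);
        }
        else
        {
            Interlocked.Decrement(ref _count);
        }
    }
}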

geeknoid commented Dec 10 '22

Thanks for re-optimizing the return-it-very-quickly case. Given the microbenchmark results seem to show at most a small regression in some already-fast cases and much better performance in "scaled" cases, I'm personally leaning towards taking this.

Can you rename Pools.Bench to Microsoft.Extensions.ObjectPool.Microbenchmarks, put the project in src/ObjectPool/perf/Microbenchmarks, and define IS_BENCHMARKS in the csproj? That would make it easier to review. Feel free to keep the readme in there with your results. You can use one of the other "Microbenchmarks" projects, like Kestrel's, as an example.

halter73 commented Dec 12 '22

@halter73 I added the benchmark code as requested. I had to modify it to remove the comparison with the old version (since the code base only contains the new version at this point). But I left the results intact in the readme with an explanation.

geeknoid commented Dec 13 '22

Merge away!

davidfowl commented Dec 15 '22

@geeknoid, this change will be considered for inclusion in the blog post for the release it'll ship in. Nice work!

Please ensure that the original comment in this thread contains a clear explanation of what the change does, why it's important (what problem does it solve?), and, if relevant, include things like code samples and/or performance numbers.

This content may not be exactly what goes into the blog post, but it will help the team putting together the announcement.

Thanks!

ghost commented Dec 15 '22

Do we / how much do we care about performance of the pool on .NET Framework? If we don't care, great. If we do care, I think there's a good chance this will have regressed performance there, since ConcurrentQueue<T> on .NET Framework likes to allocate.

stephentoub commented Jan 04 '23

Hi @stephentoub. It looks like you just commented on a closed PR. The team will most probably miss it. If you'd like to bring something important up to their attention, consider filing a new issue and add enough details to build context.

ghost commented Jan 04 '23

@stephentoub good point. I know we "care" but we don't have any benchmarks for .NET Framework. @geeknoid How much do you care about this regressing .NET Framework performance?

davidfowl commented Jan 04 '23

I care a little, but I'm not overly concerned. In general, if teams care deeply about perf, they will not be using the legacy runtimes.

However, we could put back the old code as-is with a #if. It's not a big deal.
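
A minimal sketch of that idea; the storage in each branch is only a placeholder, since the real change would keep the existing array-based code behind the symbol:

using System.Collections.Concurrent;
using System.Collections.Generic;

// Sketch of a #if split: queue-backed storage on modern runtimes, an allocation-light
// fallback on .NET Framework, where ConcurrentQueue<T> allocates segments as it grows.
internal sealed class PoolStorage<T> where T : class
{
#if NETFRAMEWORK
    private readonly Stack<T> _items = new Stack<T>(capacity: 32); // capacity is illustrative

    public T? TryTake()
    {
        lock (_items)
        {
            return _items.Count > 0 ? _items.Pop() : null;
        }
    }

    public void Add(T item)
    {
        lock (_items)
        {
            _items.Push(item);
        }
    }
#else
    private readonly ConcurrentQueue<T> _items = new();

    public T? TryTake() => _items.TryDequeue(out var item) ? item : null;

    public void Add(T item) => _items.Enqueue(item);
#endif
}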

geeknoid commented Jan 04 '23

Hi @geeknoid. It looks like you just commented on a closed PR. The team will most probably miss it. If you'd like to bring something important up to their attention, consider filing a new issue and add enough details to build context.

ghost commented Jan 04 '23