Use System.Threading.Lock in Kestrel
Replacing lock objects with System.Threading.Lock
Description
- Addressing the Servers/Kestrel module.
- HTTP/3 has a few locked dictionaries/lists that are not addressed in this PR.
- Question: should these objects have a corresponding `Lock` field to lock on, instead of locking the object itself?
- In my personal measurements (not attached to this PR) the `Lock` type performed marginally better than the regular Monitor lock when there is no contention.
- Is there a way to run perf tests with `/benchmark plaintext aspnet-citrine-lin kestrel`, or is there a particular benchmark that might be interesting to run? Not sure if there is anything designed for locking and contention.
- The objects changed do not seem to be passed around otherwise (to other types that also lock on them), so I think these are safe. (Passing the `Lock` to a method accepting `object` and locking on the `object` results in problematic behavior, as far as I can see; see the sketch below.)
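To illustrate that last bullet, a minimal sketch (mine, not code from the PR), assuming the C# 13 / .NET 9 behavior where the compiler emits `Lock.EnterScope()` only when the static type of the locked expression is `Lock`, and warns (CS9216) when a `Lock` is converted to another type. As far as I understand, the two paths keep independent lock state, so they would not exclude each other:

```csharp
using System.Threading;

class LockPitfall
{
    private readonly Lock _lock = new();

    public void ViaLockType()
    {
        // Static type is Lock: the compiler emits _lock.EnterScope().
        lock (_lock)
        {
            // protected by the Lock's own state
        }
    }

    public void ViaObject()
    {
        // Converting Lock to object triggers warning CS9216; the callee
        // below falls back to Monitor-based locking.
        LockOn(_lock);
    }

    private static void LockOn(object o)
    {
        lock (o) // Monitor.Enter(o), independent of Lock.EnterScope()
        {
        }
    }
}
```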
Linked #56794
Is there a way to run perf tests with the `/benchmark plaintext aspnet-citrine-lin kestrel`
The team members can run those for you - it needs write permission on the repo.
- Is there a way to run perf tests with `/benchmark plaintext aspnet-citrine-lin kestrel`, or is there a particular benchmark that might be interesting to run? Not sure if there is anything designed for locking and contention.
In theory using Lock can be more performant than locking with object, idk how much the CLR has done to actually implement the theoretical improvements yet. But I would be very surprised if we saw any measurable differences in E2E benchmarks (considering normal benchmark noise). In the past we've seen Http2 affected by contention and did a big refactor to reduce that, so maybe it would have been a nice test, but less so now 😄
/benchmark plaintext aspnet-citrine-lin kestrel
Yes that is my impression as well.
Benchmark started for plaintext on aspnet-citrine-lin with kestrel. Logs: link
plaintext - aspnet-citrine-lin
| application | plaintext.base | plaintext.pr | Change |
|---|---|---|---|
| Max CPU Usage (%) | 100 | 99 | -1.00% |
| Max Cores usage (%) | 2,788 | 2,785 | -0.11% |
| Max Working Set (MB) | 91 | 92 | +1.10% |
| Max Private Memory (MB) | 592 | 592 | 0.00% |
| Build Time (ms) | 3,608 | 3,795 | +5.18% |
| Start Time (ms) | 228 | 225 | -1.32% |
| Published Size (KB) | 106,590 | 106,590 | 0.00% |
| Symbols Size (KB) | 53 | 53 | 0.00% |
| .NET Core SDK Version | 9.0.100-rc.1.24408.10 | 9.0.100-rc.1.24408.10 | |
| Max CPU Usage (%) | 99 | 100 | +0.44% |
| Max Working Set (MB) | 95 | 96 | +0.55% |
| Max GC Heap Size (MB) | 25 | 25 | -0.79% |
| Size of committed memory by the GC (MB) | 16 | 17 | +5.14% |
| Max Number of Gen 0 GCs / sec | 4.00 | 3.00 | -25.00% |
| Max Number of Gen 1 GCs / sec | 3.00 | 2.00 | -33.33% |
| Max Number of Gen 2 GCs / sec | 3.00 | 2.00 | -33.33% |
| Max Gen 0 GC Budget (MB) | 26 | 26 | 0.00% |
| Max Time in GC (%) | 8.00 | 6.00 | -25.00% |
| Max Gen 0 Size (B) | 1,569,560 | 1,731,808 | +10.34% |
| Max Gen 1 Size (B) | 1,954,968 | 2,312,136 | +18.27% |
| Max Gen 2 Size (B) | 2,479,352 | 2,397,416 | -3.30% |
| Max LOH Size (B) | 95,248 | 95,248 | 0.00% |
| Max POH Size (B) | 9,463,448 | 9,500,528 | +0.39% |
| Max Allocation Rate (B/sec) | 14,922,296 | 14,312,544 | -4.09% |
| Max GC Heap Fragmentation (%) | 586% | 694% | +18.45% |
| # of Assemblies Loaded | 62 | 62 | 0.00% |
| Max Exceptions (#/s) | 1,302 | 1,304 | +0.15% |
| Max Lock Contention (#/s) | 72 | 309 | +329.17% |
| Max ThreadPool Threads Count | 48 | 48 | 0.00% |
| Max ThreadPool Queue Length | 586 | 528 | -9.90% |
| Max ThreadPool Items (#/s) | 781,916 | 784,181 | +0.29% |
| Max Active Timers | 1 | 1 | 0.00% |
| IL Jitted (B) | 285,669 | 287,410 | +0.61% |
| Methods Jitted | 3,504 | 3,519 | +0.43% |
| load | plaintext.base | plaintext.pr | Change |
|---|---|---|---|
| Max CPU Usage (%) | 98 | 98 | 0.00% |
| Max Cores usage (%) | 2,754 | 2,749 | -0.18% |
| Max Working Set (MB) | 46 | 46 | 0.00% |
| Max Private Memory (MB) | 370 | 370 | 0.00% |
| Build Time (ms) | 3,369 | 3,513 | +4.27% |
| Start Time (ms) | 0 | 0 | |
| Published Size (KB) | 72,283 | 72,283 | 0.00% |
| Symbols Size (KB) | 0 | 0 | |
| .NET Core SDK Version | 8.0.303 | 8.0.303 | |
| First Request (ms) | 113 | 114 | +0.88% |
| Requests/sec | 11,815,358 | 11,814,488 | -0.01% |
| Requests | 178,307,264 | 178,285,096 | -0.01% |
| Mean latency (ms) | 1.31 | 1.35 | +3.05% |
| Max latency (ms) | 77.08 | 70.29 | -8.81% |
| Bad responses | 0 | 0 | |
| Socket errors | 0 | 0 | |
| Read throughput (MB/s) | 1,423.36 | 1,423.36 | 0.00% |
| Latency 50th (ms) | 0.67 | 0.68 | +0.74% |
| Latency 75th (ms) | 1.01 | 1.03 | +1.98% |
| Latency 90th (ms) | 2.02 | 2.18 | +7.92% |
| Latency 99th (ms) | 14.09 | 14.68 | +4.19% |
CI failure seems likely to be addressed by https://github.com/dotnet/aspnetcore/pull/57269.
My concern is that under lock contention the new lock does not seem to perform that well in my tests on x64. It looks nice on the first benchmark, but when I repeat the tests I consistently get 30% slower results for all following benchmarks.
I wouldn't expect these locks to be too contended under normal usage with the exception of Http2FrameWriter._windowUpdateLock, so maybe hold off on that one if we're worried about contended performance.
Has Lock seen much usage in the runtime yet? I see that it is now used for TimerQueue. If we can get good contended performance (I'm not sure why it would be worse), System.IO.Pipelines and System.Threading.Channels seem like good candidates. It would require an extra allocation for Channels though.
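To make the allocation point concrete, a hypothetical sketch (type and member names are mine, not the actual Channels internals): a class that currently locks on an object it already owns would need one extra dedicated `Lock` allocation per instance:

```csharp
using System.Collections.Generic;
using System.Threading;

internal sealed class QueueSketch<T>
{
    private readonly Queue<T> _items = new();

    // Monitor-based: reuses the existing queue object as the lock,
    // so no extra allocation is needed.
    public void EnqueueWithMonitor(T item)
    {
        lock (_items)
            _items.Enqueue(item);
    }

    // Lock-based: requires a dedicated Lock instance per object.
    private readonly Lock _lock = new();

    public void EnqueueWithLock(T item)
    {
        lock (_lock)
            _items.Enqueue(item);
    }
}
```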
@kouvel Do you have any idea why @ladeak might be seeing a slowdown with Lock when he repeats tests? And do we have a good rule of thumb for when to use Lock vs a plain object?
@ladeak, can you share your tests? Do the tests do some warmup to ensure that methods get tiered up?
There can be slight differences in performance due to different code gen and TLS access, but when using the lock keyword, Lock typically performs equal to or better than Monitor locks. Using Enter/Exit directly may be a bit slower in some cases when using Lock due to a slightly more expensive TLS lookup.
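For reference, a minimal sketch of the two patterns being compared (the `lock` statement over a `Lock`-typed field lowers to `EnterScope()`, while `Enter`/`Exit` are the direct calls):

```csharp
using System.Threading;

class LockUsagePatterns
{
    private readonly Lock _lock = new();

    public void ViaLockStatement()
    {
        // Lowered to: using (_lock.EnterScope()) { ... }
        lock (_lock)
        {
            // critical section
        }
    }

    public void ViaEnterExit()
    {
        // Direct Enter/Exit; per the comment above, this may pay a
        // slightly more expensive TLS lookup in some cases.
        _lock.Enter();
        try
        {
            // critical section
        }
        finally
        {
            _lock.Exit();
        }
    }
}
```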
@kouvel let me attach the benchmarks and the results:
```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkRunner.Run<Benchmarks>();

[SimpleJob]
public class Benchmarks
{
    private int _value = 0;
    private const int N = 10000;
    private readonly object _regularLock = new();
    private readonly Lock _newLock = new();

    [Benchmark]
    public async Task RegularLock()
    {
        var t1 = Task.Run(Update);
        var t2 = Task.Run(Update);
        var t3 = Task.Run(Update);
        var t4 = Task.Run(Update);
        await Task.WhenAll(t1, t2, t3, t4);

        void Update()
        {
            for (int i = 0; i < N; i++)
            {
                lock (_regularLock)
                    _value++;
            }
        }
    }

    [Benchmark]
    public async Task NewLock()
    {
        var t1 = Task.Run(Update);
        var t2 = Task.Run(Update);
        var t3 = Task.Run(Update);
        var t4 = Task.Run(Update);
        await Task.WhenAll(t1, t2, t3, t4);

        void Update()
        {
            for (int i = 0; i < N; i++)
            {
                lock (_newLock)
                    _value++;
            }
        }
    }
}
```
This is the result I get most of the times:
BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.3958/23H2/2023Update/SunValley3)
12th Gen Intel Core i7-1255U, 1 CPU, 12 logical and 10 physical cores
.NET SDK 9.0.100-preview.7.24371.4
[Host] : .NET 9.0.0 (9.0.24.36618), X64 RyuJIT AVX2
DefaultJob : .NET 9.0.0 (9.0.24.36618), X64 RyuJIT AVX2
| Method | Mean | Error | StdDev |
|------------ |---------:|----------:|----------:|
| RegularLock | 1.661 ms | 0.0331 ms | 0.0719 ms |
| NewLock | 3.109 ms | 0.0376 ms | 0.0351 ms |
The numbers vary slightly, RegularLock between 1.6–2.9 ms and NewLock between 2.4–3.1 ms; for example, another run:
| Method | Mean | Error | StdDev |
|------------ |---------:|----------:|----------:|
| RegularLock | 2.091 ms | 0.0611 ms | 0.1802 ms |
| NewLock | 2.836 ms | 0.0712 ms | 0.2099 ms |
I expect larger variance due to the nature of the tests. I tried different N values, e.g. 1000, but the results are similar (unless I go with a very small N).
Very rarely I get results like the following (or even NewLock being slightly faster):
| Method | Mean | Error | StdDev |
|------------ |---------:|----------:|----------:|
| RegularLock | 2.869 ms | 0.0569 ms | 0.1519 ms |
| NewLock | 2.944 ms | 0.1004 ms | 0.2961 ms |
// * Warnings *
MultimodalDistribution
Benchmarks.NewLock: Default -> It seems that the distribution can have several modes (mValue = 3)
Single Task
When I remove the parallel tasks and just `await t1;`, the results are different, and the Lock type is consistently faster:
| Method | Mean | Error | StdDev |
|------------ |---------:|--------:|--------:|
| RegularLock | 184.3 us | 1.62 us | 1.51 us |
| NewLock | 170.1 us | 1.15 us | 1.02 us |
3 Tasks
With 3 parallel tasks (instead of 4), I get results that are pretty close, but still consistently favor the old lock:
| Method | Mean | Error | StdDev |
|------------ |-----------:|---------:|---------:|
| RegularLock | 920.3 us | 11.25 us | 10.52 us |
| NewLock | 1,098.0 us | 21.86 us | 58.72 us |
2 Tasks
Looks similar to 3 tasks:
| Method | Mean | Error | StdDev |
|------------ |---------:|---------:|---------:|
| RegularLock | 549.1 us | 10.67 us | 17.22 us |
| NewLock | 581.5 us | 11.05 us | 11.34 us |
let me attach the benchmarks and the results:
I'm seeing opposite results on my machine. In any case, maybe an issue is that 10000 is small enough that one started task progresses significantly before the next one is started, and since uncontended lock acquisitions are much faster, they could progress a lot, which I imagine would introduce a lot of variability to the results. So it's hard to say how much contention is actually happening during each iteration of the test. I typically test it differently: have multiple threads continuously doing the contentious work, warm up, and then measure changes to some count over time from another thread. Maybe with BDN a setup/cleanup could achieve something like that. With this kind of lock usage I would not expect Lock to be slower; I'm seeing that it's noticeably faster on my machine.
let me attach the benchmarks and the results:
What do you see with very large N, like 1,000,000? Too large an N may cause the method to not be called enough to get tiered up, but anyway just curious.
@kouvel with N = 1_000_000;
BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4037/23H2/2023Update/SunValley3)
12th Gen Intel Core i7-1255U, 1 CPU, 12 logical and 10 physical cores
.NET SDK 9.0.100-preview.7.24371.4
[Host] : .NET 9.0.0 (9.0.24.36618), X64 RyuJIT AVX2
DefaultJob : .NET 9.0.0 (9.0.24.36618), X64 RyuJIT AVX2
| Method | Mean | Error | StdDev |
|------------ |---------:|--------:|--------:|
| RegularLock | 318.3 ms | 1.84 ms | 1.63 ms |
| NewLock | 360.3 ms | 6.96 ms | 8.80 ms |
and after a second run:
| Method | Mean | Error | StdDev |
|------------ |---------:|--------:|--------:|
| RegularLock | 263.7 ms | 3.40 ms | 3.01 ms |
| NewLock | 373.7 ms | 7.01 ms | 7.20 ms |
and a 3rd run:
| Method | Mean | Error | StdDev |
|------------ |---------:|--------:|--------:|
| RegularLock | 234.8 ms | 3.62 ms | 3.39 ms |
| NewLock | 339.3 ms | 6.43 ms | 6.31 ms |
with N = 1_000_000;
Ok, maybe there's a processor difference here. Based on the processor spec it seems like a low-wattage mobile processor. Maybe there are differences there that I'm not aware of. I'll try to post another test case that's similar to yours but without BDN, curious what you would see there. May be a bit before I get back though.
Take your time, I am going to sleep in the next few hours.
@ladeak, can you try this on your machine:
```csharp
using System.Diagnostics;

static class Program
{
    private static void Main()
    {
        TestMonitor();
        TestLock();
    }

    private static void TestMonitor()
    {
        object lockObj = new();
        int value = 0;
        Test(
            "Monitor",
            ref value,
            n =>
            {
                for (int i = 0; i < n; i++)
                {
                    lock (lockObj)
                        value++;
                }
            });
    }

    private static void TestLock()
    {
        Lock lockObj = new();
        int value = 0;
        Test(
            "Lock",
            ref value,
            n =>
            {
                for (int i = 0; i < n; i++)
                {
                    lock (lockObj)
                        value++;
                }
            });
    }

    private static void Test(string name, ref int value, Action<int> update)
    {
        bool stop = false;
        ThreadStart threadMain = () =>
        {
            while (!stop)
                update(100);
        };

        const int ThreadCount = 4;
        var threads = new Thread[ThreadCount];
        for (int i = 0; i < ThreadCount; i++)
        {
            var t = new Thread(threadMain);
            t.IsBackground = true;
            t.Start();
            threads[i] = t;
        }

        var sw = new Stopwatch();
        Thread.Sleep(500); // warmup

        // Sample the shared counter over 500 ms windows and report the
        // average time per increment (lower is better).
        for (int i = 0; i < 8; i++)
        {
            sw.Restart();
            int startValue = value;
            Thread.Sleep(500);
            sw.Stop();
            int changes = value - startValue;
            Console.WriteLine($"{name,10}: {sw.Elapsed.TotalNanoseconds / changes,10:0.000} ns/change");
        }

        stop = true;
        foreach (var t in threads)
            t.Join();
    }
}
```
On my machine (AMD Ryzen 9 5950X), I'm seeing numbers similar to these:
Monitor: 50.974 ns/change
Monitor: 49.492 ns/change
Monitor: 49.922 ns/change
Monitor: 50.590 ns/change
Monitor: 49.990 ns/change
Monitor: 50.124 ns/change
Monitor: 49.841 ns/change
Monitor: 49.883 ns/change
Lock: 44.997 ns/change
Lock: 44.390 ns/change
Lock: 44.946 ns/change
Lock: 44.924 ns/change
Lock: 44.703 ns/change
Lock: 44.684 ns/change
Lock: 44.857 ns/change
Lock: 44.880 ns/change
@kouvel
Monitor: 86.511 ns/change
Monitor: 84.948 ns/change
Monitor: 86.056 ns/change
Monitor: 82.722 ns/change
Monitor: 85.529 ns/change
Monitor: 86.259 ns/change
Monitor: 85.450 ns/change
Monitor: 84.780 ns/change
Lock: 78.784 ns/change
Lock: 78.250 ns/change
Lock: 77.387 ns/change
Lock: 76.139 ns/change
Lock: 79.439 ns/change
Lock: 77.637 ns/change
Lock: 80.203 ns/change
Lock: 79.250 ns/change
and then I run it again:
Monitor: 73.323 ns/change
Monitor: 72.383 ns/change
Monitor: 71.390 ns/change
Monitor: 72.732 ns/change
Monitor: 73.548 ns/change
Monitor: 73.232 ns/change
Monitor: 73.744 ns/change
Monitor: 73.970 ns/change
Lock: 81.881 ns/change
Lock: 83.385 ns/change
Lock: 84.567 ns/change
Lock: 84.398 ns/change
Lock: 83.790 ns/change
Lock: 83.837 ns/change
Lock: 84.742 ns/change
Lock: 84.603 ns/change
and again:
Monitor: 86.886 ns/change
Monitor: 85.706 ns/change
Monitor: 85.577 ns/change
Monitor: 87.071 ns/change
Monitor: 88.068 ns/change
Monitor: 88.368 ns/change
Monitor: 87.037 ns/change
Monitor: 87.302 ns/change
Lock: 80.370 ns/change
Lock: 80.248 ns/change
Lock: 80.148 ns/change
Lock: 79.105 ns/change
Lock: 75.010 ns/change
Lock: 80.219 ns/change
Lock: 78.221 ns/change
Lock: 79.416 ns/change
and a 4th time:
Monitor: 74.073 ns/change
Monitor: 74.139 ns/change
Monitor: 74.123 ns/change
Monitor: 74.676 ns/change
Monitor: 73.848 ns/change
Monitor: 74.512 ns/change
Monitor: 74.441 ns/change
Monitor: 75.091 ns/change
Lock: 71.967 ns/change
Lock: 71.727 ns/change
Lock: 71.873 ns/change
Lock: 70.318 ns/change
Lock: 71.796 ns/change
Lock: 71.635 ns/change
Lock: 71.204 ns/change
Lock: 72.196 ns/change
I think these results look better, in the sense that in BDN the new Lock had worse results in general, while it looks much better in this test.
Ok maybe just a processor difference there, my results are varying a bit too between runs, but mostly similar. There are more things that can be improved for Lock, such as https://github.com/dotnet/runtime/issues/79485. I'm not too worried about the difference there.
@kouvel in your suggested measurement code I updated the ThreadCount const from 4 to 2, and I get the following results:
Monitor: 23.446 ns/change
Monitor: 22.299 ns/change
Monitor: 23.097 ns/change
Monitor: 22.636 ns/change
Monitor: 23.446 ns/change
Monitor: 22.707 ns/change
Monitor: 22.448 ns/change
Monitor: 22.749 ns/change
Lock: 34.962 ns/change
Lock: 33.850 ns/change
Lock: 35.289 ns/change
Lock: 35.144 ns/change
Lock: 34.545 ns/change
Lock: 34.966 ns/change
Lock: 33.423 ns/change
Lock: 32.656 ns/change
2nd run:
Monitor: 23.121 ns/change
Monitor: 22.298 ns/change
Monitor: 22.972 ns/change
Monitor: 23.040 ns/change
Monitor: 22.283 ns/change
Monitor: 22.390 ns/change
Monitor: 22.106 ns/change
Monitor: 22.301 ns/change
Lock: 26.932 ns/change
Lock: 26.819 ns/change
Lock: 26.672 ns/change
Lock: 26.820 ns/change
Lock: 26.858 ns/change
Lock: 26.787 ns/change
Lock: 26.837 ns/change
Lock: 26.911 ns/change
3rd run:
Monitor: 26.833 ns/change
Monitor: 26.722 ns/change
Monitor: 25.793 ns/change
Monitor: 26.854 ns/change
Monitor: 26.866 ns/change
Monitor: 26.718 ns/change
Monitor: 26.735 ns/change
Monitor: 26.741 ns/change
Lock: 41.130 ns/change
Lock: 41.578 ns/change
Lock: 40.824 ns/change
Lock: 40.474 ns/change
Lock: 36.992 ns/change
Lock: 39.834 ns/change
Lock: 36.275 ns/change
Lock: 35.051 ns/change
I repeated the tests a few times with different thread counts: with ThreadCount=4, Lock is consistently better; with 3 or 2, Monitor is consistently better; with 1 thread, Lock is ~1 ns better.
Yeah, it is actually possible. In this test, the more times one thread can reacquire the lock in sequence, the better the result would be. For instance, using a Sleep(1) upon contention would make the test show very good results, but lock acquisition latency would get very high. Slight differences in timing could affect the result; benchmarking performance under contention is probably more nuanced than this. I typically randomize how long the lock is held to help stabilize the results (see the sketch below). There are also some small spin-waiting differences in Lock that yield the thread more quickly upon contention, which could affect the timing, and which help to not waste time spinning when other threads can do more useful work.
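A minimal sketch of that randomization, as a hypothetical variation of the update loop from the test program above (the spin-wait bounds are arbitrary illustrative values, not anything kouvel specified):

```csharp
using System.Threading;

class RandomizedHold
{
    private static readonly Lock s_lock = new();
    private static int s_value;

    // Randomizing the hold time decorrelates acquisition timing across
    // threads, so one thread cannot keep winning the lock in a tight
    // sequence; this tends to stabilize contention results.
    public static void Update(int n)
    {
        var random = new Random();
        for (int i = 0; i < n; i++)
        {
            int spins = random.Next(10, 100); // arbitrary bounds
            lock (s_lock)
            {
                s_value++;
                Thread.SpinWait(spins); // variable time under the lock
            }
        }
    }
}
```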
@BrennanConroy and @halter73 I think all questions are clarified.
If we're worried about regressing contended performance, should we continue to use object for Http2FrameWriter._windowUpdateLock and FlowControl._flowLock? I imagine Http2FrameWriter.WriteToOutputPipe will usually acquire the lock more than Http2Connection.ProcessWindowUpdateFrameAsync when writing a bunch of response bodies, but both methods will acquire the lock a lot from different threads. It seems most similar to the ThreadCount=2 scenario.
Similar logic goes for FlowControl._flowLock. That only affects request body uploads which are rarer, but could be worse because each concurrent stream tries to independently acquire the connection-level _flowLock on their own threadpool thread. It doesn't have logic to consolidate the input reading for all streams into a single loop that's analogous to what WriteToOutputPipe does for output writing.
The other locks seem like a purer win, since they will almost always be accessed from the application thread, and the locks are mostly there to handle early aborts.
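As a sketch of the suggested split (field and method names echo the discussion; this is illustrative, not the actual Http2FrameWriter code):

```csharp
using System.Threading;

internal sealed class FrameWriterSketch
{
    // Potentially contended from multiple threads (window updates vs.
    // the output loop), so keep a plain object for now:
    private readonly object _windowUpdateLock = new();

    // Rarely contended (mostly taken on the application thread, largely
    // guarding early aborts), so Lock is a purer win:
    private readonly Lock _writeLock = new();

    public void UpdateConnectionWindow(int bytes)
    {
        lock (_windowUpdateLock)
        {
            // adjust the flow-control window
        }
    }

    public void Write()
    {
        lock (_writeLock)
        {
            // serialize writes, observe aborts
        }
    }
}
```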
I have been trying a few other things out, e.g. WSL (Ubuntu 22.04.3) on the same machine.
With 4 threads (the variance is huge between executions):
Monitor: 150.381 ns/change
Monitor: 148.997 ns/change
Monitor: 152.141 ns/change
Monitor: 154.373 ns/change
Monitor: 148.490 ns/change
Monitor: 153.407 ns/change
Monitor: 147.473 ns/change
Monitor: 133.847 ns/change
Lock: 126.061 ns/change
Lock: 126.451 ns/change
Lock: 125.610 ns/change
Lock: 126.589 ns/change
Lock: 125.851 ns/change
Lock: 120.473 ns/change
Lock: 105.485 ns/change
Lock: 105.622 ns/change
And another run:
Monitor: 70.997 ns/change
Monitor: 70.977 ns/change
Monitor: 71.349 ns/change
Monitor: 71.223 ns/change
Monitor: 71.148 ns/change
Monitor: 71.697 ns/change
Monitor: 70.972 ns/change
Monitor: 70.906 ns/change
Lock: 114.761 ns/change
Lock: 113.196 ns/change
Lock: 114.455 ns/change
Lock: 113.534 ns/change
Lock: 114.359 ns/change
Lock: 113.941 ns/change
Lock: 114.224 ns/change
Lock: 114.851 ns/change
With 2 threads the results are pretty similar to those on Windows, but with somewhat more variance between runs:
Monitor: 25.994 ns/change
Monitor: 26.257 ns/change
Monitor: 26.150 ns/change
Monitor: 26.087 ns/change
Monitor: 25.803 ns/change
Monitor: 26.018 ns/change
Monitor: 25.840 ns/change
Monitor: 25.803 ns/change
Lock: 25.146 ns/change
Lock: 24.732 ns/change
Lock: 24.764 ns/change
Lock: 25.223 ns/change
Lock: 25.227 ns/change
Lock: 25.133 ns/change
Lock: 25.582 ns/change
Lock: 25.067 ns/change
Some updates: with the latest preview, the Lock type seems to be faster in all of the above cases (1–4 threads) compared to locking a non-Lock instance.
@halter73 would you have further suggestions?
Thanks for all the work you've put into this! I think it would be better to first merge the System.Threading.Lock changes and separate the lock-free changes into another PR. And I'd keep the _flowLock as a plain object at least for now.
The lock-free changes look good to me too, but I would appreciate someone like @BrennanConroy, @amcasey, or maybe even @benaadams looking at them. Putting them in a separate PR will make it easier for others to review the changes on their own and easier to revert if need be.
Reverted this PR to the scope of using Lock everywhere, incorporating the latest measurements (which showed Lock to be faster in all cases), so InputFlowControl and Http2FrameWriter._windowUpdateLock use it too. I will soon open a new PR with the interlocked changes.
@halter73 I have opened a separate PR: https://github.com/dotnet/aspnetcore/pull/57968 for the interlocked changes in InputFlowControl
Thanks @ladeak. I'll try to find time to review #57968 soon.
@sebastienros Heads up in case this has any measurable perf impact.