parking_lot
Heavily degraded performance under extreme contention on some processors
The blog post Mutexes Are Faster Than Spinlocks generated a Reddit discussion about lock implementations, which led some users to benchmark parking_lot
on Windows. Some of the results are very problematic, while others merely leave room for improvement (parking_lot is consistently faster than spinlocks on Linux and macOS, but not on Windows for some reason).
I did not run this benchmark myself since I don't currently have Windows installed, but since the user hadn't filed an issue I decided to post it here. Their comment: https://www.reddit.com/r/rust/comments/ejx7y8/blog_post_mutexes_are_faster_than_spinlocks/fd31um3/
The results:
Benchmark code: https://github.com/matklad/lock-bench
Windows 10 Pro
Intel Core i7-5930k @ 3.5 GHz
stable-x86_64-pc-windows-msvc (default)
rustc 1.40.0 (73528e339 2019-12-16)
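For readers who don't want to open the repo: a minimal sketch of what the four CLI arguments mean, assuming the benchmark works roughly as the blog post describes (n_threads threads, each doing n_ops lock/increment operations spread over n_locks mutexes, timed for n_rounds rounds). This is illustrative only, not the actual lock-bench source; the lock-selection logic in particular is a stand-in.

```rust
// Illustrative sketch only: `32 2 10000 100` means 32 threads, 2 locks,
// 10000 ops per thread, 100 timed rounds.
use std::sync::{Arc, Mutex};
use std::time::{Duration, Instant};

fn round(n_threads: usize, n_locks: usize, n_ops: usize) -> Duration {
    let locks: Arc<Vec<Mutex<u32>>> =
        Arc::new((0..n_locks).map(|_| Mutex::new(0u32)).collect());
    let start = Instant::now();
    let handles: Vec<_> = (0..n_threads)
        .map(|t| {
            let locks = Arc::clone(&locks);
            std::thread::spawn(move || {
                for i in 0..n_ops {
                    // Cheap deterministic stand-in for the benchmark's random lock choice.
                    let idx = i.wrapping_mul(31).wrapping_add(t) % n_locks;
                    *locks[idx].lock().unwrap() += 1;
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    start.elapsed()
}

fn main() {
    // "extreme contention": 32 threads hammering just 2 locks, 100 rounds.
    let times: Vec<Duration> = (0..100).map(|_| round(32, 2, 10_000)).collect();
    let avg = times.iter().sum::<Duration>() / times.len() as u32;
    println!("avg {:?} min {:?} max {:?}",
             avg, times.iter().min().unwrap(), times.iter().max().unwrap());
}
```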
extreme contention
cargo run --release 32 2 10000 100
Finished release [optimized] target(s) in 0.03s
Running `target\release\lock-bench.exe 32 2 10000 100`
Options {
n_threads: 32,
n_locks: 2,
n_ops: 10000,
n_rounds: 100,
}
std::sync::Mutex avg 32.452982ms min 20.4146ms max 45.2767ms
parking_lot::Mutex avg 154.509064ms min 111.2522ms max 180.4367ms
spin::Mutex avg 46.3496ms min 33.5478ms max 56.1689ms
AmdSpinlock avg 45.725299ms min 32.1936ms max 54.4236ms
std::sync::Mutex avg 33.383154ms min 18.2827ms max 46.0634ms
parking_lot::Mutex avg 134.983307ms min 95.5948ms max 176.1896ms
spin::Mutex avg 43.402769ms min 31.9209ms max 55.0075ms
AmdSpinlock avg 39.572361ms min 28.1705ms max 50.2935ms
heavy contention
cargo run --release 32 64 10000 100
Finished release [optimized] target(s) in 0.03s
Running `target\release\lock-bench.exe 32 64 10000 100`
Options {
n_threads: 32,
n_locks: 64,
n_ops: 10000,
n_rounds: 100,
}
std::sync::Mutex avg 12.8268ms min 6.4807ms max 14.174ms
parking_lot::Mutex avg 8.470518ms min 3.6558ms max 10.0896ms
spin::Mutex avg 6.356252ms min 4.6299ms max 8.1838ms
AmdSpinlock avg 7.147972ms min 5.7731ms max 9.2027ms
std::sync::Mutex avg 12.790879ms min 3.7349ms max 14.4933ms
parking_lot::Mutex avg 8.526535ms min 6.7143ms max 10.0845ms
spin::Mutex avg 5.730139ms min 2.8063ms max 7.6221ms
AmdSpinlock avg 7.082415ms min 5.2678ms max 8.2064ms
light contention
cargo run --release 32 1000 10000 100
Finished release [optimized] target(s) in 0.05s
Running `target\release\lock-bench.exe 32 1000 10000 100`
Options {
n_threads: 32,
n_locks: 1000,
n_ops: 10000,
n_rounds: 100,
}
std::sync::Mutex avg 7.736325ms min 4.3287ms max 9.194ms
parking_lot::Mutex avg 4.912407ms min 4.1386ms max 5.9617ms
spin::Mutex avg 3.787679ms min 3.2468ms max 4.8136ms
AmdSpinlock avg 4.229783ms min 1.0404ms max 5.2414ms
std::sync::Mutex avg 7.791248ms min 6.2809ms max 8.9858ms
parking_lot::Mutex avg 4.933393ms min 4.3319ms max 6.1515ms
spin::Mutex avg 3.782046ms min 3.3339ms max 5.4954ms
AmdSpinlock avg 4.22442ms min 3.1285ms max 5.3338ms
no contention
cargo run --release 32 1000000 10000 100
Finished release [optimized] target(s) in 0.03s
Running `target\release\lock-bench.exe 32 1000000 10000 100`
Options {
n_threads: 32,
n_locks: 1000000,
n_ops: 10000,
n_rounds: 100,
}
std::sync::Mutex avg 12.465917ms min 8.8088ms max 13.6216ms
parking_lot::Mutex avg 5.164135ms min 4.2478ms max 6.1451ms
spin::Mutex avg 4.112927ms min 3.1624ms max 5.599ms
AmdSpinlock avg 4.302528ms min 4.0533ms max 5.4168ms
std::sync::Mutex avg 11.765036ms min 3.3567ms max 13.5108ms
parking_lot::Mutex avg 3.992219ms min 2.4974ms max 5.5604ms
spin::Mutex avg 3.425334ms min 2.0133ms max 4.7788ms
AmdSpinlock avg 3.813034ms min 2.2009ms max 5.0947ms
I wonder if the result would be different if we used SwitchToThread instead of Sleep when yielding. Can you test this? I don't have a Windows system.
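For context, a hypothetical yield shim contrasting the two calls being compared; the winapi paths below are assumptions for illustration, not parking_lot's actual internals.

```rust
// Hypothetical shim contrasting the two Windows yield styles discussed above.
// Assumes the `winapi` crate with the "processthreadsapi" and "synchapi" features.
#[cfg(windows)]
fn thread_yield(use_switch_to_thread: bool) {
    use winapi::um::processthreadsapi::SwitchToThread;
    use winapi::um::synchapi::Sleep;
    unsafe {
        if use_switch_to_thread {
            // Yields to another ready thread on the current processor, if any;
            // returns immediately otherwise.
            SwitchToThread();
        } else {
            // Sleep(0) gives up the rest of the time slice to any ready thread
            // of equal priority, possibly on another processor.
            Sleep(0);
        }
    }
}
```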
I don't have a Windows system either, and the benchmark wasn't done by me, but another user got similar results on Linux, so maybe there is a common denominator.
https://www.reddit.com/r/rust/comments/ejx7y8/blog_post_mutexes_are_faster_than_spinlocks/fd3taac/
Linux system (rustc 1.41.0-nightly 2019-12-05, AMD 3900x 12 cores with SMT).
extreme contention
❯ cargo run --release 32 2 10000 100
Finished release [optimized] target(s) in 0.00s
Running `target/release/lock-bench 32 2 10000 100`
Options {
n_threads: 32,
n_locks: 2,
n_ops: 10000,
n_rounds: 100,
}
std::sync::Mutex avg 39.63915ms min 34.618755ms max 51.911789ms
parking_lot::Mutex avg 222.896391ms min 214.575148ms max 226.433204ms
spin::Mutex avg 20.253741ms min 12.694752ms max 38.933376ms
AmdSpinlock avg 17.53803ms min 11.353536ms max 51.322618ms
std::sync::Mutex avg 39.423473ms min 33.850454ms max 47.47324ms
parking_lot::Mutex avg 222.267268ms min 217.754466ms max 226.037187ms
spin::Mutex avg 20.186599ms min 12.566426ms max 62.728842ms
AmdSpinlock avg 17.215404ms min 11.445496ms max 46.907045ms
heavy contention
❯ cargo run --release 32 64 10000 100
Finished release [optimized] target(s) in 0.00s
Running `target/release/lock-bench 32 64 10000 100`
Options {
n_threads: 32,
n_locks: 64,
n_ops: 10000,
n_rounds: 100,
}
std::sync::Mutex avg 8.144328ms min 7.676202ms max 8.855408ms
parking_lot::Mutex avg 6.590482ms min 1.666855ms max 8.721845ms
spin::Mutex avg 15.085528ms min 1.510395ms max 42.460191ms
AmdSpinlock avg 9.331913ms min 1.681545ms max 24.24093ms
std::sync::Mutex avg 8.117876ms min 7.600261ms max 8.398674ms
parking_lot::Mutex avg 5.605486ms min 1.647048ms max 8.671342ms
spin::Mutex avg 12.872803ms min 1.517989ms max 39.331793ms
AmdSpinlock avg 8.278936ms min 1.779218ms max 34.416964ms
light contention
❯ cargo run --release 32 1000 10000 100
Finished release [optimized] target(s) in 0.00s
Running `target/release/lock-bench 32 1000 10000 100`
Options {
n_threads: 32,
n_locks: 1000,
n_ops: 10000,
n_rounds: 100,
}
std::sync::Mutex avg 4.673801ms min 4.271466ms max 5.416596ms
parking_lot::Mutex avg 1.379981ms min 1.12888ms max 1.714049ms
spin::Mutex avg 14.5374ms min 1.050929ms max 46.961405ms
AmdSpinlock avg 8.405825ms min 1.172899ms max 31.04467ms
std::sync::Mutex avg 4.660858ms min 4.333317ms max 5.126614ms
parking_lot::Mutex avg 1.379758ms min 1.176389ms max 1.749378ms
spin::Mutex avg 14.796396ms min 1.039289ms max 38.121532ms
AmdSpinlock avg 7.045806ms min 1.189589ms max 32.977048ms
no contention
❯ cargo run --release 32 1000000 10000 100
Finished release [optimized] target(s) in 0.00s
Running `target/release/lock-bench 32 1000000 10000 100`
Options {
n_threads: 32,
n_locks: 1000000,
n_ops: 10000,
n_rounds: 100,
}
std::sync::Mutex avg 5.488052ms min 4.789075ms max 5.913014ms
parking_lot::Mutex avg 1.570826ms min 1.294428ms max 1.826788ms
spin::Mutex avg 1.383231ms min 1.162079ms max 1.678709ms
AmdSpinlock avg 1.363113ms min 1.120449ms max 1.582918ms
std::sync::Mutex avg 5.525267ms min 4.877406ms max 5.907605ms
parking_lot::Mutex avg 1.586628ms min 1.317512ms max 2.012493ms
spin::Mutex avg 1.388559ms min 1.231672ms max 1.603962ms
AmdSpinlock avg 1.38805ms min 1.145911ms max 1.590503ms
I tested on my machine on Linux and cannot reproduce. Maybe it's a question of the whole system rather than just the operating system.
I'm using Linux, rustc 1.40.0 (73528e339 2019-12-16), and an Intel i5-8250U, 8 cores.
I realize it's not ideal that nobody can reproduce this, but the results were so unexpected that they deserve a closer investigation, especially since parking_lot will be integrated into std.
I ran the same tests on Windows instead of Linux. The results are here.
Also a run of the "extreme contention" case with SwitchToThread instead of Sleep:
Running `target\release\lock-bench.exe 32 2 10000 100`
Options {
n_threads: 32,
n_locks: 2,
n_ops: 10000,
n_rounds: 100,
}
std::sync::Mutex avg 48.186484ms min 46.201ms max 68.707ms
parking_lot::Mutex avg 190.429647ms min 185.9003ms max 195.4049ms
spin::Mutex avg 8.325163ms min 7.919ms max 9.3455ms
AmdSpinlock avg 7.891016ms min 7.5464ms max 9.2582ms
std::sync::Mutex avg 49.480777ms min 45.7975ms max 68.8675ms
parking_lot::Mutex avg 192.0885ms min 188.0831ms max 195.9926ms
spin::Mutex avg 8.310387ms min 8.0344ms max 10.1583ms
AmdSpinlock avg 7.894169ms min 7.5514ms max 8.7815ms
I've run the benchmark on Windows (the same machine from the blog post, just a different OS), and can't reproduce the perf degradation of parking_lot.
@cynecx yes that's correct.
I just tried it on some other hardware: i7-4770T (Haswell, Desktop)
lock-bench> cargo run --release 32 2 10000 100
Finished release [optimized] target(s) in 0.00s
Running `target/release/lock-bench 32 2 10000 100`
Options {
n_threads: 32,
n_locks: 2,
n_ops: 10000,
n_rounds: 100,
}
std::sync::Mutex avg 29.457212ms min 25.718361ms max 31.862572ms
parking_lot::Mutex avg 14.884388ms min 11.719785ms max 17.62964ms
spin::Mutex avg 41.307676ms min 20.261094ms max 94.301218ms
AmdSpinlock avg 45.520435ms min 21.818796ms max 135.193445ms
std::sync::Mutex avg 29.411401ms min 24.678649ms max 32.887241ms
parking_lot::Mutex avg 15.304354ms min 11.821305ms max 17.14152ms
spin::Mutex avg 39.935237ms min 18.339029ms max 103.907344ms
AmdSpinlock avg 45.176324ms min 18.215179ms max 99.126875ms
i5-6200U (Skylake, Notebook)
Running `target/release/lock-bench 32 2 10000 100`
Options {
n_threads: 32,
n_locks: 2,
n_ops: 10000,
n_rounds: 100,
}
std::sync::Mutex avg 111.925873ms min 31.216135ms max 128.665372ms
parking_lot::Mutex avg 37.560599ms min 33.474334ms max 49.060267ms
spin::Mutex avg 121.052961ms min 24.271047ms max 187.949351ms
AmdSpinlock avg 122.133179ms min 16.167021ms max 178.323671ms
std::sync::Mutex avg 113.942109ms min 104.396629ms max 123.955679ms
parking_lot::Mutex avg 37.321911ms min 33.046695ms max 43.749566ms
spin::Mutex avg 125.359875ms min 24.666258ms max 182.409344ms
AmdSpinlock avg 117.566367ms min 16.328205ms max 176.297166ms
The Intel Core i7-5930k of the initial Windows post seems to be a Haswell-E.
@matklad It seems to be related to the microarchitecture used. @theunknownxy's results show the same degradation on both OSes (Windows and Linux).
EDIT: I've accidentally removed my previous comment.
I am not entirely sure, but perhaps this has to do with the SpinWait? Its spin-wait uses spin_loop_hint, which basically lowers to a pause instruction, and it's well known that the pause instruction's latency varies across microarchitectures.
Perhaps someone could experiment with lowering the condition here: https://github.com/Amanieu/parking_lot/blob/8f69413ae8271a2a2e9067e61392938f6dadd59c/core/src/spinwait.rs#L53
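For reference, the logic around that line looks roughly like the self-contained sketch below. It is paraphrased, not the verbatim source: the real code uses its own thread_parker::thread_yield and spin_loop_hint helpers, with a cap of 10 iterations and the pause loop used for the first 3 (which is what the later comments about going from 2^3 to 2^15 spins refer to).

```rust
// Paraphrased sketch of core/src/spinwait.rs, not the verbatim source.
struct SpinWait {
    counter: u32,
}

fn cpu_relax(iterations: u32) {
    for _ in 0..iterations {
        // Equivalent of spin_loop_hint(), which lowers to the x86 `pause` instruction.
        std::hint::spin_loop();
    }
}

impl SpinWait {
    fn spin(&mut self) -> bool {
        if self.counter >= 10 {
            // Give up spinning; the caller falls back to parking the thread.
            return false;
        }
        self.counter += 1;
        if self.counter <= 3 {
            // Exponential busy-wait: 2, 4, then 8 pause-loop iterations.
            cpu_relax(1 << self.counter);
        } else {
            // After that, yield the time slice back to the OS scheduler
            // (the real code calls thread_parker::thread_yield()).
            std::thread::yield_now();
        }
        true
    }
}
```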
@cynecx Lowering the value in the condition does not seem to have an impact on the results.
With self.counter <= 1:
Running `target\release\lock-bench.exe 32 2 10000 100`
Options {
n_threads: 32,
n_locks: 2,
n_ops: 10000,
n_rounds: 100,
}
std::sync::Mutex avg 49.474712ms min 46.0883ms max 69.2954ms
parking_lot::Mutex avg 191.411675ms min 187.1015ms max 194.4445ms
spin::Mutex avg 8.319929ms min 8.0446ms max 9.2257ms
AmdSpinlock avg 7.870013ms min 7.4975ms max 8.8647ms
std::sync::Mutex avg 48.940684ms min 46.2259ms max 68.8908ms
parking_lot::Mutex avg 191.403276ms min 187.6034ms max 194.5025ms
spin::Mutex avg 8.540062ms min 7.9655ms max 33.9991ms
AmdSpinlock avg 7.869302ms min 7.6215ms max 8.4093ms
always using thread_yield (ignoring the counter):
Running `target\release\lock-bench.exe 32 2 10000 100`
Options {
n_threads: 32,
n_locks: 2,
n_ops: 10000,
n_rounds: 100,
}
std::sync::Mutex avg 48.54949ms min 46.0273ms max 68.669ms
parking_lot::Mutex avg 189.396441ms min 184.3732ms max 193.476ms
spin::Mutex avg 8.366757ms min 7.4263ms max 10.4656ms
AmdSpinlock avg 7.893677ms min 7.4782ms max 9.6092ms
std::sync::Mutex avg 48.805392ms min 46.4548ms max 68.7367ms
parking_lot::Mutex avg 189.496029ms min 184.8406ms max 194.1087ms
spin::Mutex avg 8.632158ms min 7.6492ms max 40.5276ms
AmdSpinlock avg 7.874248ms min 7.5315ms max 8.7431ms
Linux 4.15.0-72, AMD Ryzen 7 2700x (8 physical cores, SMT enabled), "extreme contention":
std::sync::Mutex avg 32.08904ms min 28.112784ms max 33.733244ms
parking_lot::Mutex avg 148.294697ms min 141.888311ms max 150.374082ms
spin::Mutex avg 87.941173ms min 46.837399ms max 158.922402ms
AmdSpinlock avg 78.83599ms min 47.9706ms max 133.216168ms
I'd recommend looking closer at AMD, especially ZEN based machines, as they seem to be the common denominator so far.
> I'd recommend looking closer at AMD, especially ZEN based machines, as they seem to be the common denominator so far.

@Voultapher
The first Windows benchmark is Intel, though that specific model may be doing something similar to most AMD parts. Possibly related to Hyperthreading/SMT. Maybe if SMT is disabled and the bench is re-run, the performance degradation for parking_lot will disappear?
Disabling SMT improves the situation, but the problem is still present:
std::sync::Mutex avg 30.765218ms min 27.16788ms max 34.434105ms
parking_lot::Mutex avg 80.858621ms min 25.013774ms max 112.516787ms
spin::Mutex avg 73.487341ms min 24.467014ms max 178.773357ms
AmdSpinlock avg 72.41482ms min 28.667988ms max 173.171411ms
Well, I don’t have access to an AMD Ryzen (and I can’t find a cloud service which specifically offers them), but for those who have one, perhaps some profiling (e.g. perf) can shed some light?
Well, it seems that EC2 does have AMD instances (AMD EPYC 7571), so I got this:
System:
- EC2 t3a.2xlarge
- 8 vCores (AMD EPYC 7571)
- 32GB RAM
- Linux: 4.15.0-1051-aws
Options {
n_threads: 32,
n_locks: 2,
n_ops: 10000,
n_rounds: 100,
}
std::sync::Mutex avg 39.300861ms min 35.752879ms max 41.433415ms
parking_lot::Mutex avg 36.729054ms min 35.275409ms max 38.59546ms
spin::Mutex avg 92.0057ms min 40.678487ms max 206.923339ms
AmdSpinlock avg 70.080664ms min 34.398632ms max 193.121841ms
std::sync::Mutex avg 39.521573ms min 36.72074ms max 41.15228ms
parking_lot::Mutex avg 36.669689ms min 34.570288ms max 38.946108ms
spin::Mutex avg 96.520724ms min 40.866666ms max 207.652942ms
AmdSpinlock avg 72.529016ms min 33.687999ms max 196.055792ms
Options {
n_threads: 8,
n_locks: 2,
n_ops: 10000,
n_rounds: 100,
}
std::sync::Mutex avg 9.708629ms min 8.964155ms max 10.842344ms
parking_lot::Mutex avg 8.112247ms min 7.52242ms max 8.667363ms
spin::Mutex avg 10.472547ms min 7.978397ms max 11.968051ms
AmdSpinlock avg 8.328134ms min 7.276999ms max 23.551344ms
std::sync::Mutex avg 9.768446ms min 8.710725ms max 11.195713ms
parking_lot::Mutex avg 8.066572ms min 6.970456ms max 8.696175ms
spin::Mutex avg 10.505385ms min 7.947568ms max 11.267302ms
AmdSpinlock avg 8.161658ms min 5.269307ms max 8.675885ms
Doesn't seem to be affected :/
EDIT:
Also, according to AMD's SKU naming scheme, this CPU is based on first-generation Zen, whereas most of the problematic benchmarks above were run on second-generation Zen.
Here are some results from a 3900x on windows:
$ cargo run --release 32 2 10000 100
std::sync::Mutex avg 46.573633ms min 44.3294ms max 65.4726ms
parking_lot::Mutex avg 181.645635ms min 106.3233ms max 185.5278ms
spin::Mutex avg 8.439861ms min 7.9094ms max 10.1592ms
AmdSpinlock avg 7.834648ms min 7.4119ms max 8.2538ms
It's obvious that the spinlock solutions would do fairly well on that kind of machine: there are plenty of cores around, so the spinning threads don't need to get preempted that often. However, the parking_lot vs std::sync::Mutex difference is really big.
I think the conclusion that it's AMD-specific hasn't held up so far, since the initial post used an Intel CPU. It also doesn't seem to be constrained to Windows, since someone with a 3900x on Linux also observed a degradation.
Interesting thing. I will try playing around with some settings.
SwitchToThread vs Sleep makes no difference for me.
I got a large improvement by increasing the number of spins before a sleep. But it took a really big increase, so my concern is that I thereby just degraded it to a spinlock:
pub fn spin(&mut self) -> bool {
    if self.counter >= 20 {
        return false;
    }
    self.counter += 1;
    if self.counter <= 15 {
        cpu_relax(1 << self.counter);
    } else {
        thread_parker::thread_yield();
    }
    true
}
Result:
parking_lot::Mutex avg 12.178126ms min 11.6343ms max 14.1958ms
So this goes from 2^3 spins to 2^15. Even with 2^9 there was no noticeable improvement.
Increasing the number of thread_parker::thread_yield() calls always decreases the performance, so it seems that kind of yielding to the OS might be inefficient.
E.g. going to
if self.counter >= 16 {
    return false;
}
which avoids all thread yield calls just changes the result to
parking_lot::Mutex avg 12.370461ms min 11.8578ms max 14.3432ms
With Hyperthreading disabled I most times get
parking_lot::Mutex avg 80.075459ms min 47.6624ms max 98.0569ms
but sometimes also half of that:
parking_lot::Mutex avg 36.267395ms min 12.441ms max 53.9117ms
It pretty much consistently alternates between 40 and 80 ms, with nothing in between 😵
Without Hyperthreading I also get better results if I do not increase the overall number of spins but instead insert busy-spins before the OS yield (in a similar fashion to crossbeam's Backoff):
pub fn spin(&mut self) -> bool {
    if self.counter >= 10 {
        return false;
    }
    self.counter += 1;
    if self.counter <= 3 {
        cpu_relax(1 << self.counter);
    } else {
        cpu_relax(1 << self.counter);
        thread_parker::thread_yield();
    }
    true
}
parking_lot::Mutex avg 15.98723ms min 14.0822ms max 18.4982ms
With Hyperthreading enabled, this did not make that much of a difference.
You can read Linus' thoughts about thread_yield in this post: https://www.realworldtech.com/forum/?threadid=189711&curpostid=189752.
I am also seeing the issue on my desktop:
- Intel Xeon E5-2670 x 2 (2 processors, 8 cores per processor, 2 threads per core)
- 64 GB RAM
- Linux 5.3.0-24-generic
It should be noted that this is a NUMA system as mentioned in Linus' reply to the article.
Here are the results for extreme contention (heavy, light, and no contention look okay):
$ cargo run --release 32 2 10000 100
Options {
n_threads: 32,
n_locks: 2,
n_ops: 10000,
n_rounds: 100,
}
std::sync::Mutex avg 55.576253ms min 37.536375ms max 63.957731ms
parking_lot::Mutex avg 228.590416ms min 188.487466ms max 239.946827ms
spin::Mutex avg 44.905108ms min 31.901131ms max 67.000027ms
AmdSpinlock avg 46.383674ms min 32.800095ms max 93.837092ms
I also tried some of the modifications from @Matthias247
2^15 spins
std::sync::Mutex avg 56.492339ms min 37.714216ms max 84.766978ms
parking_lot::Mutex avg 84.547093ms min 14.349955ms max 132.215521ms
spin::Mutex avg 47.759024ms min 33.021815ms max 77.174472ms
AmdSpinlock avg 46.540309ms min 32.153268ms max 68.264111ms
2^15 spins, never yield
std::sync::Mutex avg 55.81399ms min 37.438165ms max 78.098643ms
parking_lot::Mutex avg 113.635406ms min 31.291235ms max 167.27489ms
spin::Mutex avg 49.064326ms min 34.045076ms max 74.999092ms
AmdSpinlock avg 47.007803ms min 31.443733ms max 81.399413ms
2^3 spins immediately before yield
std::sync::Mutex avg 56.807845ms min 37.624857ms max 80.015346ms
parking_lot::Mutex avg 207.616094ms min 119.964334ms max 220.407735ms
spin::Mutex avg 45.066819ms min 31.970404ms max 70.265231ms
AmdSpinlock avg 45.483723ms min 33.518564ms max 65.345176ms
I can reproduce on ThreadRipper 3970X.
Options {
n_threads: 32,
n_locks: 2,
n_ops: 10000,
n_rounds: 100,
}
std::sync::Mutex avg 43.973765ms min 36.940827ms max 48.005445ms
parking_lot::Mutex avg 287.913565ms min 258.751948ms max 304.911923ms
spin::Mutex avg 25.721195ms min 19.751268ms max 48.85489ms
AmdSpinlock avg 24.204215ms min 17.663769ms max 49.178313ms
std::sync::Mutex avg 44.113866ms min 39.096522ms max 48.283289ms
parking_lot::Mutex avg 289.014664ms min 267.429467ms max 305.30055ms
spin::Mutex avg 26.11682ms min 19.584014ms max 60.147676ms
AmdSpinlock avg 24.284636ms min 17.043627ms max 67.09198ms
Ran a modified version running only the std::sync::Mutex loop under perf stat:
std::sync::Mutex avg 43.501474ms min 38.28496ms max 47.539912ms
Performance counter stats for 'cargo run --release 32 2 10000 100':
119,327.91 msec task-clock # 8.225 CPUs utilized
2,388,322 context-switches # 0.020 M/sec
25,909 cpu-migrations # 0.217 K/sec
4,142 page-faults # 0.035 K/sec
438,819,580,111 cycles # 3.677 GHz (84.48%)
332,632,165,301 stalled-cycles-frontend # 75.80% frontend cycles idle (82.91%)
10,515,010,024 stalled-cycles-backend # 2.40% backend cycles idle (82.22%)
50,102,474,576 instructions # 0.11 insn per cycle
# 6.64 stalled cycles per insn (82.12%)
11,335,523,432 branches # 94.995 M/sec (83.25%)
157,822,216 branch-misses # 1.39% of all branches (85.02%)
14.508207241 seconds time elapsed
9.607140000 seconds user
109.520591000 seconds sys
And with parking_lot::Mutex only:
parking_lot::Mutex avg 273.396674ms min 254.347881ms max 289.119739ms
Performance counter stats for 'cargo run --release 32 2 10000 100':
222,159.62 msec task-clock # 5.920 CPUs utilized
21,304,902 context-switches # 0.096 M/sec
378,781 cpu-migrations # 0.002 M/sec
4,141 page-faults # 0.019 K/sec
444,159,684,024 cycles # 1.999 GHz (83.12%)
85,009,073,534 stalled-cycles-frontend # 19.14% frontend cycles idle (83.38%)
135,860,444,447 stalled-cycles-backend # 30.59% backend cycles idle (83.66%)
184,364,949,242 instructions # 0.42 insn per cycle
# 0.74 stalled cycles per insn (83.62%)
41,792,087,226 branches # 188.117 M/sec (83.24%)
615,614,410 branch-misses # 1.47% of all branches (82.98%)
37.525030261 seconds time elapsed
49.254354000 seconds user
174.546567000 seconds sys
The GHz numbers caught my attention, on top of everything else, and empirically the CPU usage levels in htop seem to indicate less CPU used overall while the parking_lot::Mutex test runs compared to std::sync::Mutex. I haven't looked into getting more accurate data, though. Anyway, that got me wondering what the result would look like if the CPU frequency governor were changed from ondemand to performance, and that gives:
Options {
n_threads: 32,
n_locks: 2,
n_ops: 10000,
n_rounds: 100,
}
std::sync::Mutex avg 41.136917ms min 33.999931ms max 49.160068ms
parking_lot::Mutex avg 208.116968ms min 196.38674ms max 216.458502ms
spin::Mutex avg 18.496924ms min 13.221013ms max 43.77934ms
AmdSpinlock avg 15.954826ms min 11.889577ms max 39.562233ms
std::sync::Mutex avg 40.888678ms min 36.336717ms max 44.511702ms
parking_lot::Mutex avg 206.536087ms min 189.072666ms max 215.691531ms
spin::Mutex avg 18.482265ms min 14.164115ms max 63.214092ms
AmdSpinlock avg 15.733487ms min 12.418109ms max 38.475479ms
FWIW.
I'm not going to dive deeper into this right now, but I hope the data above can be useful somehow.
Are you sure you're measuring anything valid? Per Linus' response the code doesn't really measure what the blog poster thinks it's measuring.
This is mostly unrelated to the issue at hand, but, just to clear some confusion, the benchmark in this issue is from this blog post, while Linus is responding to another post. The benchmarks are different: the first tries to measure throughput, the second tries to measure latency.
This obviously doesn't mean that my benchmark isn't flawed in some terrible way as well :-)
Basically, what this test is measuring is how effective the spin back-off is. parking_lot uses an exponential back-off strategy where it doubles the amount of time it pauses between spins every time it misses the lock. This effectively allows the thread which currently holds the lock to rapidly lock/unlock without any cacheline interference from other threads.
The reason you are seeing different results on different microarchitectures basically boils down to a change Intel made in Skylake and later microarchitectures, which increased the duration of the PAUSE x86 instruction from 10 cycles to 140 cycles (relevant issue in .NET). Since I tuned this delay loop on my personal computer, which has a Skylake CPU, the delay loop is much shorter (by about 10x) on other microarchitectures.
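To put rough numbers on that (back-of-the-envelope only, using the 10-vs-140-cycle PAUSE latencies quoted above and assuming the spin phase does 2 + 4 + 8 pause-loop iterations before it starts yielding, as in the sketch earlier in the thread):

```rust
fn main() {
    // 1 << counter pause-loop iterations for counter = 1..=3.
    let pause_iterations: u32 = (1..=3).map(|c| 1u32 << c).sum(); // 14
    // Approximate cost of the whole spin phase on each microarchitecture family.
    let pre_skylake_cycles = pause_iterations * 10;   // ~140 cycles of back-off
    let skylake_cycles = pause_iterations * 140;      // ~1960 cycles of back-off
    println!(
        "{} PAUSEs: ~{} cycles pre-Skylake vs ~{} cycles on Skylake+",
        pause_iterations, pre_skylake_cycles, skylake_cycles
    );
}
```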
So from reading Linus' posts and other info, the heavily degraded performance could be related to NUMA bouncing. The original WTF::Lock (https://webkit.org/blog/6161/locking-in-webkit/) uses sched_yield(), as, it seems, does parking_lot, which uses std::thread::yield_now(), which in turn calls libc::sched_yield().
Linus talks a lot about informing the kernel about which threads are waiting on each other; are there APIs for that?
Hmm, the man page for sched_yield says:
sched_yield() is intended for use with real-time scheduling policies
(i.e., SCHED_FIFO or SCHED_RR). Use of sched_yield() with
nondeterministic scheduling policies such as SCHED_OTHER is
unspecified and very likely means your application design is broken.
The Rust documentation for std::thread::yield_now() reads very differently: https://doc.rust-lang.org/std/thread/fn.yield_now.html. Clearly there seems to be a disconnect between what kernel developers want this to be used for and what many people are actually using it for. I'm not aware of a more appropriate API for those use cases, short of sticking to kernel-supported synchronization. Is this an API hole, or a more fundamental design issue?
My current thoughts about this:
- Even if thread_yield might give the best latency/throughput in microbenchmarks, these benchmarks are flawed because they only consider themselves, so any other workload on the machine can impact them by a lot. I also don't think I've seen a benchmark that measures what impact these lock mechanisms have on other running processes or on total CPU power consumption. Doing so might make the benchmarks a little less broken. I might be wrong, though.
- thread_yield (at least on Linux) is just a hint to the scheduler. As Linus stated in another post (read the thread, it's very interesting), the scheduler might not schedule the thread you want even if you just "liberated" a slot for it to run, since that thread's affinity might be to another CPU core or, worse, to another NUMA node.

Edit: some kernel-related links about this:
- http://www.linux-magazine.com/Issues/2016/193/Kernel-News
- https://www.linuxplumbersconf.org/event/4/contributions/286/attachments/225/398/LPC-2019-OptSpin-Locks.pdf
Results on an AWS t3a.medium (EPYC 7571):
Running `target/release/lock-bench 32 2 10000 100`
Options {
n_threads: 32,
n_locks: 2,
n_ops: 10000,
n_rounds: 100,
}
std::sync::Mutex avg 15.824456ms min 13.662045ms max 17.81835ms
parking_lot::Mutex avg 6.66539ms min 1.039909ms max 7.653024ms
spin::Mutex avg 8.628183ms min 186.752µs max 45.851717ms
AmdSpinlock avg 6.841491ms min 1.073558ms max 34.397579ms
std::sync::Mutex avg 15.460703ms min 229.303µs max 17.468396ms
parking_lot::Mutex avg 6.554586ms min 629.526µs max 7.659823ms
spin::Mutex avg 9.246582ms min 206.732µs max 45.582034ms
AmdSpinlock avg 8.371108ms min 278.313µs max 39.010627ms
Results on my desktop (Ryzen 5 3600):
Running `target/release/lock-bench 32 2 10000 100`
Options {
n_threads: 32,
n_locks: 2,
n_ops: 10000,
n_rounds: 100,
}
std::sync::Mutex avg 27.942144ms min 25.432081ms max 32.041021ms
parking_lot::Mutex avg 38.694589ms min 14.405773ms max 47.409828ms
spin::Mutex avg 33.888182ms min 6.6821ms max 118.340512ms
AmdSpinlock avg 20.657143ms min 6.214608ms max 74.403512ms
std::sync::Mutex avg 27.751037ms min 25.481527ms max 29.442091ms
parking_lot::Mutex avg 39.590541ms min 10.887829ms max 46.828278ms
spin::Mutex avg 34.156485ms min 6.708959ms max 132.084823ms
AmdSpinlock avg 18.27833ms min 5.94831ms max 73.317943ms
I'm not sure why my desktop (12 threads at 4.2GHz) is so much slower across the board than EC2 (2 threads at 2.2GHz), but I can't reproduce this on either. For what it's worth, my desktop is running Linux 5.3.1 and EC2 is Linux 4.19.80. Both are running NixOS 19.09 with (mostly) identical configurations.
Results on Windows 10 (Ryzen 1800X):
Options {
n_threads: 32,
n_locks: 2,
n_ops: 10000,
n_rounds: 100,
}
std::sync::Mutex avg 31.686149ms min 29.2895ms max 40.4412ms
parking_lot::Mutex avg 119.149075ms min 82.528ms max 159.9172ms
spin::Mutex avg 52.355757ms min 48.4938ms max 82.7013ms
AmdSpinlock avg 57.400031ms min 52.13ms max 150.8743ms
std::sync::Mutex avg 31.95052ms min 29.553ms max 40.4812ms
parking_lot::Mutex avg 118.03958ms min 71.2781ms max 154.8971ms
spin::Mutex avg 52.760405ms min 44.0239ms max 135.9375ms
AmdSpinlock avg 55.782028ms min 50.2748ms max 76.1081ms
Options {
n_threads: 32,
n_locks: 64,
n_ops: 10000,
n_rounds: 100,
}
std::sync::Mutex avg 6.539282ms min 6.2849ms max 7.2896ms
parking_lot::Mutex avg 2.124488ms min 1.6108ms max 3.6931ms
spin::Mutex avg 1.562838ms min 128.5µs max 2.5823ms
AmdSpinlock avg 2.425624ms min 1.0894ms max 33.0315ms
std::sync::Mutex avg 6.434743ms min 1.0237ms max 7.1225ms
parking_lot::Mutex avg 2.247272ms min 1.695ms max 5.4078ms
spin::Mutex avg 1.883035ms min 1.3943ms max 32.1562ms
AmdSpinlock avg 2.147803ms min 1.0586ms max 3.5042ms
Options {
n_threads: 32,
n_locks: 1000,
n_ops: 10000,
n_rounds: 100,
}
std::sync::Mutex avg 3.864869ms min 2.0875ms max 4.5281ms
parking_lot::Mutex avg 1.097378ms min 953.6µs max 1.3924ms
spin::Mutex avg 2.175728ms min 563.7µs max 88.3553ms
AmdSpinlock avg 1.017223ms min 686.2µs max 1.4419ms
std::sync::Mutex avg 3.880485ms min 2.0537ms max 4.4228ms
parking_lot::Mutex avg 1.111407ms min 611.5µs max 1.5689ms
spin::Mutex avg 1.002341ms min 786.6µs max 1.9072ms
AmdSpinlock avg 1.022204ms min 718.5µs max 1.24ms
Options {
n_threads: 32,
n_locks: 1000000,
n_ops: 10000,
n_rounds: 100,
}
std::sync::Mutex avg 7.099728ms min 474.8µs max 9.2633ms
parking_lot::Mutex avg 2.24719ms min 484.1µs max 2.6865ms
spin::Mutex avg 2.107813ms min 1.9117ms max 2.4531ms
AmdSpinlock avg 2.258429ms min 1.9603ms max 2.7379ms
std::sync::Mutex avg 7.444182ms min 6.6535ms max 10.615ms
parking_lot::Mutex avg 2.262105ms min 2.015ms max 2.7102ms
spin::Mutex avg 2.098281ms min 275.1µs max 2.5487ms
AmdSpinlock avg 2.148916ms min 1.9126ms max 2.6017ms
If measuring latency is flawed, trying to measure throughput is likely flawed as well.
I would think the best action going forward is to ask for help from kernel gurus, because I don't think any of this is measuring what you think it's measuring.