
Heavily degraded performance under extreme contention on some processors

Open paulocsanz opened this issue 5 years ago • 43 comments

The blog post Mutexes Are Faster Than Spinlocks generated a reddit discussion about lock implementations. This prompted some users to benchmark parking_lot on Windows, and some of the results are very problematic, while others could simply be better (since parking_lot is always faster than spin on Linux and macOS, but not on Windows for some reason).

I did not run this benchmark myself since I don't currently have Windows installed, but since the user hadn't filed an issue I decided to post it here. Their comment: https://www.reddit.com/r/rust/comments/ejx7y8/blog_post_mutexes_are_faster_than_spinlocks/fd31um3/

The results:

Benchmark code: https://github.com/matklad/lock-bench

  • Windows 10 Pro
  • Intel Core i7-5930k @ 3.5 GHz
  • stable-x86_64-pc-windows-msvc (default)
  • rustc 1.40.0 (73528e339 2019-12-16)
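
For context, the parameters below map onto a workload roughly like the following sketch (a paraphrase based on the option names; the actual lock-bench harness differs in details such as the RNG, barriers, and timing):

use std::sync::{Arc, Mutex};
use std::time::{Duration, Instant};

// One timed round: n_threads threads each perform n_ops lock operations,
// picking one of n_locks mutexes per operation and doing a tiny bit of work
// under it; avg/min/max are then taken over n_rounds such rounds.
fn run_round(n_threads: usize, n_locks: usize, n_ops: usize) -> Duration {
    let locks: Arc<Vec<Mutex<u64>>> =
        Arc::new((0..n_locks).map(|_| Mutex::new(0)).collect());
    let start = Instant::now();
    let handles: Vec<_> = (0..n_threads)
        .map(|t| {
            let locks = Arc::clone(&locks);
            std::thread::spawn(move || {
                let mut x = t; // cheap stand-in for the benchmark's RNG
                for _ in 0..n_ops {
                    x = x.wrapping_mul(31).wrapping_add(17);
                    *locks[x % locks.len()].lock().unwrap() += 1;
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    start.elapsed()
}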

extreme contention

cargo run --release 32 2 10000 100
    Finished release [optimized] target(s) in 0.03s
     Running `target\release\lock-bench.exe 32 2 10000 100`
Options {
    n_threads: 32,
    n_locks: 2,
    n_ops: 10000,
    n_rounds: 100,
}

std::sync::Mutex     avg 32.452982ms  min 20.4146ms    max 45.2767ms
parking_lot::Mutex   avg 154.509064ms min 111.2522ms   max 180.4367ms
spin::Mutex          avg 46.3496ms    min 33.5478ms    max 56.1689ms
AmdSpinlock          avg 45.725299ms  min 32.1936ms    max 54.4236ms

std::sync::Mutex     avg 33.383154ms  min 18.2827ms    max 46.0634ms
parking_lot::Mutex   avg 134.983307ms min 95.5948ms    max 176.1896ms
spin::Mutex          avg 43.402769ms  min 31.9209ms    max 55.0075ms
AmdSpinlock          avg 39.572361ms  min 28.1705ms    max 50.2935ms

heavy contention

cargo run --release 32 64 10000 100
    Finished release [optimized] target(s) in 0.03s
     Running `target\release\lock-bench.exe 32 64 10000 100`
Options {
    n_threads: 32,
    n_locks: 64,
    n_ops: 10000,
    n_rounds: 100,
}

std::sync::Mutex     avg 12.8268ms    min 6.4807ms     max 14.174ms
parking_lot::Mutex   avg 8.470518ms   min 3.6558ms     max 10.0896ms
spin::Mutex          avg 6.356252ms   min 4.6299ms     max 8.1838ms
AmdSpinlock          avg 7.147972ms   min 5.7731ms     max 9.2027ms

std::sync::Mutex     avg 12.790879ms  min 3.7349ms     max 14.4933ms
parking_lot::Mutex   avg 8.526535ms   min 6.7143ms     max 10.0845ms
spin::Mutex          avg 5.730139ms   min 2.8063ms     max 7.6221ms
AmdSpinlock          avg 7.082415ms   min 5.2678ms     max 8.2064ms

light contention

cargo run --release 32 1000 10000 100
    Finished release [optimized] target(s) in 0.05s
     Running `target\release\lock-bench.exe 32 1000 10000 100`
Options {
    n_threads: 32,
    n_locks: 1000,
    n_ops: 10000,
    n_rounds: 100,
}

std::sync::Mutex     avg 7.736325ms   min 4.3287ms     max 9.194ms
parking_lot::Mutex   avg 4.912407ms   min 4.1386ms     max 5.9617ms
spin::Mutex          avg 3.787679ms   min 3.2468ms     max 4.8136ms
AmdSpinlock          avg 4.229783ms   min 1.0404ms     max 5.2414ms

std::sync::Mutex     avg 7.791248ms   min 6.2809ms     max 8.9858ms
parking_lot::Mutex   avg 4.933393ms   min 4.3319ms     max 6.1515ms
spin::Mutex          avg 3.782046ms   min 3.3339ms     max 5.4954ms
AmdSpinlock          avg 4.22442ms    min 3.1285ms     max 5.3338ms

no contention

cargo run --release 32 1000000 10000 100
    Finished release [optimized] target(s) in 0.03s
     Running `target\release\lock-bench.exe 32 1000000 10000 100`
Options {
    n_threads: 32,
    n_locks: 1000000,
    n_ops: 10000,
    n_rounds: 100,
}

std::sync::Mutex     avg 12.465917ms  min 8.8088ms     max 13.6216ms
parking_lot::Mutex   avg 5.164135ms   min 4.2478ms     max 6.1451ms
spin::Mutex          avg 4.112927ms   min 3.1624ms     max 5.599ms
AmdSpinlock          avg 4.302528ms   min 4.0533ms     max 5.4168ms

std::sync::Mutex     avg 11.765036ms  min 3.3567ms     max 13.5108ms
parking_lot::Mutex   avg 3.992219ms   min 2.4974ms     max 5.5604ms
spin::Mutex          avg 3.425334ms   min 2.0133ms     max 4.7788ms
AmdSpinlock          avg 3.813034ms   min 2.2009ms     max 5.0947ms

paulocsanz avatar Jan 04 '20 18:01 paulocsanz

I wonder if the result would be different if we used SwitchToThread instead of Sleep when yielding. Can you test this? I don't have a Windows system.
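
For anyone who wants to test this, a minimal sketch of the two Windows yield calls in question, using raw kernel32 declarations rather than parking_lot's actual thread_parker internals:

#[cfg(windows)]
#[link(name = "kernel32")]
extern "system" {
    fn Sleep(dwMilliseconds: u32); // Sleep(0) gives up the rest of the time slice
    fn SwitchToThread() -> i32;    // yields only to a thread ready on the current processor
}

#[cfg(windows)]
fn yield_with_sleep() {
    unsafe { Sleep(0) }
}

#[cfg(windows)]
fn yield_with_switch_to_thread() -> bool {
    // returns false if no other thread was ready to run
    unsafe { SwitchToThread() != 0 }
}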

Amanieu avatar Jan 04 '20 18:01 Amanieu

I don't have a Windows system either (the benchmark wasn't done by me), but another user got similar results on Linux, so maybe there is a common denominator.

https://www.reddit.com/r/rust/comments/ejx7y8/blog_post_mutexes_are_faster_than_spinlocks/fd3taac/

Linux system (rustc 1.41.0-nightly 2019-12-05, AMD 3900x 12 cores with SMT).

extreme contention

❯ cargo run --release 32 2 10000 100
    Finished release [optimized] target(s) in 0.00s
     Running `target/release/lock-bench 32 2 10000 100`
Options {
    n_threads: 32,
    n_locks: 2,
    n_ops: 10000,
    n_rounds: 100,
}

std::sync::Mutex     avg 39.63915ms   min 34.618755ms  max 51.911789ms 
parking_lot::Mutex   avg 222.896391ms min 214.575148ms max 226.433204ms
spin::Mutex          avg 20.253741ms  min 12.694752ms  max 38.933376ms 
AmdSpinlock          avg 17.53803ms   min 11.353536ms  max 51.322618ms 

std::sync::Mutex     avg 39.423473ms  min 33.850454ms  max 47.47324ms  
parking_lot::Mutex   avg 222.267268ms min 217.754466ms max 226.037187ms
spin::Mutex          avg 20.186599ms  min 12.566426ms  max 62.728842ms 
AmdSpinlock          avg 17.215404ms  min 11.445496ms  max 46.907045ms 

heavy contention

❯ cargo run --release 32 64 10000 100
    Finished release [optimized] target(s) in 0.00s
     Running `target/release/lock-bench 32 64 10000 100`
Options {
    n_threads: 32,
    n_locks: 64,
    n_ops: 10000,
    n_rounds: 100,
}

std::sync::Mutex     avg 8.144328ms   min 7.676202ms   max 8.855408ms  
parking_lot::Mutex   avg 6.590482ms   min 1.666855ms   max 8.721845ms  
spin::Mutex          avg 15.085528ms  min 1.510395ms   max 42.460191ms 
AmdSpinlock          avg 9.331913ms   min 1.681545ms   max 24.24093ms  

std::sync::Mutex     avg 8.117876ms   min 7.600261ms   max 8.398674ms  
parking_lot::Mutex   avg 5.605486ms   min 1.647048ms   max 8.671342ms  
spin::Mutex          avg 12.872803ms  min 1.517989ms   max 39.331793ms 
AmdSpinlock          avg 8.278936ms   min 1.779218ms   max 34.416964ms 

light contention

❯ cargo run --release 32 1000 10000 100
    Finished release [optimized] target(s) in 0.00s
     Running `target/release/lock-bench 32 1000 10000 100`
Options {
    n_threads: 32,
    n_locks: 1000,
    n_ops: 10000,
    n_rounds: 100,
}

std::sync::Mutex     avg 4.673801ms   min 4.271466ms   max 5.416596ms  
parking_lot::Mutex   avg 1.379981ms   min 1.12888ms    max 1.714049ms  
spin::Mutex          avg 14.5374ms    min 1.050929ms   max 46.961405ms 
AmdSpinlock          avg 8.405825ms   min 1.172899ms   max 31.04467ms  

std::sync::Mutex     avg 4.660858ms   min 4.333317ms   max 5.126614ms  
parking_lot::Mutex   avg 1.379758ms   min 1.176389ms   max 1.749378ms  
spin::Mutex          avg 14.796396ms  min 1.039289ms   max 38.121532ms 
AmdSpinlock          avg 7.045806ms   min 1.189589ms   max 32.977048ms 

no contention

❯ cargo run --release 32 1000000 10000 100
    Finished release [optimized] target(s) in 0.00s
     Running `target/release/lock-bench 32 1000000 10000 100`
Options {
    n_threads: 32,
    n_locks: 1000000,
    n_ops: 10000,
    n_rounds: 100,
}

std::sync::Mutex     avg 5.488052ms   min 4.789075ms   max 5.913014ms  
parking_lot::Mutex   avg 1.570826ms   min 1.294428ms   max 1.826788ms  
spin::Mutex          avg 1.383231ms   min 1.162079ms   max 1.678709ms  
AmdSpinlock          avg 1.363113ms   min 1.120449ms   max 1.582918ms  

std::sync::Mutex     avg 5.525267ms   min 4.877406ms   max 5.907605ms  
parking_lot::Mutex   avg 1.586628ms   min 1.317512ms   max 2.012493ms  
spin::Mutex          avg 1.388559ms   min 1.231672ms   max 1.603962ms  
AmdSpinlock          avg 1.38805ms    min 1.145911ms   max 1.590503ms

paulocsanz avatar Jan 04 '20 18:01 paulocsanz

I tested on my machine on Linux and cannot reproduce. Maybe it's a property of the whole system rather than just the operating system.

I'm using Linux, rustc 1.40.0 (73528e339 2019-12-16) and an Intel i5-8250U (8 cores).

I realize it's not ideal that nobody has been able to reproduce this, but the results were so unexpected that they deserved a closer investigation, especially since parking_lot will be integrated into std.

paulocsanz avatar Jan 04 '20 19:01 paulocsanz

I ran the same tests on Windows instead of Linux. The results are here.

Also a run of the "extreme contention" case with SwitchToThread instead of Sleep:

     Running `target\release\lock-bench.exe 32 2 10000 100`
Options {
    n_threads: 32,
    n_locks: 2,
    n_ops: 10000,
    n_rounds: 100,
}

std::sync::Mutex     avg 48.186484ms  min 46.201ms     max 68.707ms
parking_lot::Mutex   avg 190.429647ms min 185.9003ms   max 195.4049ms
spin::Mutex          avg 8.325163ms   min 7.919ms      max 9.3455ms
AmdSpinlock          avg 7.891016ms   min 7.5464ms     max 9.2582ms

std::sync::Mutex     avg 49.480777ms  min 45.7975ms    max 68.8675ms
parking_lot::Mutex   avg 192.0885ms   min 188.0831ms   max 195.9926ms
spin::Mutex          avg 8.310387ms   min 8.0344ms     max 10.1583ms
AmdSpinlock          avg 7.894169ms   min 7.5514ms     max 8.7815ms

mschlumpp avatar Jan 04 '20 20:01 mschlumpp

I've run the benchmark on Windows (the same machine from the blog post, just a different OS), and can't reproduce the perf degradation of parking_lot.

matklad avatar Jan 04 '20 20:01 matklad

@cynecx yes that's correct.

I just tried it on some other hardware: i7-4770T (Haswell, Desktop)

lock-bench> cargo run --release 32 2 10000 100
    Finished release [optimized] target(s) in 0.00s
     Running `target/release/lock-bench 32 2 10000 100`
Options {
    n_threads: 32,
    n_locks: 2,
    n_ops: 10000,
    n_rounds: 100,
}

std::sync::Mutex     avg 29.457212ms  min 25.718361ms  max 31.862572ms 
parking_lot::Mutex   avg 14.884388ms  min 11.719785ms  max 17.62964ms  
spin::Mutex          avg 41.307676ms  min 20.261094ms  max 94.301218ms 
AmdSpinlock          avg 45.520435ms  min 21.818796ms  max 135.193445ms

std::sync::Mutex     avg 29.411401ms  min 24.678649ms  max 32.887241ms 
parking_lot::Mutex   avg 15.304354ms  min 11.821305ms  max 17.14152ms  
spin::Mutex          avg 39.935237ms  min 18.339029ms  max 103.907344ms
AmdSpinlock          avg 45.176324ms  min 18.215179ms  max 99.126875ms 

i5-6200U (Skylake, Notebook)

     Running `target/release/lock-bench 32 2 10000 100`
Options {
    n_threads: 32,
    n_locks: 2,
    n_ops: 10000,
    n_rounds: 100,
}

std::sync::Mutex     avg 111.925873ms min 31.216135ms  max 128.665372ms
parking_lot::Mutex   avg 37.560599ms  min 33.474334ms  max 49.060267ms 
spin::Mutex          avg 121.052961ms min 24.271047ms  max 187.949351ms
AmdSpinlock          avg 122.133179ms min 16.167021ms  max 178.323671ms

std::sync::Mutex     avg 113.942109ms min 104.396629ms max 123.955679ms
parking_lot::Mutex   avg 37.321911ms  min 33.046695ms  max 43.749566ms 
spin::Mutex          avg 125.359875ms min 24.666258ms  max 182.409344ms
AmdSpinlock          avg 117.566367ms min 16.328205ms  max 176.297166ms

The Intel Core i7-5930k of the initial Windows post seems to be a Haswell-E.

mschlumpp avatar Jan 04 '20 20:01 mschlumpp

@matklad It seems to be related to the microarchitecture in use. @theunknownxy's results show the same degradation on both OSes (Windows and Linux).

EDIT: I've accidentally removed my previous comment.

I am not entirely sure, but perhaps this has to do with the SpinWait? Its spin-wait uses spin_loop_hint, which basically lowers to a pause instruction, and it's well known that the pause instruction's latency varies across microarchitectures.

Perhaps someone could experiment with lowering the condition here: https://github.com/Amanieu/parking_lot/blob/8f69413ae8271a2a2e9067e61392938f6dadd59c/core/src/spinwait.rs#L53
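
For reference, the spin routine at that line amounts to roughly the following (a standalone approximation that substitutes std equivalents for cpu_relax and thread_parker::thread_yield; see the link above for the exact code):

struct SpinWait {
    counter: u32,
}

impl SpinWait {
    fn spin(&mut self) -> bool {
        if self.counter >= 10 {
            return false; // caller gives up spinning and parks the thread
        }
        self.counter += 1;
        if self.counter <= 3 {
            // exponential busy-wait: 2, 4, 8 PAUSE iterations
            for _ in 0..(1u32 << self.counter) {
                std::sync::atomic::spin_loop_hint();
            }
        } else {
            std::thread::yield_now(); // yield to the OS scheduler
        }
        true
    }
}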

cynecx avatar Jan 04 '20 21:01 cynecx

@cynecx Lowering the value in the condition does not seem to have an impact on the results. With self.counter <= 1:

     Running `target\release\lock-bench.exe 32 2 10000 100`
Options {
    n_threads: 32,
    n_locks: 2,
    n_ops: 10000,
    n_rounds: 100,
}

std::sync::Mutex     avg 49.474712ms  min 46.0883ms    max 69.2954ms
parking_lot::Mutex   avg 191.411675ms min 187.1015ms   max 194.4445ms
spin::Mutex          avg 8.319929ms   min 8.0446ms     max 9.2257ms
AmdSpinlock          avg 7.870013ms   min 7.4975ms     max 8.8647ms

std::sync::Mutex     avg 48.940684ms  min 46.2259ms    max 68.8908ms
parking_lot::Mutex   avg 191.403276ms min 187.6034ms   max 194.5025ms
spin::Mutex          avg 8.540062ms   min 7.9655ms     max 33.9991ms
AmdSpinlock          avg 7.869302ms   min 7.6215ms     max 8.4093ms

always using thread_yield (ignoring the counter):

     Running `target\release\lock-bench.exe 32 2 10000 100`
Options {
    n_threads: 32,
    n_locks: 2,
    n_ops: 10000,
    n_rounds: 100,
}

std::sync::Mutex     avg 48.54949ms   min 46.0273ms    max 68.669ms
parking_lot::Mutex   avg 189.396441ms min 184.3732ms   max 193.476ms
spin::Mutex          avg 8.366757ms   min 7.4263ms     max 10.4656ms
AmdSpinlock          avg 7.893677ms   min 7.4782ms     max 9.6092ms

std::sync::Mutex     avg 48.805392ms  min 46.4548ms    max 68.7367ms
parking_lot::Mutex   avg 189.496029ms min 184.8406ms   max 194.1087ms
spin::Mutex          avg 8.632158ms   min 7.6492ms     max 40.5276ms
AmdSpinlock          avg 7.874248ms   min 7.5315ms     max 8.7431ms

mschlumpp avatar Jan 04 '20 21:01 mschlumpp

Linux 4.15.0-72, AMD Ryzen 7 2700x (8 physical cores, SMT enabled), "extreme contention":

std::sync::Mutex     avg 32.08904ms   min 28.112784ms  max 33.733244ms 
parking_lot::Mutex   avg 148.294697ms min 141.888311ms max 150.374082ms
spin::Mutex          avg 87.941173ms  min 46.837399ms  max 158.922402ms
AmdSpinlock          avg 78.83599ms   min 47.9706ms    max 133.216168ms

newpavlov avatar Jan 04 '20 23:01 newpavlov

I'd recommend looking closer at AMD, especially ZEN based machines, as they seem to be the common denominator so far.

Voultapher avatar Jan 05 '20 00:01 Voultapher

I'd recommend looking closer at AMD, especially ZEN based machines, as they seem to be the common denominator so far.

@Voultapher

The first Windows benchmark is on an Intel CPU, though that specific model may behave similarly to most AMD parts. Possibly related to Hyperthreading/SMT: maybe if SMT is disabled and the bench is re-run, the performance degradation for parking_lot will disappear?

paulocsanz avatar Jan 05 '20 01:01 paulocsanz

Disabling SMT improves the situation, but the problem is still present:

std::sync::Mutex     avg 30.765218ms  min 27.16788ms   max 34.434105ms 
parking_lot::Mutex   avg 80.858621ms  min 25.013774ms  max 112.516787ms
spin::Mutex          avg 73.487341ms  min 24.467014ms  max 178.773357ms
AmdSpinlock          avg 72.41482ms   min 28.667988ms  max 173.171411ms

newpavlov avatar Jan 05 '20 01:01 newpavlov

Well, I don't have access to an AMD Ryzen (and I can't find a cloud service that specifically offers them), but for those who do, perhaps some profiling (e.g. perf) could shed some light?

cynecx avatar Jan 05 '20 01:01 cynecx

Well, it seems that EC2 does have AMD instances (AMD EPYC 7571), so I got this:

System:

  • EC2 t3a.2xlarge
  • 8 vCores (AMD EPYC 7571)
  • 32GB RAM
  • Linux: 4.15.0-1051-aws
Options {
    n_threads: 32,
    n_locks: 2,
    n_ops: 10000,
    n_rounds: 100,
}

std::sync::Mutex     avg 39.300861ms  min 35.752879ms  max 41.433415ms
parking_lot::Mutex   avg 36.729054ms  min 35.275409ms  max 38.59546ms
spin::Mutex          avg 92.0057ms    min 40.678487ms  max 206.923339ms
AmdSpinlock          avg 70.080664ms  min 34.398632ms  max 193.121841ms

std::sync::Mutex     avg 39.521573ms  min 36.72074ms   max 41.15228ms
parking_lot::Mutex   avg 36.669689ms  min 34.570288ms  max 38.946108ms
spin::Mutex          avg 96.520724ms  min 40.866666ms  max 207.652942ms
AmdSpinlock          avg 72.529016ms  min 33.687999ms  max 196.055792ms
Options {
    n_threads: 8,
    n_locks: 2,
    n_ops: 10000,
    n_rounds: 100,
}

std::sync::Mutex     avg 9.708629ms   min 8.964155ms   max 10.842344ms
parking_lot::Mutex   avg 8.112247ms   min 7.52242ms    max 8.667363ms
spin::Mutex          avg 10.472547ms  min 7.978397ms   max 11.968051ms
AmdSpinlock          avg 8.328134ms   min 7.276999ms   max 23.551344ms

std::sync::Mutex     avg 9.768446ms   min 8.710725ms   max 11.195713ms
parking_lot::Mutex   avg 8.066572ms   min 6.970456ms   max 8.696175ms
spin::Mutex          avg 10.505385ms  min 7.947568ms   max 11.267302ms
AmdSpinlock          avg 8.161658ms   min 5.269307ms   max 8.675885ms

Doesn't seem to be affected :/

EDIT:

Also, according to AMD's SKU naming scheme, this CPU is basically based on first-generation Zen; however, most of the problematic benchmarks above were run on second-generation Zen.

cynecx avatar Jan 05 '20 03:01 cynecx

Here are some results from a 3900x on windows:

$ cargo run --release 32 2 10000 100
std::sync::Mutex     avg 46.573633ms  min 44.3294ms    max 65.4726ms
parking_lot::Mutex   avg 181.645635ms min 106.3233ms   max 185.5278ms
spin::Mutex          avg 8.439861ms   min 7.9094ms     max 10.1592ms
AmdSpinlock          avg 7.834648ms   min 7.4119ms     max 8.2538ms

It's obvious that the spinlock solution would do fairly well on that kind of machine: there are plenty of cores around, so the spinning threads don't need to get preempted that often. However, the parking_lot vs std::sync::Mutex difference is really big.

I think the conclusion that it's AMD-specific didn't hold true so far, since the initial post used an Intel CPU. And it also doesn't seem to be constrained to Windows, since someone on a 3900x on Linux also observed a degradation.

Interesting. I will try to play around with some settings.

Matthias247 avatar Jan 05 '20 06:01 Matthias247

SwitchToThread vs Sleep makes no difference for me.

I got a large improvement by increasing the number of spins before a sleep. But it took a really big increase, so my concern is that I have thereby just degraded it into the spinlock:

pub fn spin(&mut self) -> bool {
    if self.counter >= 20 {
        // give up after 20 attempts instead of the original 10
        return false;
    }
    self.counter += 1;
    if self.counter <= 15 {
        // busy-wait for up to 2^15 iterations instead of the original 2^3
        cpu_relax(1 << self.counter);
    } else {
        thread_parker::thread_yield();
    }
    true
}

Result:

parking_lot::Mutex avg 12.178126ms min 11.6343ms max 14.1958ms

So this goes from 2^3 spins to 2^15. Even with 2^9 there was no noticeable improvement. Increasing the number of thread_parker::thread_yield() calls always decreases the performance. Therefore it seems that this kind of yielding to the OS might be inefficient.

E.g. going to

if self.counter >= 16 {
    return false;
}

which avoids all thread_yield calls, just changes the result to

parking_lot::Mutex avg 12.370461ms min 11.8578ms max 14.3432ms

Matthias247 avatar Jan 05 '20 06:01 Matthias247

With Hyperthreading disabled I usually get

parking_lot::Mutex avg 80.075459ms min 47.6624ms max 98.0569ms

but sometimes also half of that:

parking_lot::Mutex avg 36.267395ms min 12.441ms max 53.9117ms

It pretty much consistently alternates between 40 and 80 ms, with nothing in between 😵

Without Hyperthreading I also get better results if I do not increase the overall number of spins, but instead insert busy-spins before the OS yield (in a similar fashion to what crossbeam's Backoff does):

pub fn spin(&mut self) -> bool {
    if self.counter >= 10 {
        return false;
    }
    self.counter += 1;
    if self.counter <= 3 {
        cpu_relax(1 << self.counter);
    } else {
        // busy-spin first, then also yield to the OS
        cpu_relax(1 << self.counter);
        thread_parker::thread_yield();
    }
    true
}

parking_lot::Mutex avg 15.98723ms min 14.0822ms max 18.4982ms

With Hyperthreading enabled this actually did not make much of a difference.
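
For reference, crossbeam's Backoff follows the pattern described above; a minimal usage sketch (crossbeam-utils crate, not parking_lot code):

use std::sync::atomic::{AtomicBool, Ordering};
use crossbeam_utils::Backoff;

// Wait until `ready` flips to true: the first few snooze() calls busy-spin
// (PAUSE), later ones also yield to the OS scheduler.
fn wait_until(ready: &AtomicBool) {
    let backoff = Backoff::new();
    while !ready.load(Ordering::Acquire) {
        backoff.snooze();
    }
}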

Matthias247 avatar Jan 05 '20 07:01 Matthias247

You can read Linus' thoughts about thread_yield in this post: https://www.realworldtech.com/forum/?threadid=189711&curpostid=189752

Speedy37 avatar Jan 09 '20 03:01 Speedy37

I am also seeing the issue on my desktop:

  • Intel Xeon E5-2670 x 2 (2 processors, 8 cores per processor, 2 threads per core)

  • 64 GB RAM

  • Linux 5.3.0-24-generic

It should be noted that this is a NUMA system as mentioned in Linus' reply to the article.

Here are the results for extreme contention (heavy, light, and no contention look okay):

$ cargo run --release 32 2 10000 100

Options {
    n_threads: 32,
    n_locks: 2,
    n_ops: 10000,
    n_rounds: 100,
}

std::sync::Mutex     avg 55.576253ms  min 37.536375ms  max 63.957731ms 
parking_lot::Mutex   avg 228.590416ms min 188.487466ms max 239.946827ms
spin::Mutex          avg 44.905108ms  min 31.901131ms  max 67.000027ms 
AmdSpinlock          avg 46.383674ms  min 32.800095ms  max 93.837092ms 

I also tried some of the modifications from @Matthias247

2^15 spins

std::sync::Mutex     avg 56.492339ms  min 37.714216ms  max 84.766978ms 
parking_lot::Mutex   avg 84.547093ms  min 14.349955ms  max 132.215521ms
spin::Mutex          avg 47.759024ms  min 33.021815ms  max 77.174472ms 
AmdSpinlock          avg 46.540309ms  min 32.153268ms  max 68.264111ms 

2^15 spins, never yield

std::sync::Mutex     avg 55.81399ms   min 37.438165ms  max 78.098643ms 
parking_lot::Mutex   avg 113.635406ms min 31.291235ms  max 167.27489ms 
spin::Mutex          avg 49.064326ms  min 34.045076ms  max 74.999092ms 
AmdSpinlock          avg 47.007803ms  min 31.443733ms  max 81.399413ms 

2^3 spins immediately before yield

std::sync::Mutex     avg 56.807845ms  min 37.624857ms  max 80.015346ms 
parking_lot::Mutex   avg 207.616094ms min 119.964334ms max 220.407735ms
spin::Mutex          avg 45.066819ms  min 31.970404ms  max 70.265231ms 
AmdSpinlock          avg 45.483723ms  min 33.518564ms  max 65.345176ms 

jackguo380 avatar Jan 09 '20 05:01 jackguo380

I can reproduce on ThreadRipper 3970X.

Options {
    n_threads: 32,
    n_locks: 2,
    n_ops: 10000,
    n_rounds: 100,
}

std::sync::Mutex     avg 43.973765ms  min 36.940827ms  max 48.005445ms 
parking_lot::Mutex   avg 287.913565ms min 258.751948ms max 304.911923ms
spin::Mutex          avg 25.721195ms  min 19.751268ms  max 48.85489ms  
AmdSpinlock          avg 24.204215ms  min 17.663769ms  max 49.178313ms 

std::sync::Mutex     avg 44.113866ms  min 39.096522ms  max 48.283289ms 
parking_lot::Mutex   avg 289.014664ms min 267.429467ms max 305.30055ms 
spin::Mutex          avg 26.11682ms   min 19.584014ms  max 60.147676ms 
AmdSpinlock          avg 24.284636ms  min 17.043627ms  max 67.09198ms  

Ran a modified version running only one std::sync::Mutex loop under perf stat:

std::sync::Mutex     avg 43.501474ms  min 38.28496ms   max 47.539912ms 

 Performance counter stats for 'cargo run --release 32 2 10000 100':

        119,327.91 msec task-clock                #    8.225 CPUs utilized          
         2,388,322      context-switches          #    0.020 M/sec                  
            25,909      cpu-migrations            #    0.217 K/sec                  
             4,142      page-faults               #    0.035 K/sec                  
   438,819,580,111      cycles                    #    3.677 GHz                      (84.48%)
   332,632,165,301      stalled-cycles-frontend   #   75.80% frontend cycles idle     (82.91%)
    10,515,010,024      stalled-cycles-backend    #    2.40% backend cycles idle      (82.22%)
    50,102,474,576      instructions              #    0.11  insn per cycle         
                                                  #    6.64  stalled cycles per insn  (82.12%)
    11,335,523,432      branches                  #   94.995 M/sec                    (83.25%)
       157,822,216      branch-misses             #    1.39% of all branches          (85.02%)

      14.508207241 seconds time elapsed

       9.607140000 seconds user
     109.520591000 seconds sys

And with parking_lot::Mutex only:

parking_lot::Mutex   avg 273.396674ms min 254.347881ms max 289.119739ms

 Performance counter stats for 'cargo run --release 32 2 10000 100':

        222,159.62 msec task-clock                #    5.920 CPUs utilized          
        21,304,902      context-switches          #    0.096 M/sec                  
           378,781      cpu-migrations            #    0.002 M/sec                  
             4,141      page-faults               #    0.019 K/sec                  
   444,159,684,024      cycles                    #    1.999 GHz                      (83.12%)
    85,009,073,534      stalled-cycles-frontend   #   19.14% frontend cycles idle     (83.38%)
   135,860,444,447      stalled-cycles-backend    #   30.59% backend cycles idle      (83.66%)
   184,364,949,242      instructions              #    0.42  insn per cycle         
                                                  #    0.74  stalled cycles per insn  (83.62%)
    41,792,087,226      branches                  #  188.117 M/sec                    (83.24%)
       615,614,410      branch-misses             #    1.47% of all branches          (82.98%)

      37.525030261 seconds time elapsed

      49.254354000 seconds user
     174.546567000 seconds sys

The GHz numbers caught my attention, on top of everything else, and empirically the CPU usage levels during a run in htop seem to indicate less CPU used overall while the parking_lot::Mutex test runs, compared to std::sync::Mutex. I haven't looked into getting more accurate data, though. Anyway, that got me wondering what the results would look like if the CPU frequency governor was changed from ondemand to performance, which yields:

Options {
    n_threads: 32,
    n_locks: 2,
    n_ops: 10000,
    n_rounds: 100,
}

std::sync::Mutex     avg 41.136917ms  min 33.999931ms  max 49.160068ms 
parking_lot::Mutex   avg 208.116968ms min 196.38674ms  max 216.458502ms
spin::Mutex          avg 18.496924ms  min 13.221013ms  max 43.77934ms  
AmdSpinlock          avg 15.954826ms  min 11.889577ms  max 39.562233ms 

std::sync::Mutex     avg 40.888678ms  min 36.336717ms  max 44.511702ms 
parking_lot::Mutex   avg 206.536087ms min 189.072666ms max 215.691531ms
spin::Mutex          avg 18.482265ms  min 14.164115ms  max 63.214092ms 
AmdSpinlock          avg 15.733487ms  min 12.418109ms  max 38.475479ms 

FWIW.

I'm not going to dive deeper into this right now, but I hope the data above can be useful somehow.

glandium avatar Jan 09 '20 06:01 glandium

Are you sure you're measuring anything valid? Per Linus' response the code doesn't really measure what the blog poster thinks it's measuring.

DanielJoyce avatar Jan 09 '20 09:01 DanielJoyce

This is mostly unrelated to the issue at hand but, just to clear up some confusion, the benchmark in this issue is from this blog post, while Linus is responding to another post. The benchmarks are different: the first tries to measure throughput, the second tries to measure latency.

This obviously doesn't mean that my benchmark isn't flawed in some terrible way as well :-)

matklad avatar Jan 09 '20 11:01 matklad

Basically what this test is measuring is how effective the spin back-off is. parking_lot uses an exponential back-off strategy where it doubles the amount of time it pauses between spins every time it misses the lock. This effectively allows the thread which currently holds the lock to rapidly lock/unlock without any cacheline interference from other threads.

The reason you are seeing different results on different microarchitectures basically boils down to a change Intel made in the Skylake and later microarchitectures, which increased the duration of the x86 PAUSE instruction from 10 cycles to 140 cycles (relevant issue in .NET). Since I tuned this delay loop on my personal computer, which has a Skylake CPU, the delay loop is much shorter (by about 10x) on other microarchitectures.
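
To put rough numbers on that (a back-of-the-envelope sketch, not a measurement):

// Illustrative arithmetic only; exact PAUSE latencies vary by part.
// The longest pure-spin stage of the back-off is on the order of 2^3 = 8 PAUSEs:
//   Skylake and later: 8 * ~140 cycles ≈ 1120 cycles of back-off
//   older Intel parts: 8 * ~10 cycles  ≈   80 cycles (Zen's PAUSE is shorter still)
// so a delay loop tuned on Skylake backs off roughly an order of magnitude less
// on other microarchitectures before it starts yielding to the OS.
fn main() {
    let pauses = 1u32 << 3;
    println!("Skylake+:    ~{} cycles", pauses * 140);
    println!("pre-Skylake: ~{} cycles", pauses * 10);
}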

Amanieu avatar Jan 09 '20 11:01 Amanieu

So, from reading Linus' posts and other info, the heavily degraded performance could be related to NUMA bouncing. The original WTF::Lock https://webkit.org/blog/6161/locking-in-webkit/ uses sched_yield(), as, it seems, does parking_lot, which uses std::thread::yield_now(), which in turn uses libc::sched_yield().
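
A small sketch of that call chain on Linux (the direct libc call is shown only for illustration; std's yield_now is the portable wrapper):

// Two spellings of the same Linux syscall mentioned above.
fn yield_via_std() {
    std::thread::yield_now(); // on Linux this ends up in sched_yield(2)
}

fn yield_via_libc() {
    unsafe {
        libc::sched_yield(); // direct call through the libc crate
    }
}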

Linus talks a lot about informing the kernel about which threads are waiting on each other; are there APIs for that?

Voultapher avatar Jan 09 '20 11:01 Voultapher

Hmm, the man page for sched_yield says:

sched_yield() is intended for use with real-time scheduling policies
(i.e., SCHED_FIFO or SCHED_RR).  Use of sched_yield() with
nondeterministic scheduling policies such as SCHED_OTHER is
unspecified and very likely means your application design is broken.

The Rust documentation for std::thread::yield_now() reads very differently: https://doc.rust-lang.org/std/thread/fn.yield_now.html. Clearly there seems to be a disconnect between what kernel developers want this to be used for and what many people are really using it for. I'm not aware of a more appropriate API for those use cases, short of sticking to kernel-supported synchronization. Is this an API hole, or a more fundamental design issue?

Voultapher avatar Jan 09 '20 11:01 Voultapher

My current thoughts about this:

  • Even if thread_yield might produce the best latency/throughput in microbenchmarks, these benchmarks are flawed because they only consider themselves, so any other workload on the machine can impact them a lot. And I don't think I've seen a benchmark that tries to measure what impact these locking mechanisms have on other running processes or on total CPU power consumption. Doing so might make benchmarks a little less broken. I might be wrong, though.
  • thread_yield (at least on Linux) is just a hint to the scheduler. As Linus stated in another post (read the thread, it's very interesting), the scheduler might not schedule the thread you want even if you just "liberated" a slot for it to run, because that other thread's affinity might be to another CPU core or, worse, another NUMA node.

Edit: some kernel-related links about this:

  • http://www.linux-magazine.com/Issues/2016/193/Kernel-News
  • https://www.linuxplumbersconf.org/event/4/contributions/286/attachments/225/398/LPC-2019-OptSpin-Locks.pdf

Speedy37 avatar Jan 09 '20 12:01 Speedy37

Results on an AWS t3a.medium (EPYC 7571):

     Running `target/release/lock-bench 32 2 10000 100`                                                                                      
Options {                                                                                                                                    
    n_threads: 32,                                                                                                                           
    n_locks: 2,                                                                                                                              
    n_ops: 10000,                                                                                                                            
    n_rounds: 100,                                                                                                                           
}                                                                                                                                            

std::sync::Mutex     avg 15.824456ms  min 13.662045ms  max 17.81835ms                                                                        
parking_lot::Mutex   avg 6.66539ms    min 1.039909ms   max 7.653024ms                                                                        
spin::Mutex          avg 8.628183ms   min 186.752µs    max 45.851717ms                                                                       
AmdSpinlock          avg 6.841491ms   min 1.073558ms   max 34.397579ms                                                                       

std::sync::Mutex     avg 15.460703ms  min 229.303µs    max 17.468396ms                                                                       
parking_lot::Mutex   avg 6.554586ms   min 629.526µs    max 7.659823ms                                                                        
spin::Mutex          avg 9.246582ms   min 206.732µs    max 45.582034ms                                                                       
AmdSpinlock          avg 8.371108ms   min 278.313µs    max 39.010627ms                                                                       

Results on my desktop (Ryzen 5 3600):

     Running `target/release/lock-bench 32 2 10000 100`                                                                                      
Options {                                                                                                                                    
    n_threads: 32,                                                                                                                           
    n_locks: 2,                                                                                                                              
    n_ops: 10000,                                                                                                                            
    n_rounds: 100,                                                                                                                           
}                                                                                                                                            

std::sync::Mutex     avg 27.942144ms  min 25.432081ms  max 32.041021ms                                                                       
parking_lot::Mutex   avg 38.694589ms  min 14.405773ms  max 47.409828ms                                                                       
spin::Mutex          avg 33.888182ms  min 6.6821ms     max 118.340512ms                                                                      
AmdSpinlock          avg 20.657143ms  min 6.214608ms   max 74.403512ms                                                                       

std::sync::Mutex     avg 27.751037ms  min 25.481527ms  max 29.442091ms                                                                       
parking_lot::Mutex   avg 39.590541ms  min 10.887829ms  max 46.828278ms                                                                       
spin::Mutex          avg 34.156485ms  min 6.708959ms   max 132.084823ms                                                                      
AmdSpinlock          avg 18.27833ms   min 5.94831ms    max 73.317943ms                                                                       

I'm not sure why my desktop (12 threads at 4.2GHz) is so much slower across the board than EC2 (2 threads at 2.2GHz), but I can't reproduce this on either. For what it's worth, my desktop is running Linux 5.3.1 and EC2 is Linux 4.19.80. Both are running NixOS 19.09 with (mostly) identical configurations.

leo60228 avatar Jan 09 '20 13:01 leo60228

Results on Windows 10 (Ryzen 1800X):

Options {
    n_threads: 32,
    n_locks: 2,
    n_ops: 10000,
    n_rounds: 100,
}

std::sync::Mutex     avg 31.686149ms  min 29.2895ms    max 40.4412ms
parking_lot::Mutex   avg 119.149075ms min 82.528ms     max 159.9172ms
spin::Mutex          avg 52.355757ms  min 48.4938ms    max 82.7013ms
AmdSpinlock          avg 57.400031ms  min 52.13ms      max 150.8743ms

std::sync::Mutex     avg 31.95052ms   min 29.553ms     max 40.4812ms
parking_lot::Mutex   avg 118.03958ms  min 71.2781ms    max 154.8971ms
spin::Mutex          avg 52.760405ms  min 44.0239ms    max 135.9375ms
AmdSpinlock          avg 55.782028ms  min 50.2748ms    max 76.1081ms
Options {
    n_threads: 32,
    n_locks: 64,
    n_ops: 10000,
    n_rounds: 100,
}

std::sync::Mutex     avg 6.539282ms   min 6.2849ms     max 7.2896ms
parking_lot::Mutex   avg 2.124488ms   min 1.6108ms     max 3.6931ms
spin::Mutex          avg 1.562838ms   min 128.5µs      max 2.5823ms
AmdSpinlock          avg 2.425624ms   min 1.0894ms     max 33.0315ms

std::sync::Mutex     avg 6.434743ms   min 1.0237ms     max 7.1225ms
parking_lot::Mutex   avg 2.247272ms   min 1.695ms      max 5.4078ms
spin::Mutex          avg 1.883035ms   min 1.3943ms     max 32.1562ms
AmdSpinlock          avg 2.147803ms   min 1.0586ms     max 3.5042ms
Options {
    n_threads: 32,
    n_locks: 1000,
    n_ops: 10000,
    n_rounds: 100,
}

std::sync::Mutex     avg 3.864869ms   min 2.0875ms     max 4.5281ms
parking_lot::Mutex   avg 1.097378ms   min 953.6µs      max 1.3924ms
spin::Mutex          avg 2.175728ms   min 563.7µs      max 88.3553ms
AmdSpinlock          avg 1.017223ms   min 686.2µs      max 1.4419ms

std::sync::Mutex     avg 3.880485ms   min 2.0537ms     max 4.4228ms
parking_lot::Mutex   avg 1.111407ms   min 611.5µs      max 1.5689ms
spin::Mutex          avg 1.002341ms   min 786.6µs      max 1.9072ms
AmdSpinlock          avg 1.022204ms   min 718.5µs      max 1.24ms
Options {
    n_threads: 32,
    n_locks: 1000000,
    n_ops: 10000,
    n_rounds: 100,
}

std::sync::Mutex     avg 7.099728ms   min 474.8µs      max 9.2633ms
parking_lot::Mutex   avg 2.24719ms    min 484.1µs      max 2.6865ms
spin::Mutex          avg 2.107813ms   min 1.9117ms     max 2.4531ms
AmdSpinlock          avg 2.258429ms   min 1.9603ms     max 2.7379ms

std::sync::Mutex     avg 7.444182ms   min 6.6535ms     max 10.615ms
parking_lot::Mutex   avg 2.262105ms   min 2.015ms      max 2.7102ms
spin::Mutex          avg 2.098281ms   min 275.1µs      max 2.5487ms
AmdSpinlock          avg 2.148916ms   min 1.9126ms     max 2.6017ms

Speedy37 avatar Jan 09 '20 13:01 Speedy37

If measuring latency is flawed, trying to measure throughput is likely flawed as well.

DanielJoyce avatar Jan 09 '20 18:01 DanielJoyce

I would think the best action going forward is to ask for help from kernel gurus, because I don't think any of this is measuring what you think it is.

DanielJoyce avatar Jan 09 '20 18:01 DanielJoyce