context-switch
Data for different CPU architectures
It seems that the results vary a lot across different CPU architectures.
Testing on an Ubuntu VM (kernel version 5.4.0-65-generic) running on an Apple M1 with the thread-brigade and async-brigade tests:
$ /bin/time ../target/release/async-brigade
500 tasks, 10000 iterations:
mean 572.666µs per iteration, stddev 10.912µs (1145.000ns per task per iter)
2.56user 3.26system 0:05.83elapsed 99%CPU (0avgtext+0avgdata 3964maxresident)k
0inputs+0outputs (0major+399minor)pagefaults 0swaps
$ /bin/time ../target/release/thread-brigade
500 tasks, 10000 iterations:
mean 7.104ms per iteration, stddev 226.822µs (14.208µs per task per iter)
7.09user 78.75system 1:11.91elapsed 119%CPU (0avgtext+0avgdata 8340maxresident)k
0inputs+0outputs (0major+1523minor)pagefaults 0swaps
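For readers unfamiliar with the benchmark, the brigade programs chain tasks through OS pipes so that each iteration forces one wakeup per task as a byte travels down the line. Here is a minimal Python sketch of the threaded variant (the task and iteration counts, and all names, are mine, not the repo's; the real benchmark uses 500 Rust threads):

```python
import os
import threading
import time

N_TASKS = 50   # the real benchmark uses 500; kept small for a quick run
ITERS = 100

# One pipe between each pair of neighbors: thread i reads a byte from
# pipe i and forwards it into pipe i+1.
pipes = [os.pipe() for _ in range(N_TASKS + 1)]

def worker(i):
    r = pipes[i][0]
    w = pipes[i + 1][1]
    for _ in range(ITERS):
        os.write(w, os.read(r, 1))

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N_TASKS)]
for t in threads:
    t.start()

start = time.perf_counter()
for _ in range(ITERS):
    os.write(pipes[0][1], b"x")     # feed a byte into the head of the line
    os.read(pipes[-1][0], 1)        # wait for it to emerge at the tail
per_iter = (time.perf_counter() - start) / ITERS
print(f"{per_iter * 1e6:.1f}µs per iteration")

for t in threads:
    t.join()
```

Each byte's trip blocks every thread in turn on `os.read`, which is what makes the benchmark a measure of wakeup/context-switch cost rather than of useful work.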
So async is roughly 12× faster here, about a 92% reduction in time per iteration, not the ~30% one.
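For concreteness, the ratio implied by the two means above:

```python
# Means per iteration from the two runs above, in microseconds.
async_us = 572.666    # async-brigade
thread_us = 7104.0    # thread-brigade (7.104 ms)

speedup = thread_us / async_us          # how many times faster async is
reduction = 1 - async_us / thread_us    # fraction of time saved

print(f"{speedup:.1f}x faster, {reduction:.1%} less time per iteration")
# -> 12.4x faster, 91.9% less time per iteration
```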
Pinning to a single CPU core brings the threaded version closer to async though:
$ taskset --cpu-list 1 /bin/time ../target/release/thread-brigade
500 tasks, 10000 iterations:
mean 660.847µs per iteration, stddev 13.810µs (1321.000ns per task per iter)
0.49user 6.28system 0:06.83elapsed 99%CPU (0avgtext+0avgdata 6100maxresident)k
0inputs+0outputs (0major+1544minor)pagefaults 0swaps
Do the benchmarks not run directly on macOS?
I amended the README to say that I don't really understand why pinning to a single core speeds up thread-brigade. I mean, sure, I can guess that cross-core traffic is too slow or whatever, but that's not the same as actually knowing what is specifically happening.
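One way to probe the cross-core guess (a hypothetical experiment, not something from the repo): count context switches for a two-thread pipe ping-pong, once pinned with `taskset -c 1` and once unpinned, using `getrusage`. On Linux, `perf stat -e context-switches,cpu-migrations` on the two runs would likewise show whether migrations drop to zero when pinned.

```python
import os
import resource
import threading

# Two pipes form a ping-pong channel between the main thread and an
# echo thread; each round trip forces at least two context switches.
r1, w1 = os.pipe()
r2, w2 = os.pipe()
ITERS = 1000

def echo():
    for _ in range(ITERS):
        os.write(w2, os.read(r1, 1))

t = threading.Thread(target=echo)
t.start()

before = resource.getrusage(resource.RUSAGE_SELF)
for _ in range(ITERS):
    os.write(w1, b"x")
    os.read(r2, 1)
t.join()
after = resource.getrusage(resource.RUSAGE_SELF)

print("voluntary ctx switches:  ", after.ru_nvcsw - before.ru_nvcsw)
print("involuntary ctx switches:", after.ru_nivcsw - before.ru_nivcsw)
```

Comparing the counts (and the elapsed time) between pinned and unpinned runs would at least distinguish "more switches" from "the same switches, each costing more due to cross-core wakeups".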
I ran the tests in a Linux VM to keep the environment consistent with the one described in the README.
Running natively on macOS:
% time ../target/release/async-brigade
500 tasks, 10000 iterations:
mean 677.307µs per iteration, stddev 10.876µs (1354.000ns per task per iter)
../target/release/async-brigade 3.20s user 3.69s system 99% cpu 6.894 total
% time ../target/release/thread-brigade
500 tasks, 10000 iterations:
mean 688.988µs per iteration, stddev 79.514µs (1377.000ns per task per iter)
../target/release/thread-brigade 0.89s user 6.17s system 100% cpu 7.015 total
Looks like scheduler policy differences between Linux and macOS account for the gap.
Wow, they're about the same.
And, if I may continue to impose, how about one-thread-brigade, to see how much time is due to the I/O alone?
@jimblandy

> how about one-thread-brigade, to see how much time is due to the I/O alone?
macOS (M1):
% time ../target/release/one-thread-brigade
10000 iterations, 500 tasks, mean 259.929µs per iteration, stddev 29.523µs (519.000ns per task per iter)
../target/release/one-thread-brigade 0.69s user 1.92s system 99% cpu 2.617 total
Ubuntu (VM on M1):
$ /bin/time ../target/release/one-thread-brigade
10000 iterations, 500 tasks, mean 217.284µs per iteration, stddev 30.133µs (434.000ns per task per iter)
0.48user 1.70system 0:02.19elapsed 99%CPU (0avgtext+0avgdata 1664maxresident)k
0inputs+0outputs (0major+91minor)pagefaults 0swaps
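Taking the one-thread-brigade numbers as an I/O-only baseline, subtracting them gives a rough estimate of pure scheduling overhead (back-of-the-envelope arithmetic on the Ubuntu figures above; the comparison is approximate since the single-threaded I/O pattern isn't identical):

```python
# Ubuntu VM means, µs per iteration, from the runs above.
io_only = 217.284    # one-thread-brigade: pipe I/O alone
async_b = 572.666    # async-brigade
thread_b = 7104.0    # thread-brigade, unpinned (pinned: 660.847)

print(f"async scheduling overhead:  {async_b - io_only:.1f}µs per iteration")
print(f"thread scheduling overhead: {thread_b - io_only:.1f}µs per iteration")
```

So on this machine the I/O itself accounts for well under half of even the async figure, and almost none of the unpinned threaded one.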
> Running natively on macOS:
> % time ../target/release/async-brigade mean 677.307µs per iteration
>
> Testing on an Ubuntu VM (kernel version 5.4.0-65-generic) running on Apple M1 with the thread-brigade and async-brigade tests:
> $ /bin/time ../target/release/async-brigade mean 572.666µs per iteration
> ...
> macOS (M1):
> % time ../target/release/one-thread-brigade ...mean 259.929µs per iteration
>
> Ubuntu (VM on M1):
> $ /bin/time ../target/release/one-thread-brigade mean 217.284µs per iteration
Ubuntu in a VM is outperforming native macOS? That does seem like a weird result.