Data for different CPU architectures

Open losfair opened this issue 3 years ago • 5 comments

It seems that the results vary a lot on different CPU architectures.

Testing on an Ubuntu VM (kernel version 5.4.0-65-generic) running on Apple M1 with the thread-brigade and async-brigade tests:

$ /bin/time ../target/release/async-brigade 
500 tasks, 10000 iterations:
mean 572.666µs per iteration, stddev 10.912µs (1145.000ns per task per iter)
2.56user 3.26system 0:05.83elapsed 99%CPU (0avgtext+0avgdata 3964maxresident)k
0inputs+0outputs (0major+399minor)pagefaults 0swaps
$ /bin/time ../target/release/thread-brigade 
500 tasks, 10000 iterations:
mean 7.104ms per iteration, stddev 226.822µs (14.208µs per task per iter)
7.09user 78.75system 1:11.91elapsed 119%CPU (0avgtext+0avgdata 8340maxresident)k
0inputs+0outputs (0major+1523minor)pagefaults 0swaps

So it's a 90% speedup, not a 30% one.
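For reference, the arithmetic behind that figure can be checked with a few lines of Python. This is only a sketch: the `speedup` helper is a hypothetical name introduced here, and the inputs are the per-iteration means copied from the two runs above.

```python
def speedup(baseline, candidate):
    """Return (ratio, percent_reduction) of candidate relative to baseline."""
    ratio = baseline / candidate
    reduction = (1 - candidate / baseline) * 100
    return ratio, reduction


# Mean time per iteration from the runs above, converted to microseconds.
thread_us = 7104.0   # thread-brigade: 7.104ms
async_us = 572.666   # async-brigade: 572.666µs

ratio, reduction = speedup(thread_us, async_us)
print(f"{ratio:.1f}x faster, {reduction:.1f}% less time per iteration")
# prints "12.4x faster, 91.9% less time per iteration"
```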

Pinning to a single CPU core brings the threaded version closer to async though:

$ taskset --cpu-list 1 /bin/time ../target/release/thread-brigade 
500 tasks, 10000 iterations:
mean 660.847µs per iteration, stddev 13.810µs (1321.000ns per task per iter)
0.49user 6.28system 0:06.83elapsed 99%CPU (0avgtext+0avgdata 6100maxresident)k
0inputs+0outputs (0major+1544minor)pagefaults 0swaps
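If scripting these runs is more convenient than calling `taskset` by hand, roughly the same pinning can be done from Python on Linux. This is a minimal sketch, not how the benchmarks themselves do it: `run_pinned` is a hypothetical helper, and `os.sched_setaffinity` is only available on Linux.

```python
import os
import subprocess


def run_pinned(cmd, cpu):
    """Run cmd as a child process restricted to a single CPU (Linux only)."""
    def pin():
        # Runs in the child between fork and exec; pid 0 means "this process".
        os.sched_setaffinity(0, {cpu})

    return subprocess.run(cmd, preexec_fn=pin, capture_output=True, text=True)
```

Something like `run_pinned(["../target/release/thread-brigade"], 1)` would then correspond to the `taskset --cpu-list 1` invocation above, assuming CPU 1 is in the set the process is allowed to use.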

losfair, Feb 12 '21 03:02

Do the benchmarks not run directly on macOS?

I amended the README to say that I don't really understand why pinning to a single core speeds up thread-brigade. I mean, sure, I can guess that cross-core traffic is too slow or whatever, but that's not the same as actually knowing what is specifically happening.

jimblandy, Feb 12 '21 06:02

I ran the tests in a Linux VM to keep the environment consistent with what is described in the README.

Running natively on macOS:

% time ../target/release/async-brigade
500 tasks, 10000 iterations:
mean 677.307µs per iteration, stddev 10.876µs (1354.000ns per task per iter)
../target/release/async-brigade  3.20s user 3.69s system 99% cpu 6.894 total
% time ../target/release/thread-brigade
500 tasks, 10000 iterations:
mean 688.988µs per iteration, stddev 79.514µs (1377.000ns per task per iter)
../target/release/thread-brigade  0.89s user 6.17s system 100% cpu 7.015 total

Looks like there are some scheduler policy differences between Linux and macOS leading to the difference.

losfair, Feb 12 '21 06:02

Wow, they're about the same.

And, if I may continue to impose, how about one-thread-brigade, to see how much time is due to the I/O alone?

jimblandy, Feb 12 '21 16:02

@jimblandy

how about one-thread-brigade, to see how much time is due to the I/O alone?

macOS (M1):

% time ../target/release/one-thread-brigade
10000 iterations, 500 tasks, mean 259.929µs per iteration, stddev 29.523µs (519.000ns per task per iter)
../target/release/one-thread-brigade  0.69s user 1.92s system 99% cpu 2.617 total

Ubuntu (VM on M1):

$ /bin/time ../target/release/one-thread-brigade 
10000 iterations, 500 tasks, mean 217.284µs per iteration, stddev 30.133µs (434.000ns per task per iter)
0.48user 1.70system 0:02.19elapsed 99%CPU (0avgtext+0avgdata 1664maxresident)k
0inputs+0outputs (0major+91minor)pagefaults 0swaps
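One rough way to read these numbers is to treat the one-thread-brigade time as the I/O baseline and subtract it from the other per-task figures. The sketch below does that with the values reported in this thread. The clean subtraction is a simplifying assumption, and `overhead_beyond_io` is a hypothetical helper name.

```python
# Per-task, per-iteration times in nanoseconds, copied from the runs in this thread.
runs = {
    "macOS": {"one-thread": 519, "async": 1354, "thread": 1377},
    "Ubuntu VM": {"one-thread": 434, "async": 1145, "thread": 14208},
}


def overhead_beyond_io(times):
    """Subtract the one-thread (I/O only) baseline from each other variant."""
    base = times["one-thread"]
    return {name: t - base for name, t in times.items() if name != "one-thread"}


for platform, times in runs.items():
    print(platform, overhead_beyond_io(times))
# macOS {'async': 835, 'thread': 858}
# Ubuntu VM {'async': 711, 'thread': 13774}
```

Under that assumption, the non-I/O cost per task is similar for async on both systems and for threads on macOS, while unpinned threads on the Ubuntu VM stand out by more than an order of magnitude.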

losfair, Feb 13 '21 07:02

Running natively on macOS:

% time ../target/release/async-brigade
mean 677.307µs per iteration

Testing on a Ubuntu VM (kernel version 5.4.0-65-generic) running on Apple M1 with the thread-brigade and async-brigade tests:

$ /bin/time ../target/release/async-brigade 
mean 572.666µs per iteration

...

macOS (M1):

% time ../target/release/one-thread-brigade
...mean 259.929µs per iteration

Ubuntu (VM on M1)

$ /bin/time ../target/release/one-thread-brigade 
mean 217.284µs per 

Ubuntu in a VM is outperforming native macOS? That does seem like a weird result.

ehiggs, Feb 15 '21 22:02