stress
Fix CPU hog function
cpu_hog: don't use non-reentrant rand() in threads, do something with the result
Previously, stress -c did a terrible job at actually loading the CPU; it was idle most of the time:
    $> perf stat stress -c 16 -t 5
    stress: info: [526148] dispatching hogs: 16 cpu, 0 io, 0 vm, 0 hdd
    stress: info: [526148] successful run completed in 5s

     Performance counter stats for 'stress -c 16 -t 5':

           79,580.45 msec task-clock:u               #   15.910 CPUs utilized
                   0      context-switches:u         #    0.000 /sec
                   0      cpu-migrations:u           #    0.000 /sec
                 309      page-faults:u              #    3.883 /sec
     418,716,815,425      cycles:u                   #    5.262 GHz
     262,176,845,042      stalled-cycles-frontend:u  #   62.61% frontend cycles idle
     617,055,840,870      instructions:u             #    1.47  insn per cycle
                                                     #    0.42  stalled cycles per insn
     175,186,890,751      branches:u                 #    2.201 G/sec
         269,450,686      branch-misses:u            #    0.15% of all branches

         5.001799550 seconds time elapsed

        79.463002000 seconds user
         0.007854000 seconds sys
This means that in more than half of all cycles, the CPU frontend had nothing to do. Why? A perf record -g trace of the same invocation shows that the process spends > 99% of its time in __random, waiting for an integer comparison that involves a data load. No surprise there: rand() relies on global state that has to be kept synchronized between threads. With that percentage in mind, it hardly matters that the result of sqrt() was never used.
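For reference, the old hog loop boiled down to something like the sketch below (illustrative only; the function name cpu_hog is taken from the commit subject, and the exact upstream code may differ):

    #include <math.h>
    #include <stdlib.h>

    /* Old behaviour, roughly: every iteration calls rand(), which goes
     * through shared global state, and the sqrt() result is never used. */
    static int cpu_hog(void)
    {
        for (;;)
            sqrt((double) rand());
        return 0;
    }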
This commit addresses both points:
- it stores the result of sqrt() in a volatile double
- to stay portable while using a pseudo-random number generator with very little state, it simply inlines xoroshiro128+ [1] (sketched below), which is under an MIT-0-style "dedication to the public domain" license
We still don't "spin on sqrt()", because floating-point sqrt is very fast on modern desktop/server CPUs; but at least we now actually make the CPU do its rounds.
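A minimal sketch of the reworked loop, for illustration (the generator core follows the public xoroshiro128+ reference implementation at [1]; the function and variable names here are assumptions, not necessarily those of the actual commit):

    #include <math.h>
    #include <stdint.h>

    static inline uint64_t rotl64(uint64_t x, int k)
    {
        return (x << k) | (x >> (64 - k));
    }

    /* xoroshiro128+: returns the next 64-bit value and advances the
     * two-word state in place -- purely local state, no locking. */
    static inline uint64_t xoroshiro128plus(uint64_t s[2])
    {
        const uint64_t s0 = s[0];
        uint64_t s1 = s[1];
        const uint64_t result = s0 + s1;

        s1 ^= s0;
        s[0] = rotl64(s0, 24) ^ s1 ^ (s1 << 16);
        s[1] = rotl64(s1, 37);
        return result;
    }

    static int cpu_hog(void)
    {
        /* Arbitrary non-zero seed; every worker keeps its own state. */
        uint64_t state[2] = { 0x9E3779B97F4A7C15ull, 0xBF58476D1CE4E5B9ull };
        volatile double sink; /* the store keeps sqrt() from being optimized away */

        for (;;)
            sink = sqrt((double) xoroshiro128plus(state));
        return 0;
    }

Because the whole generator state lives in two local 64-bit words, there is no shared data to load or synchronize, which is what removes the frontend stalls seen above.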
With this change, the statistics look like this:
    stress: info: [580362] dispatching hogs: 16 cpu, 0 io, 0 vm, 0 hdd
    stress: info: [580362] successful run completed in 5s

     Performance counter stats for '/home/marcus/.usrlocal/bin/stress -c 16 -t 5':

           79,575.88 msec task-clock:u               #   15.900 CPUs utilized
                   0      context-switches:u         #    0.000 /sec
                   0      cpu-migrations:u           #    0.000 /sec
                 453      page-faults:u              #    5.693 /sec
     425,671,786,366      cycles:u                   #    5.349 GHz
         139,385,837      stalled-cycles-frontend:u  #    0.03% frontend cycles idle
   1,055,461,772,875      instructions:u             #    2.48  insn per cycle
                                                     #    0.00  stalled cycles per insn
      45,889,827,063      branches:u                 #  576.680 M/sec
             220,005      branch-misses:u            #    0.00% of all branches

         5.004837362 seconds time elapsed

        79.455434000 seconds user
         0.006330000 seconds sys
So we're nearly doubling the number of instructions actually executed, showing that we're now really stressing our superscalar CPU.
[1] https://prng.di.unimi.it/