Fix CPU hog function

Open marcusmueller opened this issue 8 months ago • 0 comments

cpu_hog: don't use non-reentrant rand() in threads, do smth with result

Previously, stress -c did a terrible job at actually loading the CPU; it was idle most of the time:

$> perf stat stress -c 16 -t 5
stress: info: [526148] dispatching hogs: 16 cpu, 0 io, 0 vm, 0 hdd
stress: info: [526148] successful run completed in 5s

 Performance counter stats for 'stress -c 16 -t 5':

         79,580.45 msec task-clock:u                     #   15.910 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
               309      page-faults:u                    #    3.883 /sec
   418,716,815,425      cycles:u                         #    5.262 GHz
   262,176,845,042      stalled-cycles-frontend:u        #   62.61% frontend cycles idle
   617,055,840,870      instructions:u                   #    1.47  insn per cycle
                                                  #    0.42  stalled cycles per insn
   175,186,890,751      branches:u                       #    2.201 G/sec
       269,450,686      branch-misses:u                  #    0.15% of all branches

       5.001799550 seconds time elapsed

      79.463002000 seconds user
       0.007854000 seconds sys

This means that in more than half of the cycles (62.61%), the CPU frontend couldn't do anything. Why? A perf record -g trace of the same invocation tells us that the CPU is spending > 99% of its time in __random, waiting for an integer comparison that involves a data load.

No surprise there: rand() relies on global state that needs to get synchronized.

With this percentage in mind, it's not so bad that the result of sqrt never got used.

This commit changes both:

  • stores the result of sqrt in a volatile double, so the result is actually used
  • to stay portable and keep the generator state very small, inlines xoroshiro128+ [1] for pseudo-random number generation; it is under an MIT-0 style "dedication to public domain" license.
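
To illustrate the two changes, here is a minimal sketch and not the code from this commit. It assumes the xoroshiro128+ variant with the constants 24, 16 and 37; the names xoro_state, rotl64, xoro_next and hog_rounds are hypothetical, and the real hog loops until it is killed instead of counting rounds.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

/* Generator state lives with the caller, so workers share nothing. */
struct xoro_state {
    uint64_t s[2];
};

static inline uint64_t rotl64(uint64_t x, int k)
{
    return (x << k) | (x >> (64 - k));
}

/* One xoroshiro128+ step (assumed constants: 24, 16, 37). */
static inline uint64_t xoro_next(struct xoro_state *st)
{
    const uint64_t s0 = st->s[0];
    uint64_t s1 = st->s[1];
    const uint64_t result = s0 + s1;

    s1 ^= s0;
    st->s[0] = rotl64(s0, 24) ^ s1 ^ (s1 << 16);
    st->s[1] = rotl64(s1, 37);
    return result;
}

/* Hypothetical helper: a bounded version of the hog loop. */
static inline double hog_rounds(struct xoro_state *st, uint64_t rounds)
{
    /* volatile forces a store on every iteration, so sqrt is not optimized away */
    volatile double sink = 0.0;

    /* the all-zero state is invalid: the generator would only emit zeros */
    assert(st->s[0] != 0 || st->s[1] != 0);
    for (uint64_t i = 0; i < rounds; i++)
        sink = sqrt((double)xoro_next(st));
    return sink;
}
```

Each worker would seed its own struct xoro_state with any nonzero value (how to seed is left open here) and then call hog_rounds; on most toolchains this needs -lm at link time.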

We still don't "spin on sqrt()", because floating point sqrt is very very fast on modern desktop/server CPUs; but at least we actually make the CPU do its rounds.

With this change, the statistic now looks like this:

stress: info: [580362] dispatching hogs: 16 cpu, 0 io, 0 vm, 0 hdd
stress: info: [580362] successful run completed in 5s

 Performance counter stats for '/home/marcus/.usrlocal/bin/stress -c 16 -t 5':

         79,575.88 msec task-clock:u                     #   15.900 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
               453      page-faults:u                    #    5.693 /sec
   425,671,786,366      cycles:u                         #    5.349 GHz
       139,385,837      stalled-cycles-frontend:u        #    0.03% frontend cycles idle
 1,055,461,772,875      instructions:u                   #    2.48  insn per cycle
                                                  #    0.00  stalled cycles per insn
    45,889,827,063      branches:u                       #  576.680 M/sec
           220,005      branch-misses:u                  #    0.00% of all branches

       5.004837362 seconds time elapsed

      79.455434000 seconds user
       0.006330000 seconds sys

So, we now execute about 1.7 times as many instructions in the same 5 s (617 G before, 1,055 G after), and instructions per cycle rise from 1.47 to 2.48, which shows that we're now really stressing our superscalar CPU.

[1] https://prng.di.unimi.it/

marcusmueller, Feb 20 '25 23:02