Multithreaded sequential programs spend most of their active time waiting

Open developedby opened this issue 3 years ago • 4 comments

For programs that are dominated by expressions that must be reduced in sequential order, there is a very large performance drop proportional to the number of threads used.

Take this example program:

(Func 0 b) = b
(Func a b) = (Func (- a 1) (^ (<< b 1) b))

(Main) = (Func 100000 1)
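
To make the data dependency explicit, here is a rough Rust rendering of the same fold (hypothetical, just for illustration): each step consumes the accumulator produced by the previous one, so no two steps can run in parallel.

// Sequential Rust equivalent of Func: b' = (b << 1) ^ b, repeated a times.
// Each iteration depends on the previous b, leaving nothing to parallelize.
fn func(mut a: u64, mut b: u64) -> u64 {
    while a != 0 {
        b = (b << 1) ^ b;
        a -= 1;
    }
    b
}

fn main() {
    println!("{}", func(100_000, 1));
}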

Running with one thread, I get ~30MR/s, while with 2 threads I get ~20MR/s and with 16 threads ~10MR/s.

Analyzing what the program spends its time on with perf, it looks like most of the time the threads are trying to steal a task from another thread, failing (because this program is inherently sequential and there isn't enough parallel work for all the threads), and then waiting to try again.

While it's true that we should try to write parallelizable programs for the HVM, for tasks that are mostly sequential we should have better behaviour.

developedby avatar Dec 06 '22 17:12 developedby

I think the goal should be to avoid unproductive stealing with suitable heuristics, well-placed back-offs, etc. If threads spend a lot of effort stealing each other's work and no parallelism is gained, that should be identifiable so we can just stop doing that - right?

VictorTaelin avatar Dec 07 '22 06:12 VictorTaelin

> I think the goal should be to avoid unproductive stealing with suitable heuristics, well-placed back-offs, etc. If threads spend a lot of effort stealing each other's work and no parallelism is gained, that should be identifiable so we can just stop doing that - right?

I think the exponential backoff should be doing that, but it seems the threads are able to steal tasks just often enough that they never stay asleep; instead, all the threads just keep running slowly.

Another problem with the current implementation is that after reducing a term to WHNF, the backoff object is destroyed and created again to reduce the subterms, so the heuristic knowledge is lost (although that doesn't happen in this specific example).
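
A minimal sketch of the shape of the fix (hypothetical code, not HVM's actual internals, assuming crossbeam_utils::Backoff): hoist the backoff state out of the per-term loop so that repeated failed steals eventually put the thread to sleep instead of resetting the counter.

use crossbeam_utils::Backoff;

type Task = Box<dyn FnOnce() + Send>;

fn steal_loop(try_steal: impl Fn() -> Option<Task>) {
    // Hoisted: the failure history survives across terms/subterms.
    let backoff = Backoff::new();
    loop {
        match try_steal() {
            Some(task) => {
                task();
                backoff.reset(); // found work, restart the backoff curve
            }
            // Backoff exhausted: stop burning CPU and park until notified.
            None if backoff.is_completed() => {
                std::thread::park();
                backoff.reset();
            }
            None => backoff.snooze(),
        }
    }
}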

developedby avatar Dec 07 '22 09:12 developedby

Regarding Backoff: Spinning (whether via yield_now, Backoff, spin_loop_hint, etc.) is an inefficient use of CPU resources, it interacts badly with oversubscription, and it wastes energy / CPU cycles (https://github.com/crossbeam-rs/crossbeam/issues/821). Putting the threads to sleep properly would be better. Spinning also maximizes contention on the task queues, which is why libraries like rayon and tokio have notification throttling to combat this. These libraries also don't let threads keep stealing from each other when there isn't enough work to do: in rayon, threads go from active to idle (searching for work to steal from other threads), to sleepy (about to go to sleep after one last attempt to find work), and then to sleeping; tokio limits the number of threads that steal at the same time.
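
As a rough sketch of that active → idle → sleepy → sleeping progression (hypothetical names and structure, not rayon's actual code), the key point is that the final state blocks on a condvar, costing no CPU until a producer notifies:

use std::sync::{Condvar, Mutex};

struct Sleeper {
    work_available: Mutex<bool>,
    cv: Condvar,
}

impl Sleeper {
    // Sleeping: block until a producer publishes work. No CPU is burned.
    fn sleep(&self) {
        let mut work = self.work_available.lock().unwrap();
        while !*work {
            work = self.cv.wait(work).unwrap();
        }
        *work = false;
    }

    // Called by a producer after queueing a task; wakes at most one sleeper.
    fn notify_one(&self) {
        *self.work_available.lock().unwrap() = true;
        self.cv.notify_one();
    }
}

fn worker_loop(sleeper: &Sleeper, find_work: impl Fn() -> Option<Box<dyn FnOnce()>>) {
    loop {
        // Active/idle: run local work, then make a few steal attempts.
        let mut found = false;
        for _ in 0..16 {
            if let Some(task) = find_work() {
                task();
                found = true;
            }
        }
        if found {
            continue;
        }
        // Sleepy: one last look before committing to sleep.
        if let Some(task) = find_work() {
            task();
            continue;
        }
        // Sleeping: hand off to the condvar until notified.
        sleeper.sleep();
    }
}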

The threads spawned aren't the only threads running on the system. Unless each thread is pinned to a core, the OS can move the producer onto the core of a spinning consumer (either randomly or to improve temporal cache locality), and the consumer can then waste that core's quota spinning even though the producer on the same core was ready to execute. So the more spinning stealer threads there are, the more potential there is for slowing down producers. To the CPU, a spinning thread looks very productive, even busier than genuinely productive threads that are intermittently blocked (usually by a cache miss, but possibly by a branch misprediction or a long-latency operation). In the Func example above, if you increase the thread count to a number much higher than the available CPU cores, the R/s would degrade even further, because all the spinning stealer threads hog cores and choke the producers. (Then try doing something similar with a rayon thread pool to see the difference.)
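
For concreteness, that comparison could look like this (a sketch; the sequential fold stands in for Func): run the same work inside a deliberately oversubscribed rayon pool and observe that the idle workers go to sleep rather than choking the one productive thread.

use rayon::ThreadPoolBuilder;

fn main() {
    // Far more threads than cores; rayon parks the idle ones.
    let pool = ThreadPoolBuilder::new()
        .num_threads(64)
        .build()
        .unwrap();
    let result = pool.install(|| {
        let (mut a, mut b) = (100_000u64, 1u64);
        while a != 0 {
            b = (b << 1) ^ b;
            a -= 1;
        }
        b
    });
    println!("{result}");
}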

Scheduling a task involves both queueing it to be popped and telling other worker threads to pop and perform it. crossbeam-deque could be used for the first part. Rayon uses crossbeam for the first part, but then also provides the second part (proper notification throttling, limits on excessive stealing, thread sleeping).
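
The first part is essentially the pattern from crossbeam-deque's documentation (shown here lightly adapted): a local Worker per thread, Stealers into everyone else's deques, and a global Injector.

use std::iter;
use crossbeam_deque::{Injector, Stealer, Worker};

fn find_task<T>(
    local: &Worker<T>,
    global: &Injector<T>,
    stealers: &[Stealer<T>],
) -> Option<T> {
    // Pop from our own deque first.
    local.pop().or_else(|| {
        iter::repeat_with(|| {
            // Then try a batch from the global queue, then steal from peers.
            global
                .steal_batch_and_pop(local)
                .or_else(|| stealers.iter().map(|s| s.steal()).collect())
        })
        // Keep trying while any steal operation reported Retry.
        .find(|s| !s.is_retry())
        // Extract the stolen task, if any.
        .and_then(|s| s.success())
    })
}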

"telling others to pop a task" without a bunch of overhead is its own challenge:

  • https://zig.news/kprotty/resource-efficient-thread-pools-with-zig-3291#notification-throttling
  • https://tokio.rs/blog/2019-10-scheduler#throttle-stealing
  • https://github.com/rayon-rs/rayon/tree/master/rayon-core/src/sleep#readme
  • https://github.com/rayon-rs/rfcs/pull/5
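
The common thread in those links is capping how many workers search at once. A sketch of that throttle (hypothetical names, loosely after the tokio post above):

use std::sync::atomic::{AtomicUsize, Ordering};

struct StealThrottle {
    searching: AtomicUsize,
    limit: usize, // e.g. half the worker count, as tokio does
}

impl StealThrottle {
    // A worker must acquire a slot before transitioning to "searching",
    // so a burst of wakeups can't stampede the task queues.
    fn try_start_search(&self) -> bool {
        let mut cur = self.searching.load(Ordering::Acquire);
        loop {
            if cur >= self.limit {
                return false; // enough searchers already; go (back) to sleep
            }
            match self.searching.compare_exchange_weak(
                cur,
                cur + 1,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return true,
                Err(actual) => cur = actual,
            }
        }
    }

    fn end_search(&self) {
        self.searching.fetch_sub(1, Ordering::Release);
    }
}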

If we want to get something that's robust fast, we can use rayon's scoped pool :)
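
For reference, rayon's scoped API looks like this in use: spawned tasks may borrow from the enclosing stack frame, and the scope only returns once every task has finished.

fn main() {
    let data = vec![1u64, 2, 3, 4];
    let (left, right) = data.split_at(2);
    let (mut left_sum, mut right_sum) = (0u64, 0u64);
    rayon::scope(|s| {
        // Each spawned task borrows from the stack; rayon joins them
        // before `scope` returns.
        s.spawn(|_| left_sum = left.iter().sum());
        s.spawn(|_| right_sum = right.iter().sum());
    });
    println!("{}", left_sum + right_sum);
}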

Boscop avatar May 02 '23 19:05 Boscop

I was messing around with my own interaction net evaluator for educational purposes (not working yet or anything), and I can testify that Rayon is extremely nice to use; it has so far prevented me from reaching for unsafe code in many cases.

zicklag avatar May 03 '23 00:05 zicklag