
Sync vars serialize execution for schedulers w/o work-stealing

ronawho opened this issue on Sep 5, 2018 · 5 comments

Chapel users (and devs) have run into some serious performance regressions when using sync variables as locks. We narrowed this down to code in the FEB scheduling path that funnels woken qthreads onto a single shepherd, which effectively serializes execution for schedulers that don't support work-stealing (we default to the nemesis scheduler, with 1 worker per shepherd and 1 shepherd per core):

https://github.com/Qthreads/qthreads/blob/d10360025b525f62b748934ab6f9b0781d385fd8/src/feb.c#L179-L188

The basic issue is that when a qthread is woken up after blocking, it's scheduled onto the current shepherd (unless QTHREAD_UNSTEALABLE was set for that qthread, in which case it's spawned back to its original shepherd).
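For reference, the linked logic is roughly the following (paraphrased from the pinned revision, not verbatim; internal names and signatures may have drifted):

static void qt_feb_schedule(qthread_t *waiter, qthread_shepherd_t *shep)
{
    if ((waiter->flags & QTHREAD_UNSTEALABLE) &&
        (waiter->rdata->shepherd_ptr != shep)) {
        /* pinned qthreads are sent back to their original shepherd... */
        qt_threadqueue_enqueue(waiter->rdata->shepherd_ptr->ready, waiter);
    } else {
        /* ...but everyone else lands on the shepherd that performed the
           releasing FEB write, so without work-stealing all waiters pile
           up on that one shepherd */
        qt_threadqueue_enqueue(shep->ready, waiter);
    }
}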

https://github.com/chapel-lang/chapel/issues/9651 was the issue tracking this on our end, and https://github.com/chapel-lang/chapel/commit/26c238a769 is the workaround we're using: we just always schedule the qthread back onto its original shepherd. That's probably not the "right" solution for qthreads, though.

To test performance, we added an SPMD version of STREAM-triad. If it's helpful, I can probably make a pure qthreads reproducer too (a rough sketch of one follows the results below). Here's the Chapel kernel we were using:

coforall tid in 0..#numTasks {
  barrier.barrier();
  for i in chunk(1..m, numTasks, tid) do
    A[i] = B[i] + alpha * C[i];
}

and the performance we saw with no barrier, an atomic-based barrier, and a sync-based barrier:

config           performance
no barrier       84 GB/s
atomic barrier   84 GB/s
sync barrier     10 GB/s
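As a starting point, here's an untested sketch of what that pure-qthreads reproducer could look like, using only public API calls (qthread_fork, qthread_readFE/writeEF/readFF). The FEB "barrier" is single-use and deliberately naive; the point is just to force every task through a FEB wakeup before the triad loop. Timing is omitted:

#include <qthread/qthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define M (64UL * 1024 * 1024)

static double *A, *B, *C;
static const double alpha = 3.0;
static size_t num_tasks;

static aligned_t arrived;   /* starts full, holding 0 */
static aligned_t gate;      /* emptied in main; filled to release waiters */

/* single-use FEB barrier: `arrived` acts as a lock via readFE/writeEF,
   and the last task to arrive fills `gate` to wake everyone else */
static void feb_barrier(void)
{
    aligned_t n;
    qthread_readFE(&n, &arrived);       /* "lock": leaves arrived empty */
    n++;
    qthread_writeEF(&arrived, &n);      /* "unlock": refills arrived */
    if (n == num_tasks) {
        qthread_fill(&gate);            /* last arrival opens the gate */
    } else {
        aligned_t dummy;
        qthread_readFF(&dummy, &gate);  /* block until gate is full */
    }
}

static aligned_t triad(void *arg)
{
    size_t tid = (size_t)(uintptr_t)arg;
    size_t lo  = tid * M / num_tasks;
    size_t hi  = (tid + 1) * M / num_tasks;

    feb_barrier();   /* this wakeup is what exercises qt_feb_schedule() */
    for (size_t i = lo; i < hi; i++) A[i] = B[i] + alpha * C[i];
    return 0;
}

int main(void)
{
    qthread_initialize();
    num_tasks = qthread_num_workers();

    A = malloc(M * sizeof *A);
    B = malloc(M * sizeof *B);
    C = malloc(M * sizeof *C);
    for (size_t i = 0; i < M; i++) { B[i] = 1.0; C[i] = 2.0; }

    qthread_empty(&gate);   /* gate must start empty */

    aligned_t *rets = malloc(num_tasks * sizeof *rets);
    for (size_t t = 0; t < num_tasks; t++)
        qthread_fork(triad, (void *)(uintptr_t)t, &rets[t]);
    for (size_t t = 0; t < num_tasks; t++) {
        aligned_t dummy;
        qthread_readFF(&dummy, &rets[t]);   /* join on the return FEB */
    }
    printf("done\n");
    return 0;
}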

I think the "right" solution here is probably to have a way to query whether a scheduler supports work-stealing (and whether it's currently enabled), and to check that in addition to QTHREAD_UNSTEALABLE.
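Something like this, say (entirely hypothetical; no such query exists in qthreads today):

/* hypothetical capability query -- nothing like this exists today */
int qt_threadqueue_features_stealing(void);   /* 1 if stealing is enabled */

/* then qt_feb_schedule()'s "send it home" test could broaden to: */
if ((waiter->flags & QTHREAD_UNSTEALABLE) ||
    !qt_threadqueue_features_stealing()) {
    /* nothing will rebalance the load later, so return the waiter to
       its original shepherd instead of piling onto the current one */
    qt_threadqueue_enqueue(waiter->rdata->shepherd_ptr->ready, waiter);
} else {
    qt_threadqueue_enqueue(shep->ready, waiter);
}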

ronawho · Sep 5, 2018

As I recall, the original design was based on the assumption of a producer/consumer relationship. In that case, the consumer (which just got woken up) will likely want to consume the value while it's still warm in the cache of the thread that produced it. But, as you point out, that's not a good strategy when FEBs are being used for non-producer/consumer synchronization patterns, especially barriers. That's probably also true for situations where the thread that released the waiter intends to keep living and producing data for a while.

The simplest "good" solution, I think, is to make qt_feb_schedule() use qt_threadqueue_choose_dest(), just like qt_spawn() does. It may not be the optimal solution in all cases, but it'll avoid performance issues like this.

Setting QTHREAD_UNSTEALABLE is a workaround for the barrier case, but it's (imho) a bad general-case solution, because sometimes all the threads woken up could have come from the same source thread (we have no way to know a priori, and it's probably unnecessarily expensive to check).

m3m0ryh0l3 · Sep 10, 2018

Ah, interesting, thanks for the context/history.

To date, the majority of our use cases have been using sync variables as locks. I don't think anything in production actually uses the sync-based barrier, but it was an easy way to demonstrate the worst-case performance.
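For context, on the qthreads side that lock pattern boils down to roughly this (a simplified sketch; Chapel's sync vars lower to FEB operations in our tasking layer):

static aligned_t lock;   /* full == unlocked, empty == locked */

static void sync_lock(void)
{
    aligned_t v;
    qthread_readFE(&v, &lock);    /* blocks until full, leaves it empty */
}

static void sync_unlock(void)
{
    aligned_t v = 0;
    qthread_writeEF(&lock, &v);   /* refills, waking at most one waiter */
}

So every unlock goes through the wakeup path discussed above, and the woken waiter lands on the unlocker's shepherd.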

ronawho · Sep 10, 2018

Is qt_feb_schedule() in the right place? It seems like the problem is that you want different FEB scheduling disciplines depending on your scheduler. Using qt_threadqueue_choose_dest() sort of does that, but FEBs are quite often a scheduling special case depending on context (e.g., are they just synchronization, or an actual data channel?).

It might be worth adding settable flags to individual FEBs to specify their intended use. We've already done that to a small extent with qthreads themselves by adding a NETWORK flag. There's no reason (that I know of) FEBs couldn't have a SYNC/DATA flag, with different scheduling behavior for each.

npe9 · Sep 11, 2018

It's entirely likely that qt_feb_schedule() should, instead of taking a single thread, take a list of threads, so that it can make decisions about them as a group.

The problem with annotating FEBs themselves is that they're somewhat transient and can (in theory) exist anywhere (the implementation was modeled after the MTA, after all), which is, admittedly, a bit of an impedance mismatch with Chapel sync vars. To do a SYNC/DATA flag properly, I think you'd need the flag on the particular writeEF/readFE/readFF operation rather than on the address it manipulates. But I'd be willing to believe that a scheduler-specific "schedule unblocked threads" function would be useful, especially if it could be fed that kind of additional information about the intended semantics.
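A purely hypothetical sketch of how those two pieces (a per-operation intent flag plus a group-wakeup hook) could fit together; none of these names exist in qthreads today:

/* hypothetical per-operation intent, rather than tagging the address */
typedef enum {
    QT_FEB_INTENT_DATA,  /* producer/consumer: wake the waiter near the writer */
    QT_FEB_INTENT_SYNC   /* pure synchronization: spread or return waiters */
} qt_feb_intent_t;

/* hypothetical variant carrying the intent on the operation itself */
int qthread_writeEF_intent(aligned_t *dest, const aligned_t *src,
                           qt_feb_intent_t intent);

/* hypothetical scheduler-specific hook handed all unblocked waiters at
   once, so a barrier's worth of wakeups can be distributed as a group */
void qt_feb_schedule_group(qthread_t **waiters, size_t count,
                           qthread_shepherd_t *cur_shep,
                           qt_feb_intent_t intent);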

m3m0ryh0l3 · Sep 11, 2018

Just noting this is still of interest to us. We have a workaround we apply, but it doesn't seem particularly elegant.

ronawho · Jan 26, 2021