
Comparison to Weave

Open dumblob opened this issue 3 years ago • 14 comments

This looks like a simple yet quite efficient implementation of work stealing. I wonder how it compares to the somewhat more modern work-stealing framework Weave.

Before I spend hours reading and understanding the inner details of Lace, could anyone make a rough comparison (of the approaches taken as well as real measured performance)? Weave has a quite readable primer and documentation, so this comparison is more about the Lace side :wink:.

dumblob avatar Nov 01 '21 19:11 dumblob

I'm not aware of Weave. As far as I can see, one difference is that Weave is not for C, while Lace is?

You can easily compile Lace, run the Fibonacci benchmark, and compare it with Weave if you want to compare performance. I don't have time for this right now.

trolando avatar Nov 02 '21 17:11 trolando

Weave does compile to C (it is written in Nim, which compiles to C).

Should you find some time to write down how the internals of Lace work and what the trade-offs are, please let me know. Thanks!

dumblob avatar Nov 02 '21 18:11 dumblob

Any insights into how Lace works?

dumblob avatar May 30 '22 22:05 dumblob

I wrote two papers about it back in 2014 when it was in active development, and there are examples in the benchmarks folder. For example, benchmarks/fib/fib-lace.c is a simple one.

You initialize Lace with lace_start, which starts a bunch of workers that will then actively hold your CPU hostage until you use the RUN macro to start a task. Tasks are defined using macros like TASK_1(return_type, name, param_1_type, param_1_name) { code here }. Inside Lace tasks, you can then SPAWN new tasks so that other workers may execute them, and you use SYNC to obtain the result of a spawned task. (If the task was not stolen, it is executed by the current worker.) You can also use CALL to execute a task directly instead of putting it on the local task stack for others to steal.

There are a number of more advanced features like barriers and interrupts; they are used in my Sylvan project. You can also produce statistics on the performance of the load balancer, and you can temporarily stop the Lace workers and restart them using lace_suspend and lace_resume.

trolando avatar May 30 '22 22:05 trolando

Internally, as described in my early PhD work, Lace uses a special queue that is optimized for this particular task; it is the topic of one of the papers. The infrastructure is very similar to Wool, a project by Karl-Filip Faxén from 2010, which in turn was inspired by Cilk, which used to be part of the GCC compiler for a while. In my opinion, Lace and Wool offer the best performance for fine-grained task-based parallelism (fork-join semantics), but of course the downside is the abuse of C macros...

trolando avatar May 30 '22 22:05 trolando

By the way, you could probably easily check the benchmarks if you have experience with Weave. I see at https://github.com/mratsim/weave/tree/master/benchmarks that several of them come from Cilk, so they are probably quite similar to the benchmarks implemented in Lace. Simply using ccmake to set LACE_BUILD_BENCHMARKS to ON should suffice to build the benchmarks...

trolando avatar May 30 '22 22:05 trolando

It took some effort to install this with the outdated Nim on Manjaro (I had to install it manually), then figure out how to run things; I'm guessing that simply nim r benchmarks/nqueens/weave_nqueens.nim 14 is the appropriate call...

Scheduler:            Weave (eager flowvars)
Benchmark:            N-queens
Threads:              6
Time(us)              41513.84204101562
Max RSS (KB):         22540
Runtime RSS (KB):     8836
# of page faults:     36344
Problem size:         14x14 board with 14 queens
Solutions found:      365596

With Lace, I get:

[tom@ryve build]$ benchmarks/queens-lace -w 6 14
running queens 14 with 6 workers...
Result: Q(14) = 365596
Time: 4.037406

So Lace in 4 seconds, Weave in 41 seconds.

Another benchmark, fib... disappointing. Weave takes 10.852 seconds to compute fib 38; Lace does it in 0.24 seconds on my machine.

Maybe this is not how I'm supposed to run Weave, but I have no clue; I have never used Nim before.

trolando avatar May 30 '22 23:05 trolando

Wow, you are quick! Thanks a lot for the description of Lace. Admittedly I didn't read the paper, so your overview made me want to read it!

Btw. yes, Nim is weird in that it requires some parameters to compile a "production build", and nim r is always slow (it produces a debug build, disables optimizations, etc.).

If I have any more questions, I will certainly come back :wink:.

Btw. thanks for mentioning Sylvan - I might have a use case for it in a year or two.

dumblob avatar May 31 '22 10:05 dumblob

Using nim c -d:danger ... to compile to a native executable...

[tom@ryve weave]$ ./benchmarks/nqueens/weave_nqueens 15
Scheduler:            Weave (eager flowvars)
Benchmark:            N-queens
Threads:              6
Time(us)              24365.958984375
Max RSS (KB):         23080
Runtime RSS (KB):     21588
# of page faults:     176675
Problem size:         15x15 board with 15 queens
Solutions found:      2279184

[tom@ryve build]$ benchmarks/queens-lace 15 -w 6
running queens 15 with 6 workers...
Result: Q(15) = 2279184
Time: 23.007802

So very similar performance for NQueens, Lace takes 23.01 seconds and Weave takes 24.37 seconds.

[tom@ryve weave]$ benchmarks/fibonacci/weave_fib 41
--------------------------------------------------------------------------
Scheduler:                                    Weave  (eager futures)
Benchmark:                                    Fibonacci
Threads:                                      6
Time(ms)                                      2662.712
Max RSS (KB):                                 25684
Runtime RSS (KB):                             23528
# of page faults:                             5943
--------------------------------------------------------------------------
n requested:                                  41
result:                                       165580141

[tom@ryve build]$ benchmarks/fib-lace 41 -w 6
fib(41) = 165580141
Time: 0.188222

For Fibonacci, which is very sensitive to overhead, Lace takes 188 ms and Weave takes 2663 ms.

trolando avatar May 31 '22 10:05 trolando

In case you wonder whether I cheat with fib: it is completely fine-grained:

TASK_1(int, pfib, int, n)
{
    if (n < 2) {
        return n;
    } else {
        int m, k;
        SPAWN(pfib, n - 1);     // offer pfib(n-1) for other workers to steal
        k = CALL(pfib, n - 2);  // execute pfib(n-2) directly ourselves
        m = SYNC(pfib);         // get the result of the spawned pfib(n-1)
        return m + k;
    }
}

trolando avatar May 31 '22 10:05 trolando

Very interesting! That sparked my interest in even seeing the assembly of both "contestants". The "variation" in performance seen for Weave is something I would like to investigate a bit.

@mratsim would this be something to investigate (maybe it has to do with the changes in Nim lately)?

Btw. @trolando do you use Clang or GCC compiler for both "contestants" on your machine?

dumblob avatar Jun 01 '22 09:06 dumblob

Unfortunately improving multithreading is not my priority at the moment, but I did review the code recently.

Here are Lace and Weave on my hardware (i9-9980XE, 18 cores): [two benchmark screenshots]

So Weave is 2x to 2.5x faster on nqueens but 2.5x to 10x (!) slower on fibonacci.

Weave has two configurations, "eager allocated futures" and "lazy allocated futures", with very different overheads (see the number of page faults). In the eager case, a future is allocated on the heap as soon as the task is scheduled. In the lazy case, it is allocated on the heap only if the task is stolen, which significantly reduces overhead for fibonacci (a pure-overhead benchmark).

The thing I'm not too sure about is that fib 40 used to take 180 to 220 ms with lazy futures (see https://github.com/mratsim/weave/pull/13) and 350 to 400 ms with eager futures (see https://github.com/mratsim/weave/issues/112) on the same machine.

I retested fibril (https://github.com/chaoran/fibril) and perf didn't change there, so I'm unsure what's up. Fibril has the lowest overhead I benchmarked on fibonacci when developing Weave, but it took an approach that requires platform-specific assembly, as it is continuation-stealing/parent-stealing based: [benchmark screenshot]

> Internally as described in my early PhD work, Lace uses a special queue that is optimized for this particular task. It's the topic of one of the papers. The infrastructure is very similar to Wool, a project by Karl Filip Faxen in 2010. This in turn was inspired on Cilk, which used to be part of the GCC compiler for a while. In my opinion, Lace and Wool offer the best performance for fine-grained task-based parallelism. (fork-join semantics) but of course the downside is the abuse of C macros...

Weave is based on the PhD thesis of @aprell (https://epub.uni-bayreuth.de/2990/; see the C implementation at https://github.com/aprell/tasking-2.0) and also on a collection of research gathered here: https://github.com/numforge/laser/blob/e23b5d6/research/runtime_threads_tasks_allocation_NUMA.md

The main internal characteristic is that it is based on fully private deques, and steal requests are actually work-sharing requests. Also, the theft policy (steal-one vs steal-half) is adaptive, and for-loops are partitioned lazily instead of eagerly, so they auto-adapt to the function being called and the resources available.

In terms of usage, it seems to me that Lace requires structured parallelism, i.e. a spawn has to be synced before function exit (async/finish from Habanero; see also structured concurrency in IO runtimes, https://en.wikipedia.org/wiki/Structured_concurrency). In Weave, this is the case for lazy futures (because of alloca), but in the default case futures can escape their allocating scope. This allows quite flexible usage in complex codebases, at the price of memory-allocation overhead. Also, Lace seems to keep idle threads spinning.

Some of the main changes made in Weave compared to the C code were:

  • removing locks, especially on channels and the theft queues
  • a custom memory pool, inspired by mimalloc (https://www.microsoft.com/en-us/research/uploads/prod/2019/06/mimalloc-tr-v1.pdf) to significantly reduce allocation overhead of tasks (plenty of description in markdown files here https://github.com/mratsim/weave/tree/7682784/weave/memory)
  • the backoff system

mratsim avatar Jan 18 '23 11:01 mratsim

> Also Lace seems to keep idle threads spinning.

Potentially related: https://github.com/trolando/lace/issues/9#issuecomment-1243562742

dumblob avatar Jan 21 '23 08:01 dumblob

@dumblob what about this commit https://github.com/trolando/lace/commit/67eaa423959b02397121d6f6a4f1ea3a0b5d5bbe

The idea is that if there are no tasks at all, the threads are suspended; as soon as there is any task (offered via the RUN macro), the threads are resumed until all offered tasks are done.

trolando avatar Mar 05 '23 14:03 trolando