Lace 2.0
Lace 2.0 is a work-in-progress redesign of the API. The problem to be solved is that in its current (pre-2.0) state, Lace is quite intrusive for a developer, as it uses macros to replace normal function definitions. While some amount of intrusion is inevitable, one might say it is excessive in the current form.
Example:
    TASK_1(int, fib, int, n) {
        if (n < 2) return n;
        SPAWN(fib, n - 1);
        int a = CALL(fib, n - 2);
        int b = SYNC(fib);
        return a + b;
    }
In a package such as Sylvan, this means that large parts of the source code consist of these kinds of macros, which is clearly difficult to read at first.
The reason for these macros is that they allow automatic injection of the pointers to the Lace worker struct and the task queue head as parameters of Lace tasks. This shaves off a small amount of work, because in particular the queue head does not need to be stored in memory, except possibly on the program stack. The gain is most visible in very fine-grained parallelism with tiny tasks, such as Fibonacci and n-queens.
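As a rough illustration of why this is cheap, the pre-2.0 macros effectively give every task two hidden parameters. The following is a simplified sketch with invented names, not the code Lace actually generates:

    /* Simplified sketch, hypothetical names: the worker pointer and the
     * queue head travel as hidden parameters, so they can stay in
     * registers instead of being reloaded from memory. */
    typedef struct Task { void *thief; long arg; } Task;   /* hypothetical */
    typedef struct Worker Worker;                          /* opaque here  */

    void push_task(Worker *w, Task *head, int (*fn)(Worker *, Task *, int), int arg);
    int  pop_or_steal(Worker *w, Task *head);

    int fib_impl(Worker *w, Task *head, int n) {
        if (n < 2) return n;
        push_task(w, head, fib_impl, n - 1);   /* SPAWN(fib, n-1): push at head */
        int a = fib_impl(w, head + 1, n - 2);  /* CALL(fib, n-2): plain call    */
        int b = pop_or_steal(w, head);         /* SYNC(fib): pop, or steal back */
        return a + b;
    }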
An easier API would be this:
    TASK_1(int, fib, int, n)

    int fib(int n) {
        fib_SPAWN(n - 1);
        int a = fib(n - 2);
        int b = fib_SYNC();
        return a + b;
    }
In this case, the Lace worker struct is a thread-local variable, which needs to be obtained each time SPAWN and SYNC are used, and the task queue head is stored inside the worker struct. So this requires a write operation to the worker struct each time SPAWN and SYNC are used, which was avoided before.
We still need some way of defining the helper functions for work stealing. This is now a TASK macro that simply defines a number of relatively small static inline functions, such as fib_SPAWN, fib_SYNC, and something like fib_RUN that ensures that the fib function is executed by a Lace worker. One can also just call fib directly, which required a CALL macro before.
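To make the trade-off concrete, here is a minimal sketch, again with hypothetical names, of the kind of static inline helpers such a TASK macro could generate; the thread-local lookup and the update of the queue head inside the worker struct are exactly the extra memory traffic discussed above:

    /* Hypothetical sketch of helpers a 2.0-style TASK_1 macro could emit;
     * names and details are invented, not the actual implementation. */
    typedef struct Task { void *thief; long arg; } Task;
    typedef struct Worker { Task *head; /* ... */ } Worker;

    extern _Thread_local Worker *lace_worker;

    void push_task(Worker *w, int (*fn)(int), int arg);  /* updates w->head */
    int  pop_or_steal(Worker *w);                        /* reads/writes w->head */
    int  fib(int n);

    static inline void fib_SPAWN(int n) {
        Worker *w = lace_worker;   /* thread-local lookup on every spawn   */
        push_task(w, fib, n);      /* queue head is written back to memory */
    }

    static inline int fib_SYNC(void) {
        Worker *w = lace_worker;   /* and again on every sync */
        return pop_or_steal(w);
    }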
What does this mean for performance? I ran a number of benchmarks on my desktop computer running Linux (Manjaro). See the following.
Results after 5 iterations, compiled with -O2 using gcc 12.1.1 and clang 14.0.6 (times; lower is better):

| benchmark | gcc lace-1 | gcc lace-2 | clang lace-1 | clang lace-2 |
|---|---:|---:|---:|---:|
| cholesky-1 | 6.69 | 6.53 | 5.13 | 5.45 |
| cholesky-2 | 3.41 | 3.32 | 2.62 | 2.79 |
| cholesky-6 | 1.24 | 1.25 | 1.00 | 1.02 |
| cholesky-seq-1 | 6.94 | 6.97 | 4.49 | 4.51 |
| fib-1 | 8.14 | 10.86 | 10.84 | 16.78 |
| fib-2 | 4.12 | 5.53 | 5.54 | 8.56 |
| fib-6 | 1.55 | 2.05 | 1.95 | 2.96 |
| fib-seq-1 | 3.74 | 3.71 | 5.15 | 5.15 |
| heat-1 | 0.73 | 0.73 | 0.50 | 0.50 |
| heat-2 | 0.40 | 0.40 | 0.36 | 0.36 |
| heat-6 | 0.35 | 0.35 | 0.35 | 0.35 |
| heat-seq-1 | 0.72 | 0.72 | 0.50 | 0.50 |
| integrate-1 | 2.84 | 2.98 | 2.99 | 3.41 |
| integrate-2 | 1.45 | 1.51 | 1.52 | 1.74 |
| integrate-6 | 0.50 | 0.53 | 0.53 | 0.60 |
| integrate-seq-1 | 2.67 | 2.69 | 2.67 | 2.68 |
| matmul-1 | 5.25 | 5.24 | 4.87 | 4.83 |
| matmul-2 | 2.67 | 2.67 | 2.49 | 2.45 |
| matmul-6 | 0.93 | 0.95 | 0.86 | 0.85 |
| matmul-seq-1 | 5.16 | 5.24 | 4.94 | 4.80 |
| nqueens-1 | 4.81 | 6.09 | 5.31 | 5.55 |
| nqueens-2 | 2.46 | 3.10 | 2.71 | 2.83 |
| nqueens-6 | 0.90 | 1.22 | 0.98 | 1.02 |
| nqueens-seq-1 | 4.87 | 4.88 | 4.06 | 4.05 |
| uts-t2l-1 | 21.04 | 21.27 | 21.26 | 21.29 |
| uts-t2l-2 | 10.71 | 10.80 | 10.90 | 10.85 |
| uts-t2l-6 | 3.74 | 3.81 | 3.75 | 3.75 |
| uts-t2l-seq-1 | 20.83 | 20.63 | 20.29 | 20.34 |
| uts-t3l-1 | 16.73 | 16.88 | 16.95 | 16.97 |
| uts-t3l-2 | 8.50 | 8.59 | 8.64 | 8.66 |
| uts-t3l-6 | 2.95 | 3.00 | 2.97 | 3.00 |
| uts-t3l-seq-1 | 16.52 | 16.48 | 16.28 | 16.38 |
From these benchmarks, it is clear that the very fine-grained benchmarks fib and nqueens, which consist of billions of nearly empty tasks, suffer a performance degradation of about 30% (e.g. fib-1 with gcc: 8.14 vs 10.86). The same holds to a lesser degree for integrate, a program that does some arithmetic work without accessing memory. The other benchmarks are not significantly affected by this change; in particular the challenging uts program (unbalanced tree search) shows no significant difference in performance or speedup.
Notice also that for fib, the performance difference is comparable to the difference between compilers. The reason the Fibonacci benchmark is so fast with current Lace is a performance optimization that gcc does better than clang. A caveat here is that I compiled with -O2 and not -O3; maybe I should rerun all benchmarks with -O3.
The question is now whether or not to go ahead with this new design, at the cost of performance of certain niche benchmark programs.
Interesting. I actually thought about this the first time I saw Lace, but I quickly realized that the macros are not nearly as bad as they looked at first sight.
One way or another, I would certainly welcome running the benchmarks with -O3, as that will give us more insight. I am afraid that it will not improve much (only "cosmetically") over -O2.
So instead I will play the devil's advocate and propose to stay with macros :wink:. The reasoning for slightly decreasing the use of macros feels kind of niche to me if there are no real issues with them in practice. I myself would be more interested in, e.g., saving power by spinning threads up and down at runtime, as discussed in https://github.com/trolando/lace/issues/9 :wink:.
In support of the macro side of the argument, feel free to take a look at https://github.com/Hirrolot/interface99 and https://github.com/Hirrolot/datatype99, two modern macro libraries for C.
Regarding #9: if no tasks are running, then you can already use lace_stop and lace_start to change the number of workers. What could be easy to implement is to have some maximum number of workers, start that many Lace threads, and then maintain a desired number of workers as follows (a rough sketch follows the list):
- if the desired number is above the active number of workers, wake up some Lace workers
- if the desired number is below the active number of workers, then when a worker has no more work, it goes to sleep
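A minimal sketch of that bookkeeping, with invented names (none of this is existing Lace API), could look like this:

    #include <stdatomic.h>

    /* Hypothetical: all worker threads are started up front; `desired`
     * says how many of them should currently be active. */
    static atomic_int desired_workers;

    void wake_sleeping_workers(void);   /* hypothetical: signal parked threads */

    /* Called by the application to request a new worker count. */
    void set_desired_workers(int n) {
        atomic_store(&desired_workers, n);
        wake_sleeping_workers();        /* covers the "wake up" case */
    }

    /* Called by a worker that just ran out of work: workers with an id
     * at or above the desired count park instead of trying to steal. */
    int worker_should_sleep(int my_id) {
        return my_id >= atomic_load(&desired_workers);
    }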
The hard part is if we want to remove workers on-the-fly while they are doing work: this work would then need to be executed by another worker, and cleanly disposing of the current work will be a challenge; I don't know if there is an efficient way to do this. Also, I'm not sure if Lace is the right framework for this. Lace is based on busy-waiting workers, which is not suitable for mobile platforms; the intended application is large computations performed on multi-core machines.
Regarding the linked repositories, I like those READMEs and maybe I should improve the README.md on my own projects a bit.
Regarding the Lace 2.0 API: so far I find it cleaner. The TASK macro now simply defines a few additional methods and can just be added to header files, and I suspect that it is also easier for IDEs to provide code suggestions, although I have not checked this as I use vim for C development... I am curious what users of Sylvan think of changing Lace in this way, such as @tquatmann, @sjunges, @SSoelvsten.
Yes, the macros feel painful at first, but Lace is a pretty low-level library anyway, and it will mostly be used by people who want to optimize for performance. I don't know whether changing the API helps that much: maybe just adding this to the documentation would help people understand what the "C-style" semantics of these macros are.
The points made in #9 resonate quite well with the discussion we are having in the linked Storm thread.
@trolando Happy birthday, btw! :-)
The following should be taken with quite some reservations, considering I've only been reading Sylvan's source code, not actually writing anything using Lace.
I would not say that the new API is drastically better or drastically easier to read: slightly, but not a lot. The fib_SPAWN and fib_SYNC functions still kind of come out of the blue, and they still require an understanding of work stealing. The auto-completion in Visual Studio Code (and presumably similar IDEs) does work slightly better, but type inference seems to work equally well for both.
As Sebastian Junges also said: maybe one cannot make Lace much better at the level of abstraction it is at. A lot can probably be done merely by providing more documentation: for example, expand the Fibonacci example into a larger Markdown example, with some drawings, introducing one concept (SPAWN+SYNC, CALL, etc.) at a time.
In many cases, the pattern is always "spawn n-1 recursive calls and run the n'th on the same thread". I wonder whether it is possible to provide a higher-level abstraction of this pattern, i.e. something of the form:
    TASK_1(int, fib, int, n)
    {
        int a, b = fib(n - 1, n - 2);
        return a + b;
    }
Finally, you may get a better idea of whether this is a good change after looking at how it affects Sylvan, i.e. both its code clarity and its performance.
Improving the documentation is a good idea. I've updated the README.md on the master branch.
The original reason for working on a new API is that the macros can be confusing for novice developers working on Sylvan or LTSmin for their student projects.
The par construct that was just added to the Flix language might also be relevant for an intuitive design: https://github.com/flix/flix/blob/master/main/src/library/RedBlackTree.flix#L423