Wybe program runs slower on system with more cores
This is a bit odd. The same program can become slower when there are more CPU cores in the system. I used the same physical machine to run a VM and allocated a different number of CPU cores to it each time. Then I ran the same Wybe program on it, and here are the results:
| Number of cores | Time used |
|---|---|
| 1 | 54s |
| 2 | 58s |
| 6 | 128s |
| 12 | 220s |
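To make the trend easier to read, here is a small sketch that turns the measured times above into slowdown factors relative to the 1-core run. The function name is only illustrative; the numbers are the ones from the table.

```python
# Timings copied from the table above: number of cores -> seconds.
timings = {1: 54, 2: 58, 6: 128, 12: 220}


def slowdown_vs_baseline(timings, baseline_cores=1):
    """Return {cores: time / time of the baseline run}."""
    base = timings[baseline_cores]
    return {cores: secs / base for cores, secs in timings.items()}


for cores, factor in slowdown_vs_baseline(timings).items():
    print(f"{cores:>2} cores: {factor:.2f}x the 1-core time")
```

By this measure the 12-core run takes roughly four times as long as the 1-core run.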
I suspect that the issue is caused by the gc, because:
- The slowdown is worse for programs that allocate more memory.
- A Wybe program creates as many threads as there are cores in the system.
After a quick look at the BDW GC docs, it seems that the gc should be single-threaded, with incremental mode disabled by default. I'm not sure why this happens, so I will keep investigating.
Interesting. Can you try a compute-bound process and see if that behaves differently? Could it be a VM issue?
I don't think it's a VM issue. It's very hard to find machines that are identical except for the number of cores, so I tried different virtualization technologies, and the issue still exists. I also tried running it on my friend's MacBook, which has 8 cores (16 logical cores), and it was about 3x slower than on mine (2 cores, 4 logical cores, and a much older CPU than my friend's).
I didn't find a way to hide cores from a program.
I tried using `taskset` to force the program to run on a single given core. Since the program still knows how many cores the system has, it starts that many threads as usual. It still runs slower than in the 1-core case, but faster than without `taskset` (which I think is odd, since now there are 12 threads squashed onto 1 core).
I also tried `echo 0 | sudo tee /sys/devices/system/cpu/cpu1/online` to disable the other cores, and the results were the same as with `taskset`.
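The difference between "cores the system has" and "cores this process may run on" can be observed directly. The following is a minimal sketch, not part of Wybe or bdwgc; the function name is made up for illustration, and `os.sched_getaffinity` is only available on Linux and some other Unixes, so the code falls back to the system-wide count elsewhere.

```python
import os


def visible_vs_usable_cpus():
    """Return (logical CPUs the system reports, CPUs this process may run on)."""
    total = os.cpu_count() or 1
    if hasattr(os, "sched_getaffinity"):  # not available on macOS or Windows
        usable = len(os.sched_getaffinity(0))
    else:
        usable = total  # no affinity API here: assume all CPUs are usable
    return total, usable


if __name__ == "__main__":
    total, usable = visible_vs_usable_cpus()
    print(f"system reports {total} logical CPUs; this process may use {usable}")
```

Running this under `taskset -c 0` on a typical Linux setup should show the usable count drop to 1 while the system-wide count stays the same, although the exact behaviour can depend on the libc and Python versions. A runtime that sizes its thread pool from the system-wide count would therefore still start one thread per core.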
Removing the bdw gc did solve this issue.
The parallel marking of bdw gc is definitely the cause. The issue can be avoided by setting the environment variable `GC_NPROCS=1`.
However, I am not sure how to fix this properly. I couldn't find any API to set `GC_NPROCS` at runtime other than this environment variable, and building a single-threaded bdw-gc seems a bit awkward to fit into our current makefile.
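Until there is a runtime API, one way to apply the environment-variable workaround without rebuilding anything is a small launcher that sets `GC_NPROCS=1` for the child process. This is only a sketch: the function name is hypothetical, and it assumes the program links a bdw gc build that reads `GC_NPROCS` at startup, as described above.

```python
import os
import subprocess


def run_with_single_marker(argv):
    """Run argv with GC_NPROCS=1 in its environment and return the exit code."""
    env = dict(os.environ)   # copy, so the launcher's own environment is untouched
    env["GC_NPROCS"] = "1"   # ask bdw gc to behave as if there were one processor
    return subprocess.run(argv, env=env).returncode
```

For example, `run_with_single_marker(["./my_wybe_program"])` would run a (hypothetical) compiled Wybe binary with parallel marking effectively disabled.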
Here are some related docs for your reference (by the way, I think the bdw gc docs are a bit confusing):
https://github.com/ivmai/bdwgc/blob/7c063c84039c95bd065a35732311646c69c3026e/doc/README.autoconf#L58-L59
It is not recommended to turn off parallel marking for multiprocessors unless a poor support of the feature on the platform.
https://github.com/ivmai/bdwgc/blob/98ac50b6311219b8f57a794deb9a72d2a25b23ce/include/gc.h#L91-L102
/* GC is parallelized for performance on */
/* multiprocessors. Set to a non-zero value */
/* only implicitly if collector is built with */
/* PARALLEL_MARK defined, and if either */
/* GC_MARKERS (or GC_NPROCS) environment */
/* variable is set to > 1, or multiple cores */
/* (processors) are available. The getter does */
/* not use or need synchronization (i.e. */
/* acquiring the GC lock). GC_parallel value */
/* is equal to the number of marker threads */
/* minus one (i.e. number of existing parallel */
/* marker threads excluding the initiating one).*/
https://www.hboehm.info/gc/faq.html
For a single-threaded application, use a GC library without thread support. If this is inconvenient, use gc_local_alloc.h.
(But I couldn't find `gc_local_alloc.h`.)
Still, I don't think it's reasonable for enabling parallel marking to have such a huge cost. Should we ask about this in the bdwgc repo?
Update:
I created an issue in the bdwgc repo, and I think for now the only thing we can do is disable parallel marking. Currently, the only way to disable it at runtime is to set the environment variable `GC_NPROCS` or `GC_MARKERS`.
However, I found a new issue: the call to `gc_init` is removed by the optimizer.
https://github.com/pschachte/wybe/blob/e246cf326eb749f81f6d142142e1d218493dc035/src/Builder.hs#L859-L861
I'll try to fix this one before adding that workaround to mitigate the gc issue.
Hi Zed,
I believe the call to GC_INIT is made from the generated main() function; if not, that’s where it should be created. If it’s generated as LPVM code and is being optimised away, you should be able to fix that by threading the io state through it as two extra arguments (when the LLVM code is generated, any phantom arguments are removed, so that shouldn’t cause any problems, but should prevent it from being optimised away).
-Peter Schachte
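The idea of threading the io state through the call can be illustrated with a toy dead-call eliminator. This is a deliberately simplified model, not Wybe's actual LPVM optimiser; the `Call` type, the function name, and the variable names `io0`, `io1`, `io2` are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    name: str
    inputs: tuple = ()
    outputs: tuple = ()


def eliminate_dead_calls(body, live_out):
    """Keep only calls that produce a value needed later, scanning backwards."""
    live = set(live_out)
    kept = []
    for call in reversed(body):
        if any(out in live for out in call.outputs):
            live.difference_update(call.outputs)
            live.update(call.inputs)
            kept.append(call)
    kept.reverse()
    return kept


# Without state threading, gc_init has no outputs, so nothing depends on it.
unthreaded = [
    Call("gc_init"),
    Call("print", inputs=("io0",), outputs=("io1",)),
]

# With the io state threaded through, print consumes gc_init's output.
threaded = [
    Call("gc_init", inputs=("io0",), outputs=("io1",)),
    Call("print", inputs=("io1",), outputs=("io2",)),
]

print([c.name for c in eliminate_dead_calls(unthreaded, {"io1"})])
print([c.name for c in eliminate_dead_calls(threaded, {"io2"})])
```

In this model the unthreaded `gc_init` is dropped because it produces nothing that is needed, while the threaded version survives because the final io state depends on it.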
In reply to the comment by Zed (Zijun) Chen of Tuesday, 9 June 2020 at 00:18 on pschachte/wybe issue #59, "Wybe program runs slower on system with more cores".
Some updates: there is a new API in bdw gc that can adjust the number of markers at initialization. However, we currently use the gc packaged by Homebrew on macOS and by apt on Ubuntu, so I don't think we can get that API soon. What we really need is https://github.com/ivmai/bdwgc/issues/328
Is this still a problem?
Well, the current fix is just a workaround (forcing the GC to use only one thread for marking). A proper fix requires the new bdw gc feature: Reduce number of parallel markers depending on heap size.