fix #21426: Windows GC only wakes up one thread when doing parallel scanning
use a manual-reset event to wake up all background scanning threads, because core.sync.event's Event.setIfInitialized otherwise wakes up only one thread. These threads keep spinning until busyThreads becomes 0, which only happens after the event is manually reset by the main scanning thread.
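For reference, here is a minimal, self-contained sketch (not the druntime GC code) contrasting the two modes of core.sync.event.Event, which wraps a Win32 event on Windows: with auto reset, set() releases a single waiter and the event clears itself; with manual reset, set() releases every waiter until reset() is called explicitly.

```d
// Illustrative sketch only: a manual-reset Event releases all waiting
// threads on a single set(), whereas an auto-reset Event would release
// only one of them per set().
import core.atomic : atomicLoad, atomicOp;
import core.sync.event : Event;
import core.thread : Thread;
import core.time : msecs;

void main()
{
    shared int woken;
    Event ev;
    ev.initialize(true /* manualReset */, false /* initialState */);

    Thread[] workers;
    foreach (i; 0 .. 4)
        workers ~= new Thread({
            ev.wait();                  // all workers park here
            atomicOp!"+="(woken, 1);
        }).start();

    Thread.sleep(100.msecs);            // let the workers reach ev.wait()
    ev.set();                           // manual reset: wakes all four workers
    foreach (t; workers)
        t.join();
    ev.reset();                         // stays signaled until reset explicitly
    assert(atomicLoad(woken) == 4);
}
```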
Thanks for your pull request, @rainers!
Bugzilla references
Your PR doesn't reference any Bugzilla issue.
If your PR contains non-trivial changes, please reference a Bugzilla issue or create a manual changelog.
⚠️⚠️⚠️ Warnings ⚠️⚠️⚠️
- In preparation for migrating from Bugzilla to GitHub Issues, the issue reference syntax has changed. Please add the word "Bugzilla" to issue references. For example,
`Fix Bugzilla Issue 12345` or `Fix Bugzilla 12345`. (Reminder: the edit needs to be done in the Git commit message, not the GitHub pull request.)
Testing this PR locally
If you don't have a local development environment setup, you can use Digger to test this PR:
dub run digger -- build "master + dmd#21428"
Too late/risky for stable?
So what I don't like about this fix is that once a GC scan is done and evDone is set, the threads will go through loops of starting and stopping until the main thread resets the event. In the new GC I'm using a manual-reset event, but it is set when data is in the queue and reset when the queue is empty. This allows me to use it as a synchronization primitive for waiting for data to appear.
If I don't misunderstand the code, all the threads set the done event. It's also an auto-reset event. The whole mechanism seems very weird to me.
FWIW, this is what I'm doing in the new GC: https://github.com/symmetryinvestments/symgc/blob/4989a9c2307fb3193fd05de28a9c0beb8919f584/source/d/gc/scanner.d#L19
The evDone event is only used for a controlled shutdown of the GC. At the end of the scan, when the scanStack runs dry, evDone.wait is used as a one-millisecond "sleep". I guess this can be handled more efficiently, but I'm not sure whether always signaling on empty/non-empty transitions is more efficient in all situations, e.g. when a long chain of nodes in a list results in only a few entries being on the stack at a time. A small amount of work might be handled more efficiently in a single thread.
Anyway, I think that is a separate issue from the threads not running at all.
> The evDone event is only used for a controlled shutdown of the GC.
It's set by every background thread whenever the thread has exited the scan section.
https://github.com/dlang/dmd/blob/8762d1eaee42119269b82a1b1b7063c89c1e6a69/druntime/src/core/internal/gc/impl/conservative/gc.d#L3736
My understanding is that evDone is there to get all the threads back to sleeping on the evStart event.
I'm also concerned about the way busyThreads is managed. I see a potential case where there is data left to scan, but a perfect storm happens: all the threads become unbusy even though that data is still there, and the threads exit.
For instance, if you inserted a sleep here, might it be more likely to expose this race?
The whole thing seems very wrong to me.
In any case, another issue for another time. Clearly it doesn't happen very much, or at all, as missing a scan of data could be disastrous for the GC.
> My understanding is that evDone is there to get all the threads back to sleeping on the evStart event.
Yes, it stops the eventual wait(1ms) while the stack is empty but other threads are still scanning. This currently happens one at a time (the first thread to notice that there is no more work wakes up the second, and so on), so it is potentially not optimal (manual reset might help), but the main thread only waits for this when shutting down the GC. Note that the mark function also runs until it cannot pop anything from the stack anymore.
> I'm also concerned about the way busyThreads is managed.
I think it is ok, but probably too complicated. I can try a simpler approach in another PR.
Hold off merging; I have run our benchmarks, and these are the results:
MIN conalloc 0.752 s, 5 MB, 1965 GC 226 ms, Pauses 223 ms < 0 ms
MIN conappend 0.182 s, 5 MB, 445 GC 51 ms, Pauses 49 ms < 0 ms
MIN concpu 0.734 s, 5 MB, 2251 GC 237 ms, Pauses 234 ms < 0 ms
MIN conmsg 1.300 s, 12 MB, 121 GC 22 ms, Pauses 16 ms < 0 ms
MIN dlist 1.109 s, 22 MB, 51 GC 86 ms, Pauses 84 ms < 2 ms
MIN huge_single 0.025 s, 1501 MB, 2 GC 0 ms, Pauses 0 ms < 0 ms
MIN rand_large 1.652 s, 165 MB, 8476 GC 440 ms, Pauses 194 ms < 0 ms
MIN rand_small 3.422 s, 22 MB, 3596 GC 465 ms, Pauses 210 ms < 0 ms
MIN slist 1.082 s, 35 MB, 19 GC 48 ms, Pauses 46 ms < 3 ms
MIN testgc3 1.603 s, 330 MB, 11 GC 143 ms, Pauses 139 ms < 36 ms
MIN tree1 0.347 s, 12 MB, 34 GC 48 ms, Pauses 47 ms < 2 ms
MIN tree2 0.514 s, 1 MB, 216 GC 9 ms, Pauses 7 ms < 0 ms
MIN vdparser 1.267 s, 70 MB, 13 GC 32 ms, Pauses 27 ms < 3 ms
MIN words 0.721 s, 396 MB, 5 GC 11 ms, Pauses 9 ms < 7 ms
while these are the results with the automatic reset (scanning only with 2 threads):
MIN conalloc 0.651 s, 5 MB, 1723 GC 158 ms, Pauses 156 ms < 0 ms
MIN conappend 0.169 s, 5 MB, 358 GC 37 ms, Pauses 36 ms < 0 ms
MIN concpu 0.679 s, 5 MB, 2100 GC 190 ms, Pauses 188 ms < 0 ms
MIN conmsg 1.294 s, 12 MB, 64 GC 11 ms, Pauses 7 ms < 0 ms
MIN dlist 1.111 s, 22 MB, 51 GC 84 ms, Pauses 81 ms < 2 ms
MIN huge_single 0.023 s, 1501 MB, 2 GC 0 ms, Pauses 0 ms < 0 ms
MIN rand_large 1.645 s, 162 MB, 8805 GC 416 ms, Pauses 180 ms < 0 ms
MIN rand_small 3.299 s, 22 MB, 3596 GC 429 ms, Pauses 186 ms < 0 ms
MIN slist 1.071 s, 35 MB, 19 GC 47 ms, Pauses 46 ms < 3 ms
MIN testgc3 1.619 s, 330 MB, 11 GC 116 ms, Pauses 112 ms < 33 ms
MIN tree1 0.338 s, 12 MB, 34 GC 34 ms, Pauses 33 ms < 1 ms
MIN tree2 0.499 s, 1 MB, 216 GC 4 ms, Pauses 3 ms < 0 ms
MIN vdparser 1.313 s, 70 MB, 13 GC 82 ms, Pauses 79 ms < 13 ms
MIN words 0.714 s, 396 MB, 5 GC 10 ms, Pauses 8 ms < 7 ms
Somehow overall pause time seems to have increased. I suspect the additional evDone sleeps for more threads are causing the slowdown. I'll have to check later...
I think it might be better to just use an event for synchronization, instead of the 1ms sleeps. This will wake up a lot of threads unnecessarily.
Honestly, I don't get the sync setup here. I would think just one event is needed, and all it does is say "there is stuff ready to scan", and it's set when you add something to the list, and reset when you remove the last thing from the list. Just that event and the stopGC flag I think would be enough?
Note that an auto-reset event might actually be appropriate in that case! The thread that wakes up can re-trigger it if there's still data left.
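A minimal, self-contained sketch of that one-event scheme (hypothetical names such as WorkQueue and dataReady; this is not the druntime scan stack): the manual-reset event is set when work is pushed and reset when the last entry is popped, and a stop flag handles shutdown.

```d
// Illustrative sketch of "one event that says there is stuff ready to scan".
import core.atomic : atomicLoad, atomicOp, atomicStore;
import core.sync.event : Event;
import core.sync.mutex : Mutex;
import core.thread : Thread;

struct WorkQueue
{
    private int[] items;
    private Mutex mtx;
    Event dataReady;      // manual reset: stays set while the queue is non-empty
    shared bool stop;

    void initialize()
    {
        mtx = new Mutex;
        dataReady.initialize(true /* manualReset */, false /* initialState */);
    }

    void push(int item)
    {
        mtx.lock();
        scope (exit) mtx.unlock();
        items ~= item;
        dataReady.set();              // "there is stuff ready to scan"
    }

    bool pop(out int item)
    {
        mtx.lock();
        scope (exit) mtx.unlock();
        if (items.length == 0)
            return false;
        item = items[$ - 1];
        items = items[0 .. $ - 1];
        if (items.length == 0)
            dataReady.reset();        // removed the last entry
        return true;
    }
}

void worker(ref WorkQueue q, ref shared int done)
{
    while (!atomicLoad(q.stop))
    {
        q.dataReady.wait();           // sleeps while the queue is empty
        int item;
        while (q.pop(item))
            atomicOp!"+="(done, 1);   // stand-in for scanning the popped range
    }
}

void main()
{
    WorkQueue q;
    q.initialize();
    shared int done;

    Thread[] threads;
    foreach (i; 0 .. 2)
        threads ~= new Thread({ worker(q, done); }).start();

    foreach (i; 0 .. 1000)
        q.push(i);

    while (atomicLoad(done) < 1000)
        Thread.yield();

    atomicStore(q.stop, true);
    q.dataReady.set();                // wake everyone so they can observe stop
    foreach (t; threads)
        t.join();
}
```

With the manual-reset event, any set() while the queue is non-empty can release every waiting worker at once; the auto-reset variant sketched further down wakes them one at a time instead.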
> I would think just one event is needed, and all it does is say "there is stuff ready to scan", and it's set when you add something to the list, and reset when you remove the last thing from the list. Just that event and the stopGC flag I think would be enough?
I implemented that, but results for short GC cycles get worse:
MIN conalloc 0.863 s, 5 MB, 3356 GC 354 ms, Pauses 350 ms < 0 ms
MIN conappend 0.202 s, 5 MB, 590 GC 72 ms, Pauses 70 ms < 0 ms
MIN concpu 0.883 s, 5 MB, 3193 GC 364 ms, Pauses 359 ms < 0 ms
MIN conmsg 1.357 s, 12 MB, 79 GC 17 ms, Pauses 13 ms < 0 ms
MIN dlist 1.128 s, 22 MB, 51 GC 91 ms, Pauses 89 ms < 2 ms
MIN huge_single 0.025 s, 1501 MB, 2 GC 0 ms, Pauses 0 ms < 0 ms
MIN rand_large 1.705 s, 165 MB, 6884 GC 593 ms, Pauses 281 ms < 0 ms
MIN rand_small 4.133 s, 22 MB, 3596 GC 555 ms, Pauses 230 ms < 0 ms
MIN slist 1.056 s, 35 MB, 19 GC 45 ms, Pauses 44 ms < 2 ms
MIN testgc3 1.508 s, 330 MB, 11 GC 45 ms, Pauses 42 ms < 9 ms
MIN tree1 0.334 s, 12 MB, 34 GC 33 ms, Pauses 31 ms < 1 ms
MIN tree2 0.727 s, 1 MB, 216 GC 5 ms, Pauses 4 ms < 0 ms
MIN vdparser 1.276 s, 70 MB, 13 GC 29 ms, Pauses 24 ms < 3 ms
MIN words 0.721 s, 396 MB, 5 GC 10 ms, Pauses 8 ms < 7 ms
The benchmarks are not very reproducible, as can be seen from the differing numbers of GC invocations, but the trend seems to be: tests that run only a few but longer GC cycles (testgc3, vdparser) benefit from the parallel scanning (16 threads in my case), while others with many but shorter collections take longer overall. I suspect that comes from the synchronization overhead.
> Note that an auto-reset event might actually be appropriate in that case! The thread that wakes up can re-trigger it if there's still data left.
Indeed, maybe too many threads are fighting over the next stack entry.
Switched back to master, as it might not come without side-effects. I also pushed the simplified event handling.
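To illustrate the re-trigger idea, here is a variation of the hypothetical WorkQueue sketch above: dataReady is created as an auto-reset event (initialize(false, false)), and pop re-arms it whenever entries remain, so each push wakes at most one additional worker instead of the whole pool. (Shutdown then needs one set() per worker, or a separate stop event.)

```d
// Variation of the hypothetical WorkQueue above, with
// dataReady.initialize(false /* manualReset */, false).
bool pop(out int item)
{
    mtx.lock();
    scope (exit) mtx.unlock();
    if (items.length == 0)
        return false;
    item = items[$ - 1];
    items = items[0 .. $ - 1];
    if (items.length > 0)
        dataReady.set();    // re-arm: hand the baton to exactly one more worker
    return true;
}
```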
> Note that an auto-reset event might actually be appropriate in that case! The thread that wakes up can re-trigger it if there's still data left.
> Indeed, maybe too many threads are fighting over the next stack entry.
Implemented that, the results improve quite a bit:
MIN conalloc 0.775 s, 5 MB, 2963 GC 278 ms, Pauses 275 ms < 0 ms
MIN conappend 0.196 s, 5 MB, 758 GC 73 ms, Pauses 71 ms < 0 ms
MIN concpu 0.795 s, 5 MB, 3548 GC 305 ms, Pauses 302 ms < 0 ms
MIN conmsg 1.267 s, 12 MB, 58 GC 10 ms, Pauses 6 ms < 1 ms
MIN dlist 1.088 s, 22 MB, 51 GC 69 ms, Pauses 67 ms < 1 ms
MIN huge_single 0.024 s, 1501 MB, 2 GC 0 ms, Pauses 0 ms < 0 ms
MIN rand_large 1.578 s, 162 MB, 8737 GC 439 ms, Pauses 200 ms < 0 ms
MIN rand_small 3.284 s, 22 MB, 3596 GC 393 ms, Pauses 169 ms < 0 ms
MIN slist 1.075 s, 35 MB, 19 GC 48 ms, Pauses 47 ms < 3 ms
MIN testgc3 1.524 s, 330 MB, 11 GC 42 ms, Pauses 38 ms < 8 ms
MIN tree1 0.349 s, 12 MB, 85 GC 58 ms, Pauses 56 ms < 1 ms
MIN tree2 0.499 s, 1 MB, 216 GC 4 ms, Pauses 3 ms < 0 ms
MIN vdparser 1.253 s, 70 MB, 13 GC 26 ms, Pauses 22 ms < 2 ms
MIN words 0.722 s, 396 MB, 5 GC 10 ms, Pauses 9 ms < 7 ms
It doesn't seem to harm the short GC cycles anymore, but the longer ones benefit from parallelism. BTW: do you have a more reliable set of benchmarks?
I don't have any benchmarks that are public. The fluctuation is pretty high! For instance, you had conalloc going from 1723 cycles to 3356 cycles. How is that possible? If I read the code correctly, it's not even allocating that many times? 2500 allocations?
Ohh! There is this loop: https://github.com/dlang/dmd/blob/8762d1eaee42119269b82a1b1b7063c89c1e6a69/druntime/benchmark/gcbench/conalloc.d#L39-L44
So this is going to be a huge wildcard. Why do we have this loop in there?
> So this is going to be a huge wildcard. Why do we have this loop in there?
I guess to get at least some GC activity. If I disable the loop, I get
MIN conalloc 0.469 s, 5 MB, 1 GC 0 ms, Pauses 0 ms < 0 ms
MIN conappend 0.122 s, 5 MB, 40 GC 6 ms, Pauses 5 ms < 0 ms
MIN concpu 0.469 s, 5 MB, 1 GC 0 ms, Pauses 0 ms < 0 ms
I remember that I used to mostly ignore these concurrency tests as they yield inconsistent results.
Maybe up the N value to get more activity? Or limit the number of times this loop can run?
In any case, I don't think we can trust the total time numbers there as it will fluctuate with the number of times that loop can trigger a GC, but maybe dividing the time by the number of GC cycles is a reasonable measurement?
Doing a re-read of the whole parallel section, I think it looks a lot better. I think the busyThreads race is gone now, and the event for data ready looks much cleaner.
I have a bit of concern about reading the length as 0 without atomics, and I think the evDone event can just be removed.
> Maybe up the N value to get more activity? Or limit the number of times this loop can run?
> In any case, I don't think we can trust the total time numbers there as it will fluctuate with the number of times that loop can trigger a GC, but maybe dividing the time by the number of GC cycles is a reasonable measurement?
I think these tests are too CPU-heavy and don't really test the GC. Increasing N will only make the execution time explode, with only a fraction of it spent in the GC, so we should just ignore them for now.
I made the "ultimate" test: building all the Phobos unittests with all files on a single command line. Here are the results with different GC options:
| command | process mem | time (min:sec) | GC statistics |
| --- | --- | --- | --- |
| dmd | 19.4 GB | 1:58.830 | GC memory: 10187 MB, 1 GC 3212 ms, Pauses 1772 ms < 1772 ms |
| dmd -lowmem parallel:0 | 12.3 GB | 2:50.590 | GC memory: 7883 MB, 18 GC 46983 ms, Pauses 45594 ms < 7100 ms |
| dmd -lowmem parallel:1 | 12.3 GB | 2:31.680 | GC memory: 7883 MB, 18 GC 28956 ms, Pauses 27538 ms < 4231 ms |
| dmd -lowmem parallel:4 | 12.3 GB | 2:18.060 | GC memory: 7883 MB, 18 GC 17524 ms, Pauses 16081 ms < 2404 ms |
| dmd -lowmem parallel:7 | 12.3 GB | 2:17.120 | GC memory: 7883 MB, 18 GC 12620 ms, Pauses 11117 ms < 1585 ms |
| dmd -lowmem parallel:11 | 12.3 GB | 2:10.860 | GC memory: 7883 MB, 18 GC 11585 ms, Pauses 10079 ms < 1409 ms |
| dmd -lowmem parallel:15 | 12.3 GB | 2:12.030 | GC memory: 7883 MB, 18 GC 11766 ms, Pauses 10310 ms < 1433 ms |
| dmd -lowmem disable:1 | 20.0 GB | 2:17.370 | GC memory: 15435 MB, 1 GC 5564 ms, Pauses 4770 ms < 4770 ms |
| dmd -lowmem gc:precise | 12.4 GB | 2:21.620 | GC memory: 7883 MB, 18 GC 12570 ms, Pauses 11201 ms < 1563 ms |
| dmd BumpGC | 30.0 GB | 2:01.140 | |
| dmd cleanup:none | 19.4 GB | 1:55.280 | GC memory: 10187 MB, 0 GC 0 ms, Pauses 0 ms < 0 ms |
- dmd is compiled from this branch, using a host dmd that was also built from this branch.
- this is run on a mobile AMD Ryzen 7 8845HS with 8 cores + HT and 64 GB RAM. The latter causes the GC to run rather seldom.
- measurements can vary by a couple of seconds
- parallel:1 should be similar to the currently released version
- GC and pause times scale rather well up to parallel:7 (8 threads), though a bit more slowly than 1/threads. Beyond that there is not much improvement, probably because hyper-threads are not full cores.
- even though dmd is meant to use the bump allocator by default, it still allocates half of its memory through the (disabled) GC. This is because the lowered druntime calls (e.g. for new class) no longer use the callbacks that dmd hooks into.
- "BumpGC" selects a GC that just forwards malloc to the bump allocator. It is not faster than the version using the GC for half the allocations, but uses about 10 GB more memory. I'm not sure why that is; I don't think there are that many array operations that might reallocate instead of appending.
- cleanup:none skips the final collection before shutdown, which is probably also the longest one; a simple win of up to 2% of execution time
> I have a bit of concern about reading the length as 0 without atomics, and I think the evDone event can just be removed.
I think the check while the lock is held is good enough, too. That should cause no big penalty as the thread will wait for the next evStackFilled event anyway if the stack is actually empty.
> - even though dmd is meant to use the bump allocator by default, it still allocates half of its memory through the (disabled) GC. This is because the lowered druntime calls (e.g. for new class) no longer use the callbacks that dmd hooks into.
In a recent DLF meeting, we talked about switching the bump GC to be a full GC interface, so we don't have to hook any of these things, just change the underlying allocator. Really, a bump allocator is a GC, just garbage is collected at the end of the process.
> In a recent DLF meeting, we talked about switching the bump GC to be a full GC interface, so we don't have to hook any of these things, just change the underlying allocator. Really, a bump allocator is a GC, just garbage is collected at the end of the process.
Some GC operations, like GC.realloc, are not easily implemented with a bump allocator if you don't keep track of allocation sizes. D arrays can also use huge amounts of memory if they cannot query allocation sizes. Having to support this reduces the benefits of the bump allocator. (I don't know why the test above worked with the "BumpGC"; later it crashed due to the realloc issue.)
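As a toy illustration of that realloc problem (this is not dmd's actual bump allocator): without a per-allocation size header or side table, a bump allocator cannot implement a GC-style realloc(p, newSize), because it doesn't know how many bytes to preserve; the caller has to pass the old size, or the allocator has to store sizes after all.

```d
// Toy bump allocator: no per-allocation metadata, so sizes are not
// recoverable later and "free" only happens for everything at once.
import core.stdc.stdlib : malloc;
import core.stdc.string : memcpy;

struct BumpAllocator
{
    ubyte* base;
    size_t used;
    size_t capacity;

    void initialize(size_t cap)
    {
        base = cast(ubyte*) malloc(cap);
        capacity = cap;
    }

    void* allocate(size_t size)
    {
        // round up to 16 bytes; no header is written, so the size is lost
        size = (size + 15) & ~cast(size_t) 15;
        if (used + size > capacity)
            return null;
        auto p = base + used;
        used += size;
        return p;
    }

    void* reallocate(void* p, size_t oldSize, size_t newSize)
    {
        // the caller must supply oldSize because nothing was recorded;
        // a realloc(p, newSize) API cannot be built on top of this without
        // adding size headers or an external size table
        auto q = allocate(newSize);
        if (q !is null && p !is null)
            memcpy(q, p, oldSize < newSize ? oldSize : newSize);
        return q;
    }

    // everything is released at once, e.g. at process shutdown
    void deallocateAll() { used = 0; }
}

void main()
{
    BumpAllocator a;
    a.initialize(1 << 20);
    auto p = a.allocate(100);
    p = a.reallocate(p, 100, 200);   // always copies; cannot extend in place
    assert(p !is null);
}
```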
@schveiguy I have now pulled the locking out of the scan stack, so evStackFilled could be converted back to a manual-reset event and busyThreads is changed under the same lock. It should be very similar to your new GC now.
Ah nice, I see evDone is now only waited on by the main thread. This looks much better. I'm going to give it another once-over.
The tracking of busy threads looks much better, BTW!
thank you @rainers!
Thanks @schveiguy for the reviews.