fix #21426: Windows GC only wakes up one thread when doing parallel scanning
use a manual-reset event to wake up all background scanning threads, because core.sync.event's Event.setIfInitialized otherwise wakes up only one thread. These threads keep spinning until busyThreads becomes 0, which only happens after the event is manually reset by the main scanning thread.
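For reference, here is a minimal, self-contained sketch (not the druntime GC code) contrasting the two modes of core.sync.event.Event, which wraps a Win32 event on Windows: with auto reset, set() releases a single waiter and the event clears itself; with manual reset, set() releases every waiter until reset() is called explicitly.

```d
// Illustrative sketch only: a manual-reset Event releases all waiting
// threads on a single set(), whereas an auto-reset Event would release
// only one of them per set().
import core.atomic : atomicLoad, atomicOp;
import core.sync.event : Event;
import core.thread : Thread;
import core.time : msecs;

void main()
{
    shared int woken;
    Event ev;
    ev.initialize(true /* manualReset */, false /* initialState */);

    Thread[] workers;
    foreach (i; 0 .. 4)
        workers ~= new Thread({
            ev.wait();                  // all workers park here
            atomicOp!"+="(woken, 1);
        }).start();

    Thread.sleep(100.msecs);            // let the workers reach ev.wait()
    ev.set();                           // manual reset: wakes all four workers
    foreach (t; workers)
        t.join();
    ev.reset();                         // stays signaled until reset explicitly
    assert(atomicLoad(woken) == 4);
}
```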
Thanks for your pull request, @rainers!
Bugzilla references
Your PR doesn't reference any Bugzilla issue.
If your PR contains non-trivial changes, please reference a Bugzilla issue or create a manual changelog.
⚠️⚠️⚠️ Warnings ⚠️⚠️⚠️
- In preparation for migrating from Bugzilla to GitHub Issues, the issue reference syntax has changed. Please add the word "Bugzilla" to issue references. For example,
`Fix Bugzilla Issue 12345` or `Fix Bugzilla 12345`. (Reminder: the edit needs to be done in the Git commit message, not the GitHub pull request.)
Testing this PR locally
If you don't have a local development environment setup, you can use Digger to test this PR:
dub run digger -- build "master + dmd#21428"
Too late/risky for stable?
So what I don't like about this fix is that once a GC scan is done and evDone is set, the threads will go through loops of starting and stopping until the main thread resets the event. In the new GC I'm using a manual-reset event, but it is set when data is in the queue and reset when the queue is empty. This allows me to use it as a synchronization primitive for waiting for data to appear.
If I don't misunderstand the code, all the threads set the done event. It's also an auto-reset event. The whole mechanism seems very weird to me.
FWIW, this is what I'm doing in the new GC: https://github.com/symmetryinvestments/symgc/blob/4989a9c2307fb3193fd05de28a9c0beb8919f584/source/d/gc/scanner.d#L19
The evDone event is only used for a controlled shutdown of the GC. At the end of the scan, when the scanStack runs dry, evDone.wait is used as a one-millisecond "sleep". I guess this can be handled more efficiently, but I'm not sure whether always signaling on empty/non-empty transitions is more efficient in all situations, e.g. when a long chain of nodes in a list results in only a few entries being on the stack at a time. A small amount of work might be handled more efficiently in a single thread.
Anyway, I think that is a separate issue from the threads not running at all.
> The evDone event is only used for a controlled shutdown of the GC.
It's set by every background thread whenever the thread has exited the scan section.
https://github.com/dlang/dmd/blob/8762d1eaee42119269b82a1b1b7063c89c1e6a69/druntime/src/core/internal/gc/impl/conservative/gc.d#L3736
My understanding is that evDone is there to get all the threads back to sleeping on the evStart event.
I'm also concerned about the way busyThreads is managed. I see a potential case where there is data left to scan, but a perfect storm happens: all the threads become unbusy even though that data is still there, and the threads exit.
For instance, if you inserted a sleep here, might it be more likely to expose this race?
The whole thing seems very wrong to me.
In any case, another issue for another time. Clearly it doesn't happen very much, or at all, as missing a scan of data could be disastrous for the GC.
> My understanding is that evDone is there to get all the threads back to sleeping on the evStart event.
Yes, it stops the eventual wait(1ms) while the stack is empty but other threads are still scanning. This currently happens one at a time (the first thread to notice that there is no more work wakes up the second, and so on), so it is potentially not optimal (manual reset might help), but the main thread only waits for this when shutting down the GC. Note that the mark function also runs until it cannot pop anything from the stack anymore.
> I'm also concerned about the way busyThreads is managed.
I think it is ok, but probably too complicated. I can try a simpler approach in another PR.
Hold off merging; I have run our benchmarks, and these are the results:
MIN conalloc 0.752 s, 5 MB, 1965 GC 226 ms, Pauses 223 ms < 0 ms
MIN conappend 0.182 s, 5 MB, 445 GC 51 ms, Pauses 49 ms < 0 ms
MIN concpu 0.734 s, 5 MB, 2251 GC 237 ms, Pauses 234 ms < 0 ms
MIN conmsg 1.300 s, 12 MB, 121 GC 22 ms, Pauses 16 ms < 0 ms
MIN dlist 1.109 s, 22 MB, 51 GC 86 ms, Pauses 84 ms < 2 ms
MIN huge_single 0.025 s, 1501 MB, 2 GC 0 ms, Pauses 0 ms < 0 ms
MIN rand_large 1.652 s, 165 MB, 8476 GC 440 ms, Pauses 194 ms < 0 ms
MIN rand_small 3.422 s, 22 MB, 3596 GC 465 ms, Pauses 210 ms < 0 ms
MIN slist 1.082 s, 35 MB, 19 GC 48 ms, Pauses 46 ms < 3 ms
MIN testgc3 1.603 s, 330 MB, 11 GC 143 ms, Pauses 139 ms < 36 ms
MIN tree1 0.347 s, 12 MB, 34 GC 48 ms, Pauses 47 ms < 2 ms
MIN tree2 0.514 s, 1 MB, 216 GC 9 ms, Pauses 7 ms < 0 ms
MIN vdparser 1.267 s, 70 MB, 13 GC 32 ms, Pauses 27 ms < 3 ms
MIN words 0.721 s, 396 MB, 5 GC 11 ms, Pauses 9 ms < 7 ms
while these are the results with the automatic reset (scanning only with 2 threads):
MIN conalloc 0.651 s, 5 MB, 1723 GC 158 ms, Pauses 156 ms < 0 ms
MIN conappend 0.169 s, 5 MB, 358 GC 37 ms, Pauses 36 ms < 0 ms
MIN concpu 0.679 s, 5 MB, 2100 GC 190 ms, Pauses 188 ms < 0 ms
MIN conmsg 1.294 s, 12 MB, 64 GC 11 ms, Pauses 7 ms < 0 ms
MIN dlist 1.111 s, 22 MB, 51 GC 84 ms, Pauses 81 ms < 2 ms
MIN huge_single 0.023 s, 1501 MB, 2 GC 0 ms, Pauses 0 ms < 0 ms
MIN rand_large 1.645 s, 162 MB, 8805 GC 416 ms, Pauses 180 ms < 0 ms
MIN rand_small 3.299 s, 22 MB, 3596 GC 429 ms, Pauses 186 ms < 0 ms
MIN slist 1.071 s, 35 MB, 19 GC 47 ms, Pauses 46 ms < 3 ms
MIN testgc3 1.619 s, 330 MB, 11 GC 116 ms, Pauses 112 ms < 33 ms
MIN tree1 0.338 s, 12 MB, 34 GC 34 ms, Pauses 33 ms < 1 ms
MIN tree2 0.499 s, 1 MB, 216 GC 4 ms, Pauses 3 ms < 0 ms
MIN vdparser 1.313 s, 70 MB, 13 GC 82 ms, Pauses 79 ms < 13 ms
MIN words 0.714 s, 396 MB, 5 GC 10 ms, Pauses 8 ms < 7 ms
Somehow overall pause time seems to have increased. I suspect the additional evDone sleeps for more threads are causing the slowdown. I'll have to check later...
I think it might be better to just use an event for synchronization, instead of the 1ms sleeps. This will wake up a lot of threads unnecessarily.
Honestly, I don't get the sync setup here. I would think just one event is needed, and all it does is say "there is stuff ready to scan", and it's set when you add something to the list, and reset when you remove the last thing from the list. Just that event and the stopGC flag I think would be enough?
Note that an auto-reset event might actually be appropriate in that case! The thread that wakes up can re-trigger it if there's still data left.
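A minimal, self-contained sketch of that one-event scheme (hypothetical names such as WorkQueue and dataReady; this is not the druntime scan stack): the manual-reset event is set when work is pushed and reset when the last entry is popped, and a stop flag handles shutdown.

```d
// Illustrative sketch of "one event that says there is stuff ready to scan".
import core.atomic : atomicLoad, atomicOp, atomicStore;
import core.sync.event : Event;
import core.sync.mutex : Mutex;
import core.thread : Thread;

struct WorkQueue
{
    private int[] items;
    private Mutex mtx;
    Event dataReady;      // manual reset: stays set while the queue is non-empty
    shared bool stop;

    void initialize()
    {
        mtx = new Mutex;
        dataReady.initialize(true /* manualReset */, false /* initialState */);
    }

    void push(int item)
    {
        mtx.lock();
        scope (exit) mtx.unlock();
        items ~= item;
        dataReady.set();              // "there is stuff ready to scan"
    }

    bool pop(out int item)
    {
        mtx.lock();
        scope (exit) mtx.unlock();
        if (items.length == 0)
            return false;
        item = items[$ - 1];
        items = items[0 .. $ - 1];
        if (items.length == 0)
            dataReady.reset();        // removed the last entry
        return true;
    }
}

void worker(ref WorkQueue q, ref shared int done)
{
    while (!atomicLoad(q.stop))
    {
        q.dataReady.wait();           // sleeps while the queue is empty
        int item;
        while (q.pop(item))
            atomicOp!"+="(done, 1);   // stand-in for scanning the popped range
    }
}

void main()
{
    WorkQueue q;
    q.initialize();
    shared int done;

    Thread[] threads;
    foreach (i; 0 .. 2)
        threads ~= new Thread({ worker(q, done); }).start();

    foreach (i; 0 .. 1000)
        q.push(i);

    while (atomicLoad(done) < 1000)
        Thread.yield();

    atomicStore(q.stop, true);
    q.dataReady.set();                // wake everyone so they can observe stop
    foreach (t; threads)
        t.join();
}
```

With the manual-reset event, any set() while the queue is non-empty can release every waiting worker at once; the auto-reset variant sketched further down wakes them one at a time instead.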
> I would think just one event is needed, and all it does is say "there is stuff ready to scan", and it's set when you add something to the list, and reset when you remove the last thing from the list. Just that event and the stopGC flag I think would be enough?
I implemented that, but results for short GC cycles get worse:
MIN conalloc 0.863 s, 5 MB, 3356 GC 354 ms, Pauses 350 ms < 0 ms
MIN conappend 0.202 s, 5 MB, 590 GC 72 ms, Pauses 70 ms < 0 ms
MIN concpu 0.883 s, 5 MB, 3193 GC 364 ms, Pauses 359 ms < 0 ms
MIN conmsg 1.357 s, 12 MB, 79 GC 17 ms, Pauses 13 ms < 0 ms
MIN dlist 1.128 s, 22 MB, 51 GC 91 ms, Pauses 89 ms < 2 ms
MIN huge_single 0.025 s, 1501 MB, 2 GC 0 ms, Pauses 0 ms < 0 ms
MIN rand_large 1.705 s, 165 MB, 6884 GC 593 ms, Pauses 281 ms < 0 ms
MIN rand_small 4.133 s, 22 MB, 3596 GC 555 ms, Pauses 230 ms < 0 ms
MIN slist 1.056 s, 35 MB, 19 GC 45 ms, Pauses 44 ms < 2 ms
MIN testgc3 1.508 s, 330 MB, 11 GC 45 ms, Pauses 42 ms < 9 ms
MIN tree1 0.334 s, 12 MB, 34 GC 33 ms, Pauses 31 ms < 1 ms
MIN tree2 0.727 s, 1 MB, 216 GC 5 ms, Pauses 4 ms < 0 ms
MIN vdparser 1.276 s, 70 MB, 13 GC 29 ms, Pauses 24 ms < 3 ms
MIN words 0.721 s, 396 MB, 5 GC 10 ms, Pauses 8 ms < 7 ms
The benchmarks are not very reproducible, as can be seen from the differing numbers of GC invocations, but the trend seems to be: tests that run only a few but longer GC cycles (testgc3, vdparser) benefit from the parallel scanning (16 threads in my case), while others with many but shorter collections take longer overall. I suspect that comes from the synchronization overhead.
> Note that an auto-reset event might actually be appropriate in that case! The thread that wakes up can re-trigger it if there's still data left.
Indeed, maybe too many threads are fighting over the next stack entry.
Switched back to master, as it might not come without side-effects. I also pushed the simplified event handling.
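To illustrate the re-trigger idea, here is a variation of the hypothetical WorkQueue sketch above: dataReady is created as an auto-reset event (initialize(false, false)), and pop re-arms it whenever entries remain, so each push wakes at most one additional worker instead of the whole pool. (Shutdown then needs one set() per worker, or a separate stop event.)

```d
// Variation of the hypothetical WorkQueue above, with
// dataReady.initialize(false /* manualReset */, false).
bool pop(out int item)
{
    mtx.lock();
    scope (exit) mtx.unlock();
    if (items.length == 0)
        return false;
    item = items[$ - 1];
    items = items[0 .. $ - 1];
    if (items.length > 0)
        dataReady.set();    // re-arm: hand the baton to exactly one more worker
    return true;
}
```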
> Note that an auto-reset event might actually be appropriate in that case! The thread that wakes up can re-trigger it if there's still data left.
> Indeed, maybe too many threads are fighting over the next stack entry.
Implemented that, the results improve quite a bit:
MIN conalloc 0.775 s, 5 MB, 2963 GC 278 ms, Pauses 275 ms < 0 ms
MIN conappend 0.196 s, 5 MB, 758 GC 73 ms, Pauses 71 ms < 0 ms
MIN concpu 0.795 s, 5 MB, 3548 GC 305 ms, Pauses 302 ms < 0 ms
MIN conmsg 1.267 s, 12 MB, 58 GC 10 ms, Pauses 6 ms < 1 ms
MIN dlist 1.088 s, 22 MB, 51 GC 69 ms, Pauses 67 ms < 1 ms
MIN huge_single 0.024 s, 1501 MB, 2 GC 0 ms, Pauses 0 ms < 0 ms
MIN rand_large 1.578 s, 162 MB, 8737 GC 439 ms, Pauses 200 ms < 0 ms
MIN rand_small 3.284 s, 22 MB, 3596 GC 393 ms, Pauses 169 ms < 0 ms
MIN slist 1.075 s, 35 MB, 19 GC 48 ms, Pauses 47 ms < 3 ms
MIN testgc3 1.524 s, 330 MB, 11 GC 42 ms, Pauses 38 ms < 8 ms
MIN tree1 0.349 s, 12 MB, 85 GC 58 ms, Pauses 56 ms < 1 ms
MIN tree2 0.499 s, 1 MB, 216 GC 4 ms, Pauses 3 ms < 0 ms
MIN vdparser 1.253 s, 70 MB, 13 GC 26 ms, Pauses 22 ms < 2 ms
MIN words 0.722 s, 396 MB, 5 GC 10 ms, Pauses 9 ms < 7 ms
It doesn't seem to harm the short GC cycles anymore, but the longer ones benefit from parallelism. BTW: do you have a more reliable set of benchmarks?
I don't have any benchmarks that are public. The fluctuation is pretty high! For instance, you had conalloc going from 1723 cycles to 3356 cycles. How is that possible? If I read the code correctly, it's not even allocating that many times? 2500 allocations?
Ohh! There is this loop: https://github.com/dlang/dmd/blob/8762d1eaee42119269b82a1b1b7063c89c1e6a69/druntime/benchmark/gcbench/conalloc.d#L39-L44
So this is going to be a huge wildcard. Why do we have this loop in there?
> So this is going to be a huge wildcard. Why do we have this loop in there?
I guess to get at least some GC activity. If I disable the loop, I get
MIN conalloc 0.469 s, 5 MB, 1 GC 0 ms, Pauses 0 ms < 0 ms
MIN conappend 0.122 s, 5 MB, 40 GC 6 ms, Pauses 5 ms < 0 ms
MIN concpu 0.469 s, 5 MB, 1 GC 0 ms, Pauses 0 ms < 0 ms
I remember that I used to mostly ignore these concurrency tests as they yield inconsistent results.
Maybe up the N value to get more activity? Or limit the number of times this loop can run?
In any case, I don't think we can trust the total time numbers there as it will fluctuate with the number of times that loop can trigger a GC, but maybe dividing the time by the number of GC cycles is a reasonable measurement?
Doing a re-read of the whole parallel section, I think it looks a lot better. I think the busyThreads race is gone now, and the event for data ready looks much cleaner.
I have a bit of concern about reading the length as 0 without atomics, and I think the evDone event can just be removed.
> Maybe up the N value to get more activity? Or limit the number of times this loop can run?
> In any case, I don't think we can trust the total time numbers there as it will fluctuate with the number of times that loop can trigger a GC, but maybe dividing the time by the number of GC cycles is a reasonable measurement?
I think these tests are too CPU-heavy and don't really test the GC. Increasing N will only make the execution time explode, with only a fraction of it spent in the GC, so we should just ignore them for now.
I made the "ultimate" test: building all the Phobos unittests with all files on a single command line. Here are the results with different GC options:
| command | process mem | time (min:sec) | GC statistics |
| --- | --- | --- | --- |
| dmd | 19.4 GB | 1:58.830 | GC memory: 10187 MB, 1 GC 3212 ms, Pauses 1772 ms < 1772 ms |
| dmd -lowmem parallel:0 | 12.3 GB | 2:50.590 | GC memory: 7883 MB, 18 GC 46983 ms, Pauses 45594 ms < 7100 ms |
| dmd -lowmem parallel:1 | 12.3 GB | 2:31.680 | GC memory: 7883 MB, 18 GC 28956 ms, Pauses 27538 ms < 4231 ms |
| dmd -lowmem parallel:4 | 12.3 GB | 2:18.060 | GC memory: 7883 MB, 18 GC 17524 ms, Pauses 16081 ms < 2404 ms |
| dmd -lowmem parallel:7 | 12.3 GB | 2:17.120 | GC memory: 7883 MB, 18 GC 12620 ms, Pauses 11117 ms < 1585 ms |
| dmd -lowmem parallel:11 | 12.3 GB | 2:10.860 | GC memory: 7883 MB, 18 GC 11585 ms, Pauses 10079 ms < 1409 ms |
| dmd -lowmem parallel:15 | 12.3 GB | 2:12.030 | GC memory: 7883 MB, 18 GC 11766 ms, Pauses 10310 ms < 1433 ms |
| dmd -lowmem disable:1 | 20.0 GB | 2:17.370 | GC memory: 15435 MB, 1 GC 5564 ms, Pauses 4770 ms < 4770 ms |
| dmd -lowmem gc:precise | 12.4 GB | 2:21.620 | GC memory: 7883 MB, 18 GC 12570 ms, Pauses 11201 ms < 1563 ms |
| dmd BumpGC | 30.0 GB | 2:01.140 | |
| dmd cleanup:none | 19.4 GB | 1:55.280 | GC memory: 10187 MB, 0 GC 0 ms, Pauses 0 ms < 0 ms |
- dmd is compiled from this branch, using a host dmd that was also built from this branch.
- this is run on a mobile AMD Ryzen 7 8845HS with 8 cores + HT and 64 GB RAM. The latter causes the GC to run rather seldom.
- measurements can vary by a couple of seconds
- parallel:1 should be similar to the currently released version
- GC and pause times scale rather well up to parallel:7 (8 threads), though a bit more slowly than 1/threads. Beyond that there is not much improvement, probably because hyper-threads are not full cores.
- even though dmd is meant to use the bump allocator by default, it still allocates half of its memory through the (disabled) GC. This is because the lowered druntime calls (e.g. for new class) no longer use the callbacks that dmd hooks into.
- "BumpGC" selects a GC that just forwards malloc to the bump allocator. It is not faster than the version using the GC for half the allocations, but uses about 10 GB more memory. I'm not sure why that is; I don't think there are that many array operations that might reallocate instead of appending.
- cleanup:none skips the final collection before shutdown, which is probably also the longest one; a simple win of up to 2% of execution time
> I have a bit of concern about reading the length as 0 without atomics, and I think the evDone event can just be removed.
I think the check while the lock is held is good enough, too. That should cause no big penalty as the thread will wait for the next evStackFilled event anyway if the stack is actually empty.
> - even though dmd is meant to use the bump allocator by default, it still allocates half of its memory through the (disabled) GC. This is because the lowered druntime calls (e.g. for new class) no longer use the callbacks that dmd hooks into.
In a recent DLF meeting, we talked about switching the bump GC to be a full GC interface, so we don't have to hook any of these things, just change the underlying allocator. Really, a bump allocator is a GC, just garbage is collected at the end of the process.
> In a recent DLF meeting, we talked about switching the bump GC to be a full GC interface, so we don't have to hook any of these things, just change the underlying allocator. Really, a bump allocator is a GC, just garbage is collected at the end of the process.
Some GC operations, like GC.realloc, are not easily implemented with a bump allocator if you don't keep track of allocation sizes. D arrays can also use huge amounts of memory if they cannot query allocation sizes. Having to support this reduces the benefits of the bump allocator. (I don't know why the test above worked with the "BumpGC"; later it crashed due to the realloc issue.)
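As a toy illustration of that realloc problem (this is not dmd's actual bump allocator): without a per-allocation size header or side table, a bump allocator cannot implement a GC-style realloc(p, newSize), because it doesn't know how many bytes to preserve; the caller has to pass the old size, or the allocator has to store sizes after all.

```d
// Toy bump allocator: no per-allocation metadata, so sizes are not
// recoverable later and "free" only happens for everything at once.
import core.stdc.stdlib : malloc;
import core.stdc.string : memcpy;

struct BumpAllocator
{
    ubyte* base;
    size_t used;
    size_t capacity;

    void initialize(size_t cap)
    {
        base = cast(ubyte*) malloc(cap);
        capacity = cap;
    }

    void* allocate(size_t size)
    {
        // round up to 16 bytes; no header is written, so the size is lost
        size = (size + 15) & ~cast(size_t) 15;
        if (used + size > capacity)
            return null;
        auto p = base + used;
        used += size;
        return p;
    }

    void* reallocate(void* p, size_t oldSize, size_t newSize)
    {
        // the caller must supply oldSize because nothing was recorded;
        // a realloc(p, newSize) API cannot be built on top of this without
        // adding size headers or an external size table
        auto q = allocate(newSize);
        if (q !is null && p !is null)
            memcpy(q, p, oldSize < newSize ? oldSize : newSize);
        return q;
    }

    // everything is released at once, e.g. at process shutdown
    void deallocateAll() { used = 0; }
}

void main()
{
    BumpAllocator a;
    a.initialize(1 << 20);
    auto p = a.allocate(100);
    p = a.reallocate(p, 100, 200);   // always copies; cannot extend in place
    assert(p !is null);
}
```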
@schveiguy I have now pulled the locking out of the scan stack, so evStackFilled could be converted back to a manual-reset event and busyThreads is changed under the same lock. It should be very similar to your new GC now.
Ah nice, I see evDone is now only waited on by the main thread. This looks much better. I'm going to give it another once-over.
The tracking of busy threads looks much better, BTW!
thank you @rainers!
Thanks @schveiguy for the reviews.