GC mark-loop rewrite
This PR simplifies the GC mark-loop code. Some benchmark results are shown here: https://hackmd.io/XZDp98lTR3qyYgsWKCTvnQ. I believe these changes could also bring some performance benefits (particularly if combined with prefetching in the mark-loop, cf. Cher, 2004).
Previous work
Since https://github.com/JuliaLang/julia/pull/21590, the GC mark-loop has been implemented by keeping two manually managed stacks, one of which contains iterator states used to keep track of progress within the object currently being marked. As an example, to mark an array we would pop the corresponding iterator state from the stack and iterate over the array until we found an unmarked reference; when we did, we would update the iterator state (to record the index where we left off), "repush" it onto the stack, and proceed with marking the reference we had just found.
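For concreteness, here is a minimal self-contained sketch of such an iterator-state scheme. All names here (`obj_t`, `iter_state_t`, the stack helpers) are hypothetical, illustrative stand-ins, not the actual `gc.c` code:

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct obj {
    bool marked;
    size_t nrefs;
    struct obj **refs;
} obj_t;

typedef struct {
    obj_t *parent;   /* object whose references are being scanned */
    size_t idx;      /* index of the next reference to examine */
} iter_state_t;

#define STACK_CAP 1024  /* overflow checks elided for brevity */
static iter_state_t iter_stack[STACK_CAP];
static size_t iter_top = 0;

static void mark_loop_old(obj_t *root)
{
    root->marked = true;
    iter_stack[iter_top++] = (iter_state_t){ root, 0 };
    while (iter_top > 0) {
        iter_state_t st = iter_stack[--iter_top];  /* pop current state */
        while (st.idx < st.parent->nrefs) {
            obj_t *child = st.parent->refs[st.idx++];
            if (child && !child->marked) {
                child->marked = true;
                iter_stack[iter_top++] = st;  /* "repush": resume here later */
                iter_stack[iter_top++] = (iter_state_t){ child, 0 };  /* descend */
                break;
            }
        }
        /* falling out of the inner loop means this object is fully scanned */
    }
}
```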
This PR
This PR eliminates the need to keep iterator states, by modifying the object graph traversal code. We keep a single stack of `jl_value_t *` pointers currently being processed. To mark an object, we first pop it from the stack, push all of its unmarked references onto the stack, and proceed with marking.
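For comparison, a sketch of the scheme in this PR under the same hypothetical `obj_t` model from the sketch above: objects are marked when pushed, so no resume state has to be kept:

```c
/* Reuses the hypothetical obj_t from the earlier sketch. */
#define QUEUE_CAP 1024  /* overflow checks elided for brevity */
static obj_t *mark_queue[QUEUE_CAP];
static size_t mark_top = 0;

static void mark_loop_new(obj_t *root)
{
    root->marked = true;
    mark_queue[mark_top++] = root;
    while (mark_top > 0) {
        obj_t *obj = mark_queue[--mark_top];  /* pop the next object */
        for (size_t i = 0; i < obj->nrefs; i++) {
            obj_t *child = obj->refs[i];
            if (child && !child->marked) {
                child->marked = true;            /* mark when pushed... */
                mark_queue[mark_top++] = child;  /* ...scan when popped */
            }
        }
    }
}
```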
I believe this doesn't break any invariant of the generational GC. Indeed, the age bits are set after marking (if an object survived one GC cycle it's moved to the old generation), so this new traversal scheme doesn't change whether an object is considered to have references to old objects or not. Furthermore, we must not update GC metadata for objects in the remset, and we ensure this by calling `gc_mark_outrefs` in `gc_queue_remset` with `meta_updated` set to 1.
Additional advantages
- There are no recursive function calls in the GC mark-loop code (one of the reasons why https://github.com/JuliaLang/julia/pull/21590 was implemented).
- Keeping a single GC queue will greatly simplify work-stealing in the multi-threaded GC we are working on (c.f. https://github.com/JuliaLang/julia/pull/45639).
- Arrays of references, for example, are now marked in a regular-stride fashion, which could help with hardware prefetching.
- We can easily change the traversal order (to breadth-first, for example) by modifying only the `jl_gc_markqueue_t` methods (e.g. from LIFO to FIFO) without touching the mark-loop itself, which could enable further experimentation with the GC in the future; see the sketch after this list.
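To illustrate that last point, here is a hypothetical queue interface (not Julia's actual `jl_gc_markqueue_t` API), reusing the `obj_t` from the sketches above. The mark-loop only ever calls push and pop, so swapping the pop policy changes the traversal order without touching the loop itself:

```c
typedef struct {
    obj_t **buf;
    size_t head, tail, cap;  /* indices grow monotonically; buf is a ring;
                                overflow/wraparound checks elided */
} markqueue_t;

static void mq_push(markqueue_t *q, obj_t *o)
{
    q->buf[q->tail++ % q->cap] = o;
}

/* LIFO pop (take the newest entry): depth-first-flavored traversal. */
static obj_t *mq_pop_lifo(markqueue_t *q)
{
    return q->head == q->tail ? NULL : q->buf[--q->tail % q->cap];
}

/* FIFO pop (take the oldest entry): breadth-first-flavored traversal. */
static obj_t *mq_pop_fifo(markqueue_t *q)
{
    return q->head == q->tail ? NULL : q->buf[q->head++ % q->cap];
}
```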
TODO
- [x] Fix GC debugging infrastructure.
- [x] Run more benchmarks.
- [x] Test whether prefetching in the mark-loop (cf. Cher, 2004) would improve GC times.
CC: @vchuravy, @chflood, @kpamnany
> To mark an object, we first pop it from the stack, push all of its unmarked references onto the stack, and proceed with marking.
So IIUC this changed from a depth-first search (DFS) to a breadth-first search (BFS). The main reason the marking is done the current way was @vtjnash's concern about memory consumption. It might not be trivial to measure, but I assume it mainly affects cases where there are complex objects and large arrays (i.e. things with lots of objects).
If your work-list is a stack, it's still DFS.
But yes, this could increase memory consumption of the GC queues in the case you mentioned above of very large arrays (and a single-threaded collector). I'm not sure the memory tradeoff is that clear-cut for a parallel GC (some work-stealing queues are implemented as circular buffers).
BFS was the wrong name, I guess, but this is the order I was trying to avoid in the current version, and the only reason it's done this way (since just collecting all the references is of course the more natural and easier way to do it). Strictly maintaining the order makes sure this is not going to be any more of an issue than in the version it replaced.
I'm not sure how large the impact would be (the queue should grow with the number of references, though the memory per entry is smaller), since I don't know the average number of references per object in a typical workload. It is quite important to make sure the single-threaded case is not affected by much, especially in any worst cases...
I believe this is a good change. We probably want to add some special handling code for chunking big arrays at some point. I ran this, and on my machine list.jl got significantly worse:

| test = "list.jl" | run time (min / median / max) | GC time (min / median / max) |
| --- | --- | --- |
| Before | 4829 ms / 6081 ms / 7394 ms | 3060 ms / 4337 ms / 5584 ms |
| After | 5855 ms / 7871 ms / 8779 ms | 4022 ms / 6006 ms / 6953 ms |
The example builds a linked list, which should behave similarly under either strategy, so I can't really explain the results I'm seeing.
Regardless, I don't think this is a show stopper, and I believe the code simplification of not managing shadow queues is a big win.
> Regardless, I don't think this is a show stopper, and I believe the code simplification of not managing shadow queues is a big win.
We should understand the performance better first.
> Test whether prefetching in the mark-loop (cf. Cher, 2004) would improve GC times.
The issue is that you need to read the tag whenever you enqueue an object, so you incur a cache miss at that point, and prefetching objects when you enqueue them doesn't make sense.
Maybe we could change the order in which tags are read so that we could enable prefetching in the future, but I think that would be another PR...
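For context, this is roughly the software-prefetch scheme from Cher et al. (2004) that the TODO refers to: route just-popped pointers through a small FIFO and prefetch on insertion, so the miss is (hopefully) resolved by the time the object is scanned. As noted above, it only helps if the tag isn't already read at enqueue time. A hypothetical sketch reusing `obj_t` from the earlier sketches, not code from this PR:

```c
#define PF_DEPTH 8  /* hypothetical prefetch distance */
static obj_t *pf_fifo[PF_DEPTH];
static size_t pf_head = 0, pf_count = 0;

/* Feed a just-popped object in; get back an object whose prefetch has
 * had PF_DEPTH iterations to complete (or NULL while the FIFO fills). */
static obj_t *prefetch_rotate(obj_t *incoming)
{
    __builtin_prefetch(incoming, 0 /* read */, 3 /* high locality */);
    if (pf_count < PF_DEPTH) {
        pf_fifo[(pf_head + pf_count++) % PF_DEPTH] = incoming;
        return NULL;  /* FIFO still filling: nothing ready to scan yet */
    }
    obj_t *ready = pf_fifo[pf_head];  /* oldest entry: prefetch done by now */
    pf_fifo[pf_head] = incoming;
    pf_head = (pf_head + 1) % PF_DEPTH;
    return ready;
}
```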
The TODO list should be complete now. I updated the link at the top, which now contains benchmark results for one of the latest commits.
Interestingly, list.jl had opposite trends on AMD and Intel machines, though it's not clear to me why.
I also ran this on https://github.com/h2oai/db-benchmark and found that queue sizes didn't go beyond 2^17 there (I read somewhere that this is the default capacity in the JVM).
Both modes of traversal (master and this PR) have adversarial examples in which the queues can get very large. I believe we should be asking whether such pathological examples are common enough to justify batching/chunking.
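To make the batching/chunking idea concrete, here is a hypothetical sketch (none of these names exist in `gc.c`; `chunkqueue_t` and its push are assumed helpers) of scanning a large array in bounded slices, again reusing the `obj_t` model from above:

```c
/* Hypothetical work item covering a bounded slice of a large array. */
typedef struct {
    obj_t **elements;
    size_t start, end;  /* scan refs in the half-open range [start, end) */
} chunk_t;

typedef struct chunkqueue chunkqueue_t;           /* hypothetical queue type */
void chunkqueue_push(chunkqueue_t *q, chunk_t c); /* hypothetical */

#define CHUNK_SIZE 4096

static void scan_array_chunked(chunkqueue_t *q, chunk_t c)
{
    if (c.end - c.start > CHUNK_SIZE) {
        /* Re-queue the remainder as a single work item and only scan the
         * first CHUNK_SIZE elements now: queue growth stays O(1) per array,
         * and the remainder could be stolen in a parallel GC. */
        chunkqueue_push(q, (chunk_t){ c.elements, c.start + CHUNK_SIZE, c.end });
        c.end = c.start + CHUNK_SIZE;
    }
    for (size_t i = c.start; i < c.end; i++) {
        obj_t *child = c.elements[i];
        if (child && !child->marked) {
            child->marked = true;
            /* enqueue child for scanning (elided) */
        }
    }
}
```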
@nanosoldier runtests(ALL, vs = ":master", buildflags=["LLVM_ASSERTIONS=1", "FORCE_ASSERTIONS=1"], vs_buildflags=["LLVM_ASSERTIONS=1", "FORCE_ASSERTIONS=1"])
Your package evaluation job has completed - possible new issues were detected. A full report can be found here.
@nanosoldier runtests(["BrainFlow", "CUDD", "CVRPLIB", "ControlSystemIdentification", "Downloads", "DynamicalSystemsBase", "EarlyStopping", "GDAL", "Hashpipe", "IndependentComponentAnalysis", "IntensityScans", "Manopt", "MusicManipulations", "NavAbilitySDK", "Nonconvex", "NumericalEFT", "OteraEngine", "PlantHydraulics", "ProgressiveHedging", "PyMNE", "QuadEig", "QuantumTomography", "Reactive", "ShipMMG", "StateSpaceModels", "TopologyPreprocessing", "Wandb"], vs = ":master", buildflags=["LLVM_ASSERTIONS=1", "FORCE_ASSERTIONS=1"], vs_buildflags=["LLVM_ASSERTIONS=1", "FORCE_ASSERTIONS=1"])
https://s3.amazonaws.com/julialang-reports/nanosoldier/pkgeval/by_hash/8dae1bb_vs_d6c5cb6/ControlSystemIdentification.primary.log is https://github.com/JuliaLang/julia/issues/45444
I haven't seen https://s3.amazonaws.com/julialang-reports/nanosoldier/pkgeval/by_hash/8dae1bb_vs_d6c5cb6/PlantHydraulics.primary.log before
Your package evaluation job has completed - possible new issues were detected. A full report can be found here.
@nanosoldier runbenchmarks(!"scalar", vs=":master")
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.
> I haven't seen https://s3.amazonaws.com/julialang-reports/nanosoldier/pkgeval/by_hash/8dae1bb_vs_d6c5cb6/PlantHydraulics.primary.log before
Lots of build/test failures went away after merging master. Perhaps we should nanosoldier it again?
@nanosoldier runtests(ALL, vs = ":master", buildflags=["LLVM_ASSERTIONS=1", "FORCE_ASSERTIONS=1"], vs_buildflags=["LLVM_ASSERTIONS=1", "FORCE_ASSERTIONS=1"])
Your package evaluation job has completed - possible new issues were detected. A full report can be found here.
@yuyichao, it seems that the `remset` is flushed at the beginning of every collection and reconstructed as the graph is traversed. Do you know what the motivation behind this is, instead of just flushing the `remset` on full collections?
I tried to flush the `remset` only on full collections and never re-add objects to it during the traversal (e.g. by commenting out the calls to the `push_remset` functions), but this leads to a segfault. Do you know why this could be the case (EDIT: I can include a backtrace if it helps)?
My only guess is that there could be old->young references that are somehow not recorded by the write barrier, but this doesn't seem right...
> Do you know what the motivation behind this is, instead of just flushing the `remset` on full collections?
I think that would be too inaccurate. Promotion happens on generational collections as well, and not resetting the remset would mean that any newly promoted old object has to wait until the next full collection for the promotion to actually become beneficial.
> commented out the calls to the `push_remset` functions
I don't think this would work directly, at the least (assuming you've got all the other potential corner cases right). As mentioned above, promotion happens between full collections, and the marking pass has to discover any new old->young references generated because of this.
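To illustrate the point (hypothetical names: `remset_t`, `remset_push`, and `is_young` are illustrative stand-ins, not Julia's API, and `obj_t` is from the sketches above): when an object is promoted during an incremental collection, its still-young children become old->young edges that no write barrier ever observed, because the stores happened while the parent was still young. The marking pass itself must therefore record them:

```c
typedef struct remset remset_t;                 /* hypothetical */
void remset_push(remset_t *rs, obj_t *parent);  /* hypothetical */
bool is_young(const obj_t *o);                  /* hypothetical */

static void scan_old_object(remset_t *rs, obj_t *parent)
{
    bool has_young_ref = false;
    for (size_t i = 0; i < parent->nrefs; i++) {
        obj_t *child = parent->refs[i];
        if (child && is_young(child))
            has_young_ref = true;  /* edge the write barrier never saw */
    }
    if (has_young_ref)
        remset_push(rs, parent);   /* so the next incremental collection
                                      still scans this object */
}
```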
> As mentioned above, promotion happens between full collections, and the marking pass has to discover any new old->young references generated because of this.
I can only see old->young references being created on a small collection if there is a subset of objects (EDIT: of the young gen) that is promoted and another that's not. Is that the case?
> I can only see old->young references being created on a small collection if there is a subset of objects that is promoted and another that's not. Is that the case?
I assume you are asking whether the GC will promote a non-trivial subset of the live objects to old? Yes: promotion is independent for different objects, and since it takes more than one cycle to promote, the objects won't be in sync. See the state transition diagram in the comments in `gc.c`.
Superseded by https://github.com/JuliaLang/julia/pull/47292.