
GC mark-loop rewrite

Open d-netto opened this issue 2 years ago • 21 comments

This PR simplifies the GC mark-loop code. Some benchmark results are shown here: https://hackmd.io/XZDp98lTR3qyYgsWKCTvnQ. I believe these changes could bring some performance benefits as well (particularly if used with prefetching in the mark-loop, c.f. Cher 2004).

Previous work

Since https://github.com/JuliaLang/julia/pull/21590, the GC mark-loop was implemented by keeping two manually managed stacks, one of which contained iterator states used to keep track of the object currently being marked. For example, to mark an array we would pop the corresponding iterator state from the stack and iterate over the array until we found an unmarked reference; if we found one, we would update the iterator state (to record the index where we left off), "repush" it onto the stack, and proceed with marking the reference we had just found.
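The iterator-state scheme can be illustrated with a small sketch. This is not Julia's actual gc.c code: `obj_t`, `iter_state_t`, and the fixed-size stack are simplified stand-ins for the real object layout and mark stacks.

```c
#include <assert.h>
#include <stddef.h>

typedef struct obj {
    int marked;
    size_t nfields;
    struct obj **fields;
} obj_t;

typedef struct {
    obj_t *parent;  /* object whose fields are being iterated */
    size_t idx;     /* field index to resume from */
} iter_state_t;

#define STACK_CAP 1024
static iter_state_t iter_stack[STACK_CAP];
static size_t iter_top = 0;

static void push_iter(obj_t *parent, size_t idx) {
    assert(iter_top < STACK_CAP);
    iter_stack[iter_top].parent = parent;
    iter_stack[iter_top].idx = idx;
    iter_top++;
}

/* Mark everything reachable from root, resuming partially scanned
   parents via their saved iterator states (no recursion). */
void mark_with_iter_states(obj_t *root) {
    root->marked = 1;
    push_iter(root, 0);
    while (iter_top > 0) {
        iter_state_t st = iter_stack[--iter_top];        /* pop a state */
        for (size_t i = st.idx; i < st.parent->nfields; i++) {
            obj_t *child = st.parent->fields[i];
            if (child && !child->marked) {
                child->marked = 1;
                /* "repush" the parent with the index we left off at,
                   then descend into the child first */
                push_iter(st.parent, i + 1);
                push_iter(child, 0);
                break;
            }
        }
    }
}
```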

This PR

This PR eliminates the need to keep iterator states by modifying the object-graph traversal code. We keep a single stack of jl_value_t * pointers currently being processed. To mark an object, we first pop it from the stack, push all of its unmarked references onto the stack, and proceed with marking.
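The new scheme can be sketched as follows. Again, this is a toy illustration, not the PR's gc.c code; `obj_t` and the fixed-capacity stack stand in for the real object layout and `jl_gc_markqueue_t`.

```c
#include <assert.h>
#include <stddef.h>

typedef struct obj {
    int marked;
    size_t nfields;
    struct obj **fields;
} obj_t;

#define STACK_CAP 1024

/* Mark everything reachable from root using one stack of plain pointers:
   pop an object, push its unmarked children, repeat. No per-object
   iterator state is kept. */
void mark_single_stack(obj_t *root) {
    obj_t *stack[STACK_CAP];
    size_t top = 0;
    root->marked = 1;
    stack[top++] = root;
    while (top > 0) {
        obj_t *o = stack[--top];              /* pop the next object */
        for (size_t i = 0; i < o->nfields; i++) {
            obj_t *child = o->fields[i];
            if (child && !child->marked) {
                child->marked = 1;            /* mark on push: queued at most once */
                assert(top < STACK_CAP);
                stack[top++] = child;
            }
        }
    }
}
```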

I believe this doesn't break any invariant of the generational GC. Indeed, the age bits are set after marking (if an object survived one GC cycle it's moved to the old generation), so this new traversal scheme doesn't change whether an object is recorded as having references to old objects. Furthermore, we must not update GC metadata for objects in the remset, and we ensure this by calling gc_mark_outrefs in gc_queue_remset with meta_updated set to 1.

Additional advantages

  1. There are no recursive function calls in the GC mark-loop code (one of the reasons why https://github.com/JuliaLang/julia/pull/21590 was implemented).
  2. Keeping a single GC queue will greatly simplify work-stealing in the multi-threaded GC we are working on (c.f. https://github.com/JuliaLang/julia/pull/45639).
  3. Arrays of references, for example, are now marked in a regular-stride fashion, which could help with hardware prefetching.
  4. We can easily change the traversal mode (to breadth-first, for example) by modifying only the jl_gc_markqueue_t methods (e.g. from LIFO to FIFO) without touching the mark-loop itself, which could enable further exploration of the GC in the future.
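A hedged sketch of that last point: if the mark-loop only ever calls push/pop on an opaque queue, swapping which end pop takes from flips the traversal order without touching the loop. The names below are illustrative, not Julia's actual jl_gc_markqueue_t API.

```c
#include <assert.h>
#include <stddef.h>

#define MQ_CAP 1024

/* Toy mark queue: one buffer, two cursors. Popping from the tail gives
   LIFO (depth-first-like) order; popping from the head gives FIFO
   (breadth-first-like) order. The mark-loop itself never changes. */
typedef struct {
    void *buf[MQ_CAP];
    size_t head, tail;
} markqueue_t;

static void mq_push(markqueue_t *q, void *v) {
    assert(q->tail < MQ_CAP);
    q->buf[q->tail++] = v;
}

static void *mq_pop_lifo(markqueue_t *q) {
    return q->head == q->tail ? NULL : q->buf[--q->tail];
}

static void *mq_pop_fifo(markqueue_t *q) {
    return q->head == q->tail ? NULL : q->buf[q->head++];
}
```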

TODO

  • [x] Fix GC debugging infrastructure.
  • [x] Run more benchmarks.
  • [x] Test whether prefetching on the mark-loop (c.f. Cher, 2004) would improve GC times.

CC: @vchuravy, @chflood, @kpamnany

d-netto avatar Jun 07 '22 21:06 d-netto

To mark an object, we first pop it from the stack, push all unmarked references into the stack and proceed with marking.

So IIUC this changed from a depth-first search (DFS) to a breadth-first search (BFS). The main reason marking is done the current way is @vtjnash's concern about memory consumption. It might not be trivial to measure, but I assume it mainly affects cases where there are complex objects and large arrays (i.e. things with lots of objects).

yuyichao avatar Jun 07 '22 21:06 yuyichao

If your work-list is a stack, it's still DFS.

But yes, this could increase memory consumption for the GC queues in the case of very large arrays (and a single threaded collector) you mentioned above. Not sure if the memory tradeoff is that obvious for a parallel GC (some work-stealing queues are implemented as circular buffers).

d-netto avatar Jun 07 '22 22:06 d-netto

BFS was the wrong name, I guess, but this is the order I was trying to avoid in the current version, and the only reason it's done this way (since just collecting all the references is of course the more natural and easier way to do it). Strictly maintaining the order makes sure this is not going to be any more of an issue than in the version it replaced.

I'm not sure how much the impact would be (the queue should be larger by the number of references per object, though of course the memory per entry is less), since I don't know the average number of references per object in a typical workload. It is quite important to make sure the single-threaded case is not affected by much, especially in any worst cases...

yuyichao avatar Jun 08 '22 02:06 yuyichao

I believe this is a good change. We probably want to add some special handling code for chunking big arrays at some point. I ran this, and on my machine list.jl got significantly worse:

Before: test = "list.jl" run time: 4829ms min, 7394ms max, 6081ms median; gc time: 3060ms min, 5584ms max, 4337ms median

After: test = "list.jl" run time: 5855ms min, 8779ms max, 7871ms median; gc time: 4022ms min, 6953ms max, 6006ms median

The example builds a linked list which should behave similarly given either strategy so I can't really explain the results I'm seeing.

Regardless I don't think this is a show stopper and I believe the code simplification of not managing shadow queues is a big win.

chflood avatar Jun 08 '22 20:06 chflood

Regardless I don't think this is a show stopper and I believe the code simplification of not managing shadow queues is a big win.

We should understand the performance better first.

vchuravy avatar Jun 09 '22 13:06 vchuravy

Test whether prefetching on the mark-loop (c.f. Cher, 2004) would improve GC times.

The issue is that you need to read the tag whenever you enqueue an object, so you incur a cache miss there, and prefetching objects when you enqueue them doesn't make sense.

Maybe we could change the order of reading tags so that we could enable prefetching in the future, but I think that would be another PR...
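The buffered-prefetch idea being referenced (Cher et al., 2004) can be sketched as a small FIFO window between dequeueing a pointer and touching it. This is an illustrative toy, not code from this PR; `process_with_prefetch`, `obj_t`, and the window depth are made up, and the prefetch is only a best-effort hint.

```c
#include <stddef.h>

/* Fall back to a no-op hint on compilers without the GCC/Clang builtin. */
#ifndef __GNUC__
#define __builtin_prefetch(p) ((void)(p))
#endif

#define PF_DEPTH 4

typedef struct { int val; } obj_t;

/* Drain n objects through a small FIFO window: prefetch on entry, touch
   the object only when it exits, so the cache line has time to arrive. */
int process_with_prefetch(obj_t **work, size_t n) {
    obj_t *window[PF_DEPTH] = {0};
    size_t slot = 0;
    int sum = 0;
    for (size_t i = 0; i < n + PF_DEPTH; i++) {
        obj_t *out = window[slot];
        if (out)
            sum += out->val;           /* hopefully a cache hit by now */
        obj_t *in = (i < n) ? work[i] : NULL;
        if (in)
            __builtin_prefetch(in);    /* warm the line PF_DEPTH steps early */
        window[slot] = in;
        slot = (slot + 1) % PF_DEPTH;
    }
    return sum;
}
```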

d-netto avatar Jun 24 '22 15:06 d-netto

TODO list should be complete now. Updated link at the top, which contains benchmark results for one of the latest commits.

Interestingly, list.jl had opposite trends on AMD/Intel machines, though it's not clear to me why.

I also ran this on https://github.com/h2oai/db-benchmark and found that queue sizes didn't go beyond 2^17 there (I read somewhere this is the default capacity in the JVM).

Both modes of traversal (master and this PR) have adversarial examples in which the queues can get very large. I believe we should be asking whether these pathological examples are common enough to justify batching/chunking.
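The batching/chunking idea can be sketched as follows (a hypothetical helper, not this PR's code): pushing bounded (lo, hi) ranges instead of whole arrays caps how much scanning any single queue entry represents.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative chunk size; a real collector would tune this. */
#define CHUNK 256

typedef struct { size_t lo, hi; } chunk_t;

/* Split the index range [0, len) into at most CHUNK-sized pieces that
   could be queued independently. Returns the number of chunks written. */
size_t make_chunks(size_t len, chunk_t *out, size_t cap) {
    size_t n = 0;
    for (size_t lo = 0; lo < len; lo += CHUNK) {
        size_t hi = lo + CHUNK < len ? lo + CHUNK : len;
        assert(n < cap);
        out[n].lo = lo;
        out[n].hi = hi;
        n++;
    }
    return n;
}
```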

d-netto avatar Jul 05 '22 23:07 d-netto

@nanosoldier runtests(ALL, vs = ":master", buildflags=["LLVM_ASSERTIONS=1", "FORCE_ASSERTIONS=1"], vs_buildflags=["LLVM_ASSERTIONS=1", "FORCE_ASSERTIONS=1"])

vchuravy avatar Jul 06 '22 01:07 vchuravy

Your package evaluation job has completed - possible new issues were detected. A full report can be found here.

nanosoldier avatar Jul 06 '22 13:07 nanosoldier

@nanosoldier runtests(["BrainFlow", "CUDD", "CVRPLIB", "ControlSystemIdentification", "Downloads", "DynamicalSystemsBase", "EarlyStopping", "GDAL", "Hashpipe", "IndependentComponentAnalysis", "IntensityScans", "Manopt", "MusicManipulations", "NavAbilitySDK", "Nonconvex", "NumericalEFT", "OteraEngine", "PlantHydraulics", "ProgressiveHedging", "PyMNE", "QuadEig", "QuantumTomography", "Reactive", "ShipMMG", "StateSpaceModels", "TopologyPreprocessing", "Wandb"], vs = ":master", buildflags=["LLVM_ASSERTIONS=1", "FORCE_ASSERTIONS=1"], vs_buildflags=["LLVM_ASSERTIONS=1", "FORCE_ASSERTIONS=1"])

maleadt avatar Jul 06 '22 14:07 maleadt

https://s3.amazonaws.com/julialang-reports/nanosoldier/pkgeval/by_hash/8dae1bb_vs_d6c5cb6/ControlSystemIdentification.primary.log is https://github.com/JuliaLang/julia/issues/45444

I haven't seen https://s3.amazonaws.com/julialang-reports/nanosoldier/pkgeval/by_hash/8dae1bb_vs_d6c5cb6/PlantHydraulics.primary.log before

vchuravy avatar Jul 06 '22 15:07 vchuravy

Your package evaluation job has completed - possible new issues were detected. A full report can be found here.

nanosoldier avatar Jul 07 '22 00:07 nanosoldier

@nanosoldier runbenchmarks(!"scalar", vs=":master")

vchuravy avatar Jul 08 '22 00:07 vchuravy

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

nanosoldier avatar Jul 08 '22 06:07 nanosoldier

I haven't seen https://s3.amazonaws.com/julialang-reports/nanosoldier/pkgeval/by_hash/8dae1bb_vs_d6c5cb6/PlantHydraulics.primary.log before

A lot of build/test failures went away after merging master. Perhaps run nanosoldier again?

d-netto avatar Jul 12 '22 00:07 d-netto

@nanosoldier runtests(ALL, vs = ":master", buildflags=["LLVM_ASSERTIONS=1", "FORCE_ASSERTIONS=1"], vs_buildflags=["LLVM_ASSERTIONS=1", "FORCE_ASSERTIONS=1"])

maleadt avatar Jul 12 '22 05:07 maleadt

Your package evaluation job has completed - possible new issues were detected. A full report can be found here.

nanosoldier avatar Jul 13 '22 00:07 nanosoldier

@yuyichao, it seems that the remset is flushed at the beginning of every collection and reconstructed as the graph is traversed.

Do you know what is the motivation behind this, instead of just flushing the remset on full collections?

I tried to flush the remset only on full collections and never re-add objects to it during traversal (e.g. commented out the push_remset functions), but this leads to a segfault. Do you know why this could be the case (EDIT: I can include a backtrace if it helps)?

My only guess is that there could be old->young references that are somehow not recorded in the write barrier, but this doesn't seem right...

d-netto avatar Jul 31 '22 16:07 d-netto

Do you know what is the motivation behind this, instead of just flushing the remset on full collections?

I think that would be too inaccurate. Promotion happens on generational collections as well, and not resetting the remset means that any newly promoted old object has to wait until the next full collection for the promotion to actually become beneficial.

commented the push_remset functions

I don't think this would work directly, at the least (assuming you've got all other potential corner cases right). As mentioned above, promotion happens between full collections, and the marking pass has to discover any new old->young references generated because of this.

yuyichao avatar Aug 05 '22 02:08 yuyichao

As mentioned above, promotion happens between full collections, and the marking pass has to discover any new old->young references generated because of this.

I can only see old->young references being created on a small collection if there is a subset of objects (EDIT: young gen) that is promoted and another that's not. Is that the case?

d-netto avatar Aug 05 '22 16:08 d-netto

I can only see old->young references being created on a small collection if there is a subset of objects that is promoted and another that's not. Is that the case?

I assume you are asking whether the GC will promote a non-trivial subset of the live objects to old? Yes, promotion is independent for each object, and since it takes more than one cycle to promote, the objects won't be synchronized. See the state-transition diagram in the comments in gc.c.
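The independent, per-object promotion can be shown with a toy age counter. This is not gc.c's actual state machine; `PROMOTE_AGE` and the `hdr_t` layout are invented for illustration.

```c
#include <assert.h>

/* Illustrative promotion threshold; Julia's real age/state bits differ. */
#define PROMOTE_AGE 2

typedef struct { int age; int old; } hdr_t;

/* One minor collection from this object's point of view: survivors age
   independently and are promoted once they reach the threshold. */
void survive_cycle(hdr_t *h) {
    if (!h->old && ++h->age >= PROMOTE_AGE)
        h->old = 1;
}
```

Two objects allocated one cycle apart end up in different generations after the same collection, so a freshly promoted object can hold an old->young reference that only the write barrier/remset (or the marking pass) can discover.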

yuyichao avatar Aug 05 '22 16:08 yuyichao

Superseded by https://github.com/JuliaLang/julia/pull/47292.

d-netto avatar Oct 22 '22 18:10 d-netto