GCWorker instead of VMWorkerThread
Many internal functions as well as public API functions take `tls: VMWorkerThread` as an argument. Although it was intended to give VM bindings a thread-local context inside those functions, it is cumbersome to use, especially when we need to add work packets.
We can grep the string `tls: VMWorkerThread` to get a full list. Here I give some examples:

- `Plan::prepare(&mut self, tls: VMWorkerThread)`
- `Plan::release(&mut self, tls: VMWorkerThread)`
- `Plan::end_of_gc(&mut self, tls: VMWorkerThread)`
- `xxx_mutator_prepare<VM: VMBinding>(mutator: &mut Mutator<VM>, tls: VMWorkerThread)`
- `xxx_mutator_release<VM: VMBinding>(mutator: &mut Mutator<VM>, tls: VMWorkerThread)`
- `Collection::stop_all_mutators<F>(tls: VMWorkerThread, mutator_visitor: F)`
- `Collection::resume_mutators(tls: VMWorkerThread)`
- `Scanning::scan_object<...>(tls: VMWorkerThread, ...)`
- `Scanning::scan_vm_specific_roots(tls: VMWorkerThread, ...)`
One thing is for sure: those functions are all executed by GC worker threads (with `Scanning::scan_object` probably also being called by mutator threads in write barriers). `tls: VMWorkerThread` is the VM-level thread-local context of a GC worker thread. Admittedly, it is there to allow the VM binding to use such a context.
But the MMTk-level thread-local context of a GC worker thread is the `GCWorker<VM>`, which provides access to GC-related functionality, such as a reference to the scheduler and the ability to add work packets. It often happens that
- Inside MMTk core, some plans want to add work packets in one of those functions. For example, a `ConcurrentImmix` plan may add work packets to clear side metadata in parallel in `ConcurrentImmix::prepare`.
- In VM bindings, some VMs want to add work packets for processing VM-specific data structures. For example, the OpenJDK binding wants to add work packets to scan VM-specific roots and fix nmethod relocations.
Currently, both the VM binding and parts of mmtk-core rely on `memory_manager::add_work_packet(mmtk, bucket, packet)`, which requires the presence of `mmtk: &'static MMTK<VM>`.
- In MMTk core, `ProcessEdgesWorkRootsWorkFactory` holds a reference to `&'static MMTK<VM>`, and it gets the `mmtk` reference from `GCWork::do_work(&mut self, worker, mmtk)`.
- In the OpenJDK binding, `scan_vm_specific_roots` passes the global `SINGLETON` to `memory_manager::add_work_packet(mmtk, bucket, packet)`.
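For concreteness, a minimal sketch of this pattern (the packet type `ScanVMSpecificRoots` and the helper `schedule_roots_scanning` are illustrative names, not actual mmtk-core or binding code; the bucket choice is arbitrary):

```rust
use mmtk::memory_manager;
use mmtk::scheduler::{GCWork, GCWorker, WorkBucketStage};
use mmtk::vm::VMBinding;
use mmtk::MMTK;

// Hypothetical work packet that scans VM-specific roots.
struct ScanVMSpecificRoots;

impl<VM: VMBinding> GCWork<VM> for ScanVMSpecificRoots {
    fn do_work(&mut self, _worker: &mut GCWorker<VM>, _mmtk: &'static MMTK<VM>) {
        // Scan the VM-specific roots here.
    }
}

// To add the packet, the caller must already hold a `&'static MMTK<VM>`,
// e.g. the OpenJDK binding's global `SINGLETON`.
fn schedule_roots_scanning<VM: VMBinding>(mmtk: &'static MMTK<VM>) {
    memory_manager::add_work_packet(mmtk, WorkBucketStage::Prepare, ScanVMSpecificRoots);
}
```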
And when we want to add work packets in `ConcurrentImmix::prepare`, we find that there is no reference to the MMTK instance, and no reference to a `GCWorker<VM>` which contains a reference to the MMTK instance. Fortunately (and weirdly), `ImmixSpace` has a reference to `scheduler: GCWorkScheduler` (but `CopySpace` doesn't, and neither do `LargeObjectSpace`, `ImmortalSpace`, or `VMSpace`). The method `ImmixSpace::scheduler()` was used specifically for adding work packets, working around the limitation.
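For illustration, a sketch of that workaround (the packet and plan types are hypothetical; the `scheduler` field stands in for the reference that `ImmixSpace` holds, and the `work_buckets` indexing mirrors how packets are added inside mmtk-core):

```rust
use std::sync::Arc;
use mmtk::scheduler::{GCWork, GCWorkScheduler, GCWorker, WorkBucketStage};
use mmtk::vm::VMBinding;
use mmtk::MMTK;

// Hypothetical packet that clears a range of side metadata in parallel.
struct ClearSideMetadata;

impl<VM: VMBinding> GCWork<VM> for ClearSideMetadata {
    fn do_work(&mut self, _worker: &mut GCWorker<VM>, _mmtk: &'static MMTK<VM>) {
        // Bulk-zero the relevant side metadata range here.
    }
}

// Stand-in for a plan that can only reach the scheduler through its ImmixSpace.
struct HypotheticalConcurrentImmix<VM: VMBinding> {
    scheduler: Arc<GCWorkScheduler<VM>>, // the reference that ImmixSpace holds
}

impl<VM: VMBinding> HypotheticalConcurrentImmix<VM> {
    fn prepare(&mut self) {
        // No MMTK or GCWorker reference is available here, so the plan reaches
        // through the scheduler that ImmixSpace happens to hold.
        self.scheduler.work_buckets[WorkBucketStage::Prepare].add(ClearSideMetadata);
    }
}
```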
Why was there tls: VMWorkerThread?
As for the API, there has been an assumption since the beginning of the development of Rust MMTk that the VM is usually implemented in a native language, usually not Java. So it made more sense to pass the VM-specific `tls: VMWorkerThread` instead of the MMTk-specific `GCWorker`.
As for the internal parts of mmtk-core, I guess mmtk-core was trying to pass as few parameters as possible between its components. It passed `tls: VMWorkerThread` around just to make the mmtk-core-to-VM-binding calls possible.
But `Plan::prepare` has the `tls` parameter for no reason. It passes it to `CommonPlan::prepare`, then down to `BasePlan::prepare`, where the `_tls` parameter is unused.
What should we do?
All methods of `Plan` that are supposed to be called only by GC worker threads should have `worker: &mut GCWorker<VM>` as the second parameter (after `self`) instead of `tls: VMWorkerThread`.
And remove `ImmixSpace::scheduler()`, which is a workaround.
In API functions, mainly in `Collection` and `Scanning`, pass `worker: GCWorker<VM>` instead of `tls: VMWorkerThread`.
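A rough sketch of what the proposed signatures might look like (hypothetical trait names rather than the existing `Plan`/`Collection` traits; I assume a `&mut` reference on the API side as well, matching the `Plan` methods above):

```rust
use mmtk::scheduler::GCWorker;
use mmtk::vm::VMBinding;

// Methods that only ever run on GC worker threads take the worker directly.
trait ProposedPlanMethods<VM: VMBinding> {
    // Instead of `fn prepare(&mut self, tls: VMWorkerThread)`:
    fn prepare(&mut self, worker: &mut GCWorker<VM>);
    fn release(&mut self, worker: &mut GCWorker<VM>);
    fn end_of_gc(&mut self, worker: &mut GCWorker<VM>);
}

// API methods called on GC worker threads would receive the worker, too.
// A binding that still needs the VM-level context can read `worker.tls`.
trait ProposedCollectionMethods<VM: VMBinding> {
    fn stop_all_mutators<F>(worker: &mut GCWorker<VM>, mutator_visitor: F);
    fn resume_mutators(worker: &mut GCWorker<VM>);
}
```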
We need to think carefully about `Scanning::scan_object` and `Scanning::scan_object_and_trace_edges`. As discussed in https://github.com/mmtk/mmtk-core/issues/1375, we may loosen `tls: VMWorkerThread` to `tls: VMThread`. But that goes against this proposal. Making it `Either<GCWorker<VM>, VMMutatorThread>` is overly complex and may add wrapping overhead, while object scanning is usually part of the hottest tracing loop in MMTk. We may make them an exception and pass `tls: VMThread` instead, requiring the VM binding to check whether it is a mutator thread or a GC worker thread. Alternatively, we add a method `Scanning::scan_object_for_mutator` specifically for this purpose, which was also a proposal in https://github.com/mmtk/mmtk-core/issues/1375.
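To make the alternatives concrete, hypothetical signatures (the thread types come from `mmtk::util`; the visitor parameters are elided, and the parameter type of `scan_object_for_mutator` is an assumption):

```rust
use mmtk::util::{ObjectReference, VMMutatorThread, VMThread};

trait HypotheticalScanning {
    // Alternative 1: loosen the thread type; the binding checks at runtime
    // whether `tls` is a mutator thread or a GC worker thread.
    fn scan_object(tls: VMThread, object: ObjectReference /* , slot visitor... */);

    // Alternative 2: keep the worker-only `scan_object` and add a dedicated
    // mutator-side entry point, as proposed in mmtk/mmtk-core#1375.
    fn scan_object_for_mutator(tls: VMMutatorThread, object: ObjectReference);
}
```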
I think you are mixing concerns:
- When we pass around `tls`, usually there is a call to the binding (or there used to be a call to the binding). If `tls` is unused, possibly the call to the binding was removed and it becomes dead code. We can remove it.
- Both plans and policies may need to access the scheduler. I think they should hold a reference to the scheduler. I recently added a reference to the scheduler to `BasePlan`. Some policies hold a reference to the scheduler, and we could move the reference to the `CommonSpace` as well.
The binding needs tls, and plans and policies in MMTk core need the scheduler. But none of them really need the GC worker. I don't feel passing the GC worker reference around is a good idea.
> If `tls` is unused, possibly the call to the binding was removed and it becomes dead code. We can remove it.
I think that may be true for `CommonPlan`. Currently, other spaces like the LOS don't generate new work packets when preparing, and the `tls` is "dead" in this sense. But as a principle (https://github.com/mmtk/mmtk-core/issues/1147), any significant amount of preparation work should be off-loaded to a work packet. A `&mut GCWorker<VM>` (or a reference to the scheduler, as you suggested) may solve the problem.
> Both plans and policies may need to access the scheduler. I think they should hold a reference to the scheduler.
I feel uncomfortable when many small parts of a large object hold references to the same object. For example, each `Allocator` has a `tls: VMThread` (for mutators and copying GC), but each `Mutator` or `GCWorkerCopyContext` has both a `tls` and many `Allocator` instances embedded in it. I feel this is repetitive. If each space holds a reference to the scheduler, it will be similar.
What I think should happen is that when calling `Plan::prepare`, we pass the `&mut GCWorker<VM>` instance to it. A plan is, as the name suggests, a "plan" for the mutators and the GC workers to manage memory, and `Plan::prepare` is an instruction for the GC worker. So `Plan::prepare` is well-entitled to have `&mut GCWorker<VM>` as a parameter. And it is possible, because when each work packet is executed (`GCWork::do_work`), it receives the `&mut GCWorker<VM>` argument from the calling GC worker thread.
And the `GCWorker` structure allows the program to make use of worker-specific information. The worker ordinal allows it to access data structures indexed by the worker ordinal, such as `crate::util::heap::blockpageresource::BlockPool`. It also allows a function to split its work by the number of GC workers, for the sake of load balancing. I have also been considering `WorkerLocal` data structures for some time (https://github.com/mmtk/mmtk-core/issues/1375). The fact that a thread holds a mutable reference to a GC worker can be used as a key into such a data structure for accessing data without races.
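As a small illustration of the worker-indexed pattern (assuming the worker's `ordinal` field is accessible, as it is in current mmtk-core; the counter structure itself is made up):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use mmtk::scheduler::GCWorker;
use mmtk::vm::VMBinding;

// Hypothetical per-worker counters: one slot per GC worker, indexed by the
// worker ordinal, so workers never contend on the same slot.
struct PerWorkerCounters {
    slots: Vec<AtomicUsize>, // length == number of GC workers
}

impl PerWorkerCounters {
    fn increment<VM: VMBinding>(&self, worker: &GCWorker<VM>) {
        // Relaxed is enough because each worker only ever touches its own slot.
        self.slots[worker.ordinal].fetch_add(1, Ordering::Relaxed);
    }
}
```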
FWIW, I've changed all the `prepare` and `release` methods to use `&mut GCWorker<VM>` in my fork to add support for single-threaded work packets. It's more useful than passing `tls` around. Plus, you can always do `worker.tls` to get the `tls` if you end up needing to call into the binding.
My personal opinion is that we should keep (internal) APIs consistent/mirrored. If release needs a &mut GCWorker<VM> but prepare does not, then we should still keep the parameter because it may be required in the future.
> The `GCWorker` structure allows the program to make use of worker-specific information. The worker ordinal allows it to access data structures indexed by the worker ordinal, such as `crate::util::heap::blockpageresource::BlockPool`. It also allows a function to split its work by the number of GC workers, for the sake of load balancing.
I don't think those are good examples of why plans need `GCWorker`. They sound like a leaky abstraction. Plans just need to produce reasonably sized work packets, and it is the scheduler that takes care of load balancing. A plan should never generate 2 work packets because we only have 2 GC threads, or generate 10 packets for 10 GC threads.
> FWIW, I've changed all the `prepare` and `release` methods to use `&mut GCWorker<VM>` in my fork to add support for single-threaded work packets.
What exactly do you need from `&mut GCWorker` for single-threaded work packets?
Because we don't schedule extra work packets for the single-threaded GC case. You need to pass the GC worker struct to perform the work. So, for example, for Immix we need to clear the mark table in the prepare work bucket. The current code will create and schedule work packets there. But since we don't want to schedule work packets, we just pass the GC worker struct there and directly run those work packets. See here for an example: https://github.com/k-sareen/mmtk-core/commit/9791d9a73fcff055da83fd7b262f332a916ef438#diff-773ce641225add0513dc4e369e385fa3bb5c21e37abe9707e623bd516392205aR480-R509 (the `prepare_blocks` function).
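In other words (a sketch of the idea with a made-up packet type; only `GCWork::do_work` and its arguments are the real API), the packets are constructed as usual but consumed on the spot by the current worker instead of being enqueued:

```rust
use mmtk::scheduler::{GCWork, GCWorker};
use mmtk::vm::VMBinding;
use mmtk::MMTK;

// Hypothetical packet that clears mark state for one chunk of blocks.
struct PrepareBlocksChunk;

impl<VM: VMBinding> GCWork<VM> for PrepareBlocksChunk {
    fn do_work(&mut self, _worker: &mut GCWorker<VM>, _mmtk: &'static MMTK<VM>) {
        // Clear the mark tables for this chunk of blocks here.
    }
}

// Instead of `scheduler.work_buckets[...].add(packet)`, run the packets
// immediately on the current worker, avoiding all scheduling overhead.
fn prepare_blocks_inline<VM: VMBinding>(worker: &mut GCWorker<VM>, mmtk: &'static MMTK<VM>) {
    for _chunk in 0..4 {
        // The chunk count here is arbitrary and only for illustration.
        let mut packet = PrepareBlocksChunk;
        packet.do_work(worker, mmtk);
    }
}
```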
> ... They sound like a leaky abstraction. Plans just need to produce reasonably sized work packets, and it is the scheduler that takes care of load balancing. A plan should never generate 2 work packets because we only have 2 GC threads, or generate 10 packets for 10 GC threads.
I don't think it leaks abstraction. The number of workers is part of the GC, and `Plan::prepare` is a guideline for doing GC. For example, in Flood et al.'s implementation of ~~Compressor~~ parallel LISP-2 mark-compact, certain phases divided the space into regions, and the number of regions is proportional to the number of GC workers. It is debatable whether it is better to have fixed-size regions distributed to GC workers using work-packet load balancing, or variable-size regions depending on the number of GC workers. But exposing information about the workers allows plans to describe their algorithms in terms of GC workers.
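For example (a sketch only; it assumes the scheduler exposes the worker count, as `GCWorkScheduler::num_workers()` does in current mmtk-core, and the factor is made up):

```rust
use mmtk::scheduler::GCWorkScheduler;
use mmtk::vm::VMBinding;

// Hypothetical region sizing: the number of compaction regions is chosen in
// proportion to the number of GC workers, as in the parallel LISP-2 example.
fn num_compaction_regions<VM: VMBinding>(scheduler: &GCWorkScheduler<VM>) -> usize {
    const REGIONS_PER_WORKER: usize = 4; // illustrative factor
    scheduler.num_workers() * REGIONS_PER_WORKER
}
```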
And worker-local data structures like `BlockQueue` allow efficient multi-worker access without locking. It is natural that such data structures need to be initialized during `Prepare`. It is arguable whether we need a `&GCWorker` to initialize such data structures, or whether a `&GCWorkScheduler` is sufficient. But I think there may be cases where the GC worker that is initializing the data structure needs to access it immediately.
(Flood et al implement a parallel Lisp-2 mark compact, not the Compressor)
> Because we don't schedule extra work packets for the single-threaded GC case. You need to pass the GC worker struct to perform the work. So, for example, for Immix we need to clear the mark table in the prepare work bucket. The current code will create and schedule work packets there. But since we don't want to schedule work packets, we just pass the GC worker struct there and directly run those work packets. See here for an example: k-sareen@9791d9a#diff-773ce641225add0513dc4e369e385fa3bb5c21e37abe9707e623bd516392205aR480-R509 (the `prepare_blocks` function).
I don't feel this is a good way to implement single-threaded GC. Ideally, the plan and the policy do not need to know whether the GC is single-threaded or multi-threaded -- they use the same scheduler API. The scheduler would behave differently for single-threaded GC. I am not sure if that is possible or not, but I think we should try our best to encapsulate this difference within the scheduler.
It's more efficient to do it this way than to push work packets to the scheduler. There's scheduling overhead involved there. All I can say is that doing it the way above is much more efficient than pushing work packets and letting the scheduler assign work. (We've measured this in the context of OpenJDK and ART.)
> I don't feel this is a good way to implement single-threaded GC.
It's more than single-threaded GC. It's a GC done entirely within a single work packet. (Yes, we create work packets internally, but those are consumed directly/immediately.) We have single-threaded GC trivially right now by setting the number of GC threads to 1.
> There's scheduling overhead involved there.
I understand that. What I meant is that the scheduler could be conditionally compiled for "single-threaded GC" to remove those overheads. It would be great if the difference between single-threaded GC and normal multi-threaded GC could be encapsulated within the scheduler.
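Something along these lines, perhaps (a sketch only; `single_threaded_gc` is a made-up cargo feature and the helper is hypothetical, not an existing scheduler API):

```rust
use mmtk::memory_manager;
use mmtk::scheduler::{GCWork, GCWorker, WorkBucketStage};
use mmtk::vm::VMBinding;
use mmtk::MMTK;

// Hypothetical entry point that hides the single-threaded/multi-threaded
// difference from plans and policies.
fn schedule_or_run<VM: VMBinding, W: GCWork<VM>>(
    mut packet: W,
    bucket: WorkBucketStage,
    worker: &mut GCWorker<VM>,
    mmtk: &'static MMTK<VM>,
) {
    if cfg!(feature = "single_threaded_gc") {
        // Single-threaded configuration: consume the packet immediately on the
        // current worker, avoiding queueing and hand-off overhead.
        packet.do_work(worker, mmtk);
    } else {
        // Normal configuration: enqueue the packet as usual.
        memory_manager::add_work_packet(mmtk, bucket, packet);
    }
}
```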
Sure. All I was saying is that I think this change makes sense given I myself did something similar.