graal icon indicating copy to clipboard operation
graal copied to clipboard

Shenandoah support

Open rkennke opened this issue 9 months ago • 24 comments

This implements the barriers that are needed to run with Shenandoah GC in the Graal compiler. (Issue: https://github.com/oracle/graal/issues/3472)

There are 3 basic kinds of barriers needed for Shenandoah:

  • SATB barriers (aka pre-write-barrier, also needed for Reference.get() support). Those are pretty similar to G1's SATB barriers. Those barriers are inserted before reference stores, or in the case of Reference.get(), after the load of the referent. The SATB barriers are inserted in the node graph, and expanded to assembly in the respective backends. (Compared to the G1 backend, we implemented a slight improvement, where we move the mid-path-section into an out-of-line stub, similar to the slow-path. This should improve performance by helping static branch prediction. We may want to change G1 barriers in a similar fashion.)
  • Load-reference-barriers (LRB). Those are conceptually similar to ZGC's read-barriers, but differ in the implementation. Those barriers, too, are inserted as nodes, and expanded to assembly in the backends.
  • Card-marking barriers. Those are only needed when running with generational Shenandoah, and are similar to Serial and Parallel GC's card-marking barriers. However, in contrast to Serial and Parallel GC, those Shenandoah card-barriers are again inserted as nodes, and expanded to assembly in the backends. (We may want to adapt this code in Serial and Parallel and ditch their snippets-based implementation.)

Notice that none of the barriers are implemented as snippets (like Serial/Parallel's card-barriers) or in the backend-only (as ZGC's read-barriers). We needed a way to efficiently deal with compressed-oops, which is not (easily) possible to do in the backend. In the node-graph this is pretty easy: insert the LRB with preceding and succeeding uncompress/compress after any load and before the (potential) uncompress (i.e. turn load->uncompress into load -> (uncompress -> lrb -> compress) -> uncompress) and then let the optimizer optimize away the trailing compress -> uncompress pairs.

In order to support this, we needed a few additions:

  • The compression nodes now have a method that allows to add them without using unique(). If we used unique(), then the uncompress before the LRB would be matched with the original uncompress after the load, and we would cut out the LRB.
  • We moved the barrier insertion for Shenandoah from the mid-tier to the low-tier. This is needed because we can't insert barriers to FloatingReadNodes. We moved the barrier insertion to after fixing the read-nodes, at which point this is safe to do. Other GCs keep adding their barriers in the mid-tier. The mechanics is that BarrierSet defaults to mid-tier, but implementations can override this to add barriers in low-tier (instead, or additionally).

X86 port contributed by @JohnTortugo.

Testing:

  • Renaissance
  • SPECjvm2008
  • SPECjbb2015
  • DaCapo

(We have run those workloads for correctness testing only, we have not (yet) conducted a performance study.)

rkennke avatar Mar 21 '25 20:03 rkennke

If you have any questions about how to structure your changes feel free to ping me over slack as I was the primary author of the new LIR support for barriers.

tkrodriguez avatar Mar 25 '25 16:03 tkrodriguez

If you have any questions about how to structure your changes feel free to ping me over slack as I was the primary author of the new LIR support for barriers.

Thanks, Tom! I will do that whenever I get stuck or have questions. So far I'm making progress. Structurally, Shenandoah will be look like a mix of ZGC (for the load-barrier, even though I am modeling it as a Node that consumes the loaded value, instead of replacing the ReadNode altogether) and G1 (for the SATB parts), and likely Serial/Parallel for the card-table parts.

rkennke avatar Mar 25 '25 16:03 rkennke

Sounds good. There's still some more work to finish out the switch to LIR only barriers but I think supporting G1 and ZGC covers the required strategies in a fairly pragmatic way.

tkrodriguez avatar Mar 25 '25 16:03 tkrodriguez

@tkrodriguez can you please take another look at this. Once done, we can ask @davleopo and @gergo- to look at it.

dougxc avatar May 19 '25 12:05 dougxc

I'll take a look.

tkrodriguez avatar May 19 '25 20:05 tkrodriguez

Are the gate failures actual problems? It would be good to see a clean gate. Also, we should squash the history before committing.

tkrodriguez avatar May 19 '25 20:05 tkrodriguez

Are the gate failures actual problems? It would be good to see a clean gate.

I don't know what those problems are. I fixed everything that looked related to my changes. Those failures look like infra problems, some volumes seem to have run out of memory or something. I doubt that it is related.

Also, we should squash the history before committing. Ok, I can do that - tomorrow.

rkennke avatar May 19 '25 20:05 rkennke

@tkrodriguez There seem to be GHA failures that report SerialWriteBarriers not being Lowerable, coming from SubstrateVM. Could this be related to moving the barrier addition phase from mid- to low-tier? I don't think I have changed anything in SerialWriteBarrier or related code.

rkennke avatar May 20 '25 15:05 rkennke

Yes this is something I mentioned in our slack discussions. Moving WriteBarrierAdditionPhase after LowTierLoweringPhase creates problems for GCs that still use snippets since they expect to be lowered by LowTierLoweringPhase. In the long term I would like it to be done there but I'm not sure how to bridge the gap. I've got an internal PR based on your branch that I'm testing and I was going to look into this.

The options are to conditionalize the placement of WriteBarrierAdditionPhase based on methods from BarrierSet but that's not available when constructing the suites. We could have early and late WriteBarrierAdditionPhase to handle each case during the transition but that's a bit ugly to me. Or we could beef up WriteBarrierAdditionPhase to perform any required lowering itself, though that might be a bit complicated. It would also complicate stuff like BarrierSetVerificationPhase and some barrier elimination that's part of enterprise.

I'm going to try putting appendPhase(new PlaceholderPhase<>(WriteBarrierAdditionPhase.class)); in both MidTier and LowTier and do the placeholder replacement based on the BarrierSet.

tkrodriguez avatar May 20 '25 17:05 tkrodriguez

Yes this is something I mentioned in our slack discussions. Moving WriteBarrierAdditionPhase after LowTierLoweringPhase creates problems for GCs that still use snippets since they expect to be lowered by LowTierLoweringPhase. In the long term I would like it to be done there but I'm not sure how to bridge the gap. I've got an internal PR based on your branch that I'm testing and I was going to look into this.

The options are to conditionalize the placement of WriteBarrierAdditionPhase based on methods from BarrierSet but that's not available when constructing the suites. We could have early and late WriteBarrierAdditionPhase to handle each case during the transition but that's a bit ugly to me. Or we could beef up WriteBarrierAdditionPhase to perform any required lowering itself, though that might be a bit complicated. It would also complicate stuff like BarrierSetVerificationPhase and some barrier elimination that's part of enterprise.

I'm going to try putting appendPhase(new PlaceholderPhase<>(WriteBarrierAdditionPhase.class)); in both MidTier and LowTier and do the placeholder replacement based on the BarrierSet.

As far as I can see, it's only the SerialWriteBarrierNode which depends on snippets (is that right?). That should be relatively straightforward to implement without snippets and would look almost exactly like ShenandoahCardBarrierNode implementation - even a little simpler. I could work on implementing that, if you think that'd help.

rkennke avatar May 20 '25 17:05 rkennke

It would be easy to convert the HotSpot serial barrier to LIR but native image uses snippets for its barriers and its serial barrier is non-trivial. So we'll need to live with this mixed model for a little while I think. I'm beginning to think we might just need an early and late phase. I think the way barriers for vector writes work, we might have to do barrier addition for them before LowTierLowering, or at least before VectorLoweringPhase, which is before FixReadsPhase.

It might be too much to try to resolve all these issues in this PR. Since Shenandoah is currently HotSpot only maybe it would be best to special case its barrier insertion. I'll will play some more with this to see what would be best.

tkrodriguez avatar May 20 '25 18:05 tkrodriguez

It would be easy to convert the HotSpot serial barrier to LIR but native image uses snippets for its barriers and its serial barrier is non-trivial. So we'll need to live with this mixed model for a little while I think. I'm beginning to think we might just need an early and late phase. I think the way barriers for vector writes work, we might have to do barrier addition for them before LowTierLowering, or at least before VectorLoweringPhase, which is before FixReadsPhase.

It might be too much to try to resolve all these issues in this PR. Since Shenandoah is currently HotSpot only maybe it would be best to special case its barrier insertion. I'll will play some more with this to see what would be best.

I implemented barrier addition to trigger in both mid- and low-tier, and the BarrierSet implementation gets to choose which one (or both, if it wishes) is appropriate. This choice defaults to mid-tier, and can be overridden in implementations, like I did in ShenandoahBarrierSet. This way we get a clean gate :-)

rkennke avatar May 21 '25 09:05 rkennke

Thanks. I'll see whether that works ok in our full gate. I'd tried something slightly different but the StageFlags makes it hard to be super flexible about when these phases run. I might be tempted to keep only the phase that actually does the work in the final suite.

tkrodriguez avatar May 21 '25 15:05 tkrodriguez

Your fix works though I don't love some of the details. I'll just put comments on those places. I was able to get a clean internal gate with just minor changes in enterprise. I wasn't actually able to test Shenandoah because we don't have a labsjdk that includes it at the moment.

tkrodriguez avatar May 22 '25 18:05 tkrodriguez

Overall I think this looks good. I've only lightly reviewed the actual shenandoah parts since I don't really know anything about how the collector works. A high level JavaDoc comment on each of your newly added classes would be appreciated.

@davleopo @gergo- I think it's in good shape for review.

tkrodriguez avatar May 22 '25 18:05 tkrodriguez

Thanks very nice work, looks pretty ready I did not see any blockers from my side. I added some style/unification comments. One question is if we should already add necessary SyncPort annotations to the jdk code in question to capture changes asap.

davleopo avatar May 28 '25 07:05 davleopo

I forgot to mention here is the SyncPort class we use it to denote hot spot ports. When the upstream implementation changes we are informed in our internal testing.

davleopo avatar May 28 '25 09:05 davleopo

I've started doing some gate testing and found one problem.

mx benchmark renaissance:finagle-http -- -XX:+UseShenandoahGC -Xmx8g

throws a lot of exceptions like this and never completes

[finagle/netty4-2-23] WARN io.netty.channel.ChannelOutboundBuffer - Failed to mark a promise as success because it has succeeded already: Http2CodecUtil$SimpleChannelPromiseAggregator@4e30b251(uncancellable)
[finagle/netty4-2-23] WARN io.netty.util.concurrent.DefaultPromise - An exception was thrown by io.netty.handler.codec.http2.AbstractHttp2StreamChannel$Http2ChannelUnsafe$3.operationComplete()
java.lang.IllegalStateException: complete already: DefaultChannelPromise@23f179c2(uncancellable)
	at io.netty.util.concurrent.DefaultPromise.setSuccess(DefaultPromise.java:100)
	at io.netty.channel.DefaultChannelPromise.setSuccess(DefaultChannelPromise.java:78)
	at io.netty.channel.DefaultChannelPromise.setSuccess(DefaultChannelPromise.java:73)
	at io.netty.handler.codec.http2.AbstractHttp2StreamChannel$Http2ChannelUnsafe.writeComplete(AbstractHttp2StreamChannel.java:1084)
	at io.netty.handler.codec.http2.AbstractHttp2StreamChannel$Http2ChannelUnsafe.access$2100(AbstractHttp2StreamChannel.java:630)
	at io.netty.handler.codec.http2.AbstractHttp2StreamChannel$Http2ChannelUnsafe$3.operationComplete(AbstractHttp2StreamChannel.java:1061)
	at io.netty.handler.codec.http2.AbstractHttp2StreamChannel$Http2ChannelUnsafe$3.operationComplete(AbstractHttp2StreamChannel.java:1055)
	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:590)
	at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:583)
	at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:559)

tkrodriguez avatar May 29 '25 15:05 tkrodriguez

mx unittest -XX:+UseShenandoahGC jdk.graal.compiler.hotspot.test.GCBarrierEmissionTest shows several problems. On aarch there's a cbnz with size 8 but it should be 32 I think.

java.lang.AssertionError: 8
	at jdk.graal.compiler/jdk.graal.compiler.asm.aarch64.AArch64MacroAssembler.cbnz(AArch64MacroAssembler.java:1768)
	at jdk.graal.compiler/jdk.graal.compiler.hotspot.aarch64.shenandoah.AArch64HotSpotShenandoahLoadRefBarrierOp.lambda$emitCode$0(AArch64HotSpotShenandoahLoadRefBarrierOp.java:192)
	at jdk.graal.compiler/jdk.graal.compiler.lir.LIRInstruction$LIRInstructionSlowPath.emitSlowPathCode(LIRInstruction.java:78)
	at jdk.graal.compiler/jdk.graal.compiler.lir.asm.CompilationResultBuilder.emitSlowPath(CompilationResultBuilder.java:589)

and jdk.graal.compiler.hotspot.test.GCBarrierEmissionTest#getAndSetBarrier produces an unschedulable graph.

This test might be useful for you to look more closely at. I'd originally written it as a testbed for the ZGC port. It's designed so that you can compare the output of C2 and Graal for all the core barrier patterns. See https://github.com/oracle/graal/blob/master/compiler/src/jdk.graal.compiler.test/src/jdk/graal/compiler/hotspot/test/GCBarrierEmissionTest.java#L53. I have a little utility to split each nmethod dump into it's own file so you can do pairwise comparison if you are interested in that.

tkrodriguez avatar May 29 '25 16:05 tkrodriguez

One ugly missing piece is support for the fast truffle dispatch in the truffle implementation of EntryPointDecorator.emitEntryPoint. That code performs an explicit read of an obj in the method prologue assembly, so it might need special handling. See https://github.com/oracle/graal/blob/master/compiler/src/jdk.graal.compiler/src/jdk/graal/compiler/truffle/hotspot/amd64/AMD64TruffleCallBoundaryInstrumentationFactory.java#L59 and https://github.com/oracle/graal/blob/master/compiler/src/jdk.graal.compiler/src/jdk/graal/compiler/truffle/hotspot/aarch64/AArch64TruffleCallBoundaryInstrumentationFactory.java#L58. Hopefully it's fairly easy to support that. We hope to eliminate that extra read in https://bugs.openjdk.org/browse/JDK-8358006

tkrodriguez avatar May 29 '25 16:05 tkrodriguez

Thanks for updates. There are various style problems that need to be fixed. Unused imports, guarantee should use %s and missing copyrights. This is my script to run the minimal style check locally in parallel which speeds up the whole process

#!/bin/bash
export ECLIPSE_EXE=/Users/tkrodrig/Downloads/eclipse-4.26/Eclipse.app/Contents/MacOS/eclipse
set -e

# Use tmp build output for ecj so we don't have to recompile
tmpecjdir=/tmp/mxgatestyle-ecj.$$
trap "rm -fr $tmpecjdir" 2 0

# ensure it's built first
mx build
cat <<EOF | parallel -j 5 ::::
mx gate --task-filter SpotBugs
JDT=builtin MX_ALT_OUTPUT_ROOT=$tmpecjdir mx gate --task-filter BuildWithEcj
mx gate --task-filter CodeFormatCheck
mx gate --task-filter Checkstyle
mx unittest CheckGraal
EOF

The CodeFormatCheck step requires a downloaded eclipse 4.26 binary to work but you can probably skip that. The BuildWithEcj runs separately because we don't want to overwrite the javac compiled files.

tkrodriguez avatar Jun 05 '25 04:06 tkrodriguez

@tkrodriguez The finagle-http benchmark failed because of missing stuff in the CAS-ref barriers. I implemented those parts now. I also took a stab at the truffle thing you mentioned, at least on arm64, but I am not even sure that it is correct. Is there any way to test it? The x86 part of that is still missing, and I'll be off to vacation for the next two weeks. I'll try to squeeze it in later tonight, or maybe tomorrow, but if I can't, it'll have to wait. Same for the gate failures (at least part of them are waiting for https://github.com/openjdk/jdk/pull/25552 from OpenJDK upstream anyway).

rkennke avatar Jun 06 '25 16:06 rkennke

No rush on any of this, but I just wanted keep you posted on what I'm seeing with our testing.

As far as the truffle part, you can at least minimally exercise it with the truffle unit tests in the compiler suite. More extensive testing requires running the language tests. I've got tasks in our CI to exercise shenandoah that will test that out. So you can make an initial implementation that passes the basic unit tests and we can stress it in our gate.

The implementation should just be a directly invokeable entry point into the normal load barrier LIR op. So it should be mostly just be refactoring the op and maybe some manual tmp selection that's compatible with the live registers on entry. I can help you work out any kinks. It's an annoying part of the implementation.

tkrodriguez avatar Jun 06 '25 18:06 tkrodriguez

No rush on any of this, but I just wanted keep you posted on what I'm seeing with our testing.

As far as the truffle part, you can at least minimally exercise it with the truffle unit tests in the compiler suite. More extensive testing requires running the language tests. I've got tasks in our CI to exercise shenandoah that will test that out. So you can make an initial implementation that passes the basic unit tests and we can stress it in our gate.

The implementation should just be a directly invokeable entry point into the normal load barrier LIR op. So it should be mostly just be refactoring the op and maybe some manual tmp selection that's compatible with the live registers on entry. I can help you work out any kinks. It's an annoying part of the implementation.

I made a blind/best-effort implementation for the Truffle frame-setup parts, based on what I could understand from the context and the ZGC implementation. It's pretty straightforward. An open question in the x86 parts is: I need two temporary registers, and I am not sure which ones would be free to use. I used r9 and r11, they are caller-saved in normal calling conventions, but I am not sure if this is the correct assumption in that code. Also, I haven't tested it, yet. Let me know if you want anything else changed. I might not get to it before June 23, though...

rkennke avatar Jun 06 '25 19:06 rkennke

@tkrodriguez @mur47x111 @dougxc @davleopo I merged latest master, the missing symbol problem disappeared, as expected. Then I fixed a problem with a unit-test complaining about guarantee() calls with string concats, which made the gate tests clean. I also squashed all changesets into a single commit (again). I think I addressed all review comments so far, it should be ready for re-review now.

rkennke avatar Jun 24 '25 14:06 rkennke

Thanks. I'll launch our internal gate and see where we stand on testing.

tkrodriguez avatar Jun 24 '25 15:06 tkrodriguez

@tkrodriguez what is the status of this testing?

rkennke avatar Jul 07 '25 17:07 rkennke

Mostly the gate looks good. There are problems with the amd64 truffle barrier that I'm fixing. I don't know if I can push here but I can provide the patch once it's working if not. Then I can launch the full benchmark suites too.

tkrodriguez avatar Jul 08 '25 15:07 tkrodriguez

Mostly the gate looks good. There are problems with the amd64 truffle barrier that I'm fixing. I don't know if I can push here but I can provide the patch once it's working if not. Then I can launch the full benchmark suites too.

Ok, thank you! What is the status of it? If you have a patch, I will happily intergrate it into this PR.

rkennke avatar Jul 17 '25 13:07 rkennke

@tkrodriguez @thomaswue ping - what is the status of this PR? I'm waiting for a fix for aarch64/truffle from Tom, and it also is waiting for an approval and hopefully can be integrated soon (in time for jdk25?)

rkennke avatar Jul 30 '25 08:07 rkennke