jdk icon indicating copy to clipboard operation
jdk copied to clipboard

8291237: Encapsulate nmethod Deoptimization logic

Open xmas92 opened this issue 3 years ago • 15 comments

The proposal is to encapsulate the nmethod mark for deoptimization logic in one place and only allow access to the mark_for_deoptimization from a closure object:

class DeoptimizationMarkerClosure : StackObj {
public:
  virtual void marker_do(Deoptimization::MarkFn mark_fn) = 0;
};

This closure takes a MarkFn which it uses to mark which nmethods should be deoptimized. This marking can only be done through the MarkFn and a MarkFn can only be created in the following code which runs the closure.

{
  NoSafepointVerifier nsv;
  assert_locked_or_safepoint(Compile_lock);
  marker_closure.marker_do(MarkFn());
  anything_deoptimized = deoptimize_all_marked();
}
if (anything_deoptimized) {
  run_deoptimize_closure();
}

This ensures that this logic is encapsulated and the NoSafepointVerifier and assert_locked_or_safepoint(Compile_lock) makes deoptimize_all_marked not having to scan the whole code cache sound.

The exception to this pattern, from InstanceKlass::unload_class, is discussed in the JBS issue, and gives reasons why not marking for deoptimization there is ok.

An effect of this encapsulation is that the deoptimization logic was moved from the CodeCache class to the Deoptimization class and the class redefinition logic was moved from the CodeCache class to the VM_RedefineClasses class/operation.

Testing: Tier 1-5

Update

Switched too using a RAII object to track the context instead of putting code in a closure. But all the encapsulation is still the same.

Testing: Tier 1-7

Update

@stefank suggested splitting out unloading klass logic change into a separate issue JDK-8291718.

Will probably also limit this PR to only encapsulation. (Skipping the linked list optimisation) And create a separate issue for that as well.

But this creates a chain of three dependent issues. JDK-8291237 depends on JDK-8291718. And the link list optimisation depend will depend on JDK-8291237.

Will mark this as a draft for now and create a PR for JDK-8291718 first.


Progress

  • [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • [x] Change must not contain extraneous whitespace
  • [x] Commit message must refer to an issue

Issue

  • JDK-8291237: Encapsulate nmethod Deoptimization logic

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/9655/head:pull/9655
$ git checkout pull/9655

Update a local copy of the PR:
$ git checkout pull/9655
$ git pull https://git.openjdk.org/jdk pull/9655/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 9655

View PR using the GUI difftool:
$ git pr show -t 9655

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/9655.diff

xmas92 avatar Jul 27 '22 12:07 xmas92

:wave: Welcome back xmas92! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

bridgekeeper[bot] avatar Jul 27 '22 12:07 bridgekeeper[bot]

@xmas92 The following labels will be automatically applied to this pull request:

  • hotspot
  • serviceability

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

openjdk[bot] avatar Jul 27 '22 13:07 openjdk[bot]

@fisk suggested using a RAII context object instead of a closure to guarantee the encapsulated invariants. Testing the patch now and will push when done. Not having to create a closure everywhere makes the code less verbose and more readable.

xmas92 avatar Jul 28 '22 15:07 xmas92

@stefank suggested splitting out unloading klass logic change into a separate issue JDK-8291718.

Will probably also limit this PR to only encapsulation. (Skipping the linked list optimisation) And create a separate issue for that as well.

But this creates a chain of three dependent issues. JDK-8291237 depends on JDK-8291718. And the link list optimisation depend will depend on JDK-8291237.

Will mark this as a draft for now and create a PR for JDK-8291718 first.

xmas92 avatar Aug 02 '22 14:08 xmas92

Now that you are building a separate list of nmethods to deoptimize, what protects that list from nmethods being recycled by other threads? Don't you need to hold the CodeCache_lock? And when an nmethod is destroyed how about asserting that _root_mark_link == NULL and _mark_link == NULL?

dean-long avatar Aug 02 '22 21:08 dean-long

As my previous comment will probably back out the linked list of this PR and create a separate issue that depends on this encapsulation that only deal with swapping to a linked list instead of walking the whole code cache.

But as it will be based on this code your question is still relevant.

I am pretty new to the codebase so please correct me if I am wrong, especially with regards to different locks, their scope and usage. Here is my understanding of why the linked list is valid as is.

While working with the list we have a few things which are true:

  • assert_locked_or_safepoint(Compile_lock); anywhere were we modify the list,
  • once something is linked it cannot be linked again, as it is only linked if it is not_marked and the status can only go not_marked -> deoptimize | deoptimize_noupdate -> deoptimize_done.
    • assert(extract_compiled_method(_mark_link) == nullptr, ...) for some extra sanity around this
    • While the MarkForDeoptimizationStatus value makes linking something already linked impossible the assert above is not enough to catch if a tail of the list is linked in creating a cycle. Can either add an assert that walks the list and checks that the tail != this, or the next_marked() code can be changed to take_next_marked() which sets the field to nullptr and thus breaks any cycles when iterating (avoids infinite loops). The second would be alright as creating a loop from the tail to the root will never drop parts of the list, even if more elements are added after the cycle is created, as the assert(extract_compiled_method(_mark_link) == nullptr, ...) assert that the list will not break.
  • From the creation of a DeoptimizationContext to until the linked list has been processed (make_not_entrant and make_deoptimized) we have NoSafepointVerifier (which is the time to safe point issue this change is trying to address)
  • And this active part of the DeoptimizationContext cannot overlap with another active part of another DeoptimizationContext, checked via a bool flag. Maybe it is not correct to have a non volatile static bool here, but I though assert_locked_or_safepoint(Compile_lock); would be enough to guarantee synchronisation. But should probably change to load acquire atomics.

I think most of the terminology will change to make it more deoptimization specific. Having it called not_marked, _root_mark_link, _mark_link, next_mark() and take_root() is to general and can lead to confusion. Another reason to split this out to a separate issue. Also using some other terminology than mark could be used.

xmas92 avatar Aug 03 '22 13:08 xmas92

It's nmethod::flush() I'm worried about. It uses CodeCache_lock, not Compile_lock. It calls CodeCache::free() to destroy the nmethod.

dean-long avatar Aug 03 '22 19:08 dean-long

The nmethods we link together are is_alive. There is also no safepoint. It would be impossible for such nmethods to be flushed without a safepoint. It could racingly flip to zombie, but it would need yet another thread-local handshake from the sweeper for it to flush the nmethods. That can not happen without safepoints in this code path.

fisk avatar Aug 03 '22 20:08 fisk

Also, in DeoptimizationContext::deopt_compiled_methods, the SweeperBlocker completely blocks out the sweeper from running. But as I mentioned, even without that, without safepoint checks, we can never flush these things. It's worth mentioning that there used to be a special case for OSR nmethods that they could be flushed immediately and skip the zombie step. But I removed that a few tears ago so we wouldn't have to think about that pathological case separately.

fisk avatar Aug 03 '22 20:08 fisk

@fisk, is there any chance that in the future we might figure out how to allow nmethods to be flushed without a safepoint? Maybe with handshakes or some other clever improvement? If no, then maybe a comment saying so would help. If yes, then are there some asserts we could add that would catch such a change?

dean-long avatar Aug 05 '22 02:08 dean-long

@fisk, is there any chance that in the future we might figure out how to allow nmethods to be flushed without a safepoint? Maybe with handshakes or some other clever improvement? If no, then maybe a comment saying so would help. If yes, then are there some asserts we could add that would catch such a change?

Yes. With concurrent class unloading, nmethods are unloaded during concurrent execution. After the sweeper removal, this is done in 2 phases:

  1. We unlink() the nmethod
  2. A thread-local handshake with all JavaThreads (who might have gotten a reference to it before unlinking)
  3. We flush() the nmethod

fisk avatar Aug 05 '22 05:08 fisk

If the nmethod shouldn't be on the marked list at certain state transitions, how about adding asserts that check this? Using the same "self looped" trick that the Sweeper removal PR uses might help with that.

dean-long avatar Aug 08 '22 20:08 dean-long

⚠️ @xmas92 This pull request contains merges that bring in commits not present in the target repository. Since this is not a "merge style" pull request, these changes will be squashed when this pull request in integrated. If this is your intention, then please ignore this message. If you want to preserve the commit structure, you must change the title of this pull request to Merge <project>:<branch> where <project> is the name of another project in the OpenJDK organization (for example Merge jdk:master).

openjdk[bot] avatar Aug 10 '22 15:08 openjdk[bot]

Merged JDK-8291718 into this as this change is dependent on that patch. Backed out the linked list enhancement. This is still a draft, and still need some work on the semantics of some enum statuses, and naming.

Changed the terminology for deoptimization to stop using mark. Because the steps for deoptimization is always the following:

  1. Create a list of nmethods to deoptimize (enqueueing)
  2. Process the list 2.1. Make nmethod non entrant 2.2. (With continuations) patch nmethod NativePostCallNop instructions
  3. Walk all frames and patch return pcs

The mark_for was renamed enqueue, this makes it easier to talk about deoptimization in a context where marking is already a thing (e.g. GCs) and is more correct with regards to what the codes actually does.

All remaining changes are with regards to deoptimize_done and that it does not really mean what it says. The comment in update_recompile_counts() describes the issue. The most correct thing would be to rename it to something like deoptimize_post_call_installed or deoptimize_post_call_patched and just leave it like that. No code currently cares if the deoptimization logic is completed (with regards to stack frame walking), only if an nmethod has enqueued deoptimization or not, and if NativePostCallNop instructions has been installed for loom.

xmas92 avatar Aug 10 '22 15:08 xmas92

Opening this PR for review.

Terminology for deoptimizing compiled methods has been updated and the process encapsulated.

The terminology was chosen to better reflect what is happening and not clash with other terminology used when working with compiled methods.

The encapsulation was done with a RAII object which enables more rigorous assertion of the usage and behaviour of the code.

The fix for the deviation mentioned in the issue and the comments above was address in JDK-8291718 and has been merged.

I made some comments about moving out the linked list optimisation into a separate issue, however there are two tests that currently are sensitive to (I suspect) asserts inside the critical sections of the deoptimization. The tests are:

  • vmTestbase/vm/mlvm/indy/stress/java/volatileCallSiteDekker/TestDescription.java
  • vmTestbase/vm/mlvm/indy/stress/java/mutableCallSiteDekker/TestDescription.java

They will (most of the time) timeout on linux debug builds, but succeed with a larger timeoutFactor. I do not yet have access to a linux machine to debug and investigate more thoroughly, but I suspect that, as this test just keep rebinding CallSites in a loop, they trigger a lot of deoptimizations which creates a lot of contention on the Compile_lock and CodeCache_lock, and the longer critical sections (due to more asserts) increase the probability of contention slowdowns.

With the bea67de the timeout problem goes away as we are not walking the CodeCache when we do not need to. The unnecessary CodeCache walking time to safepoint issue is what motivated the encapsulation in the first place.

Testing: Oracle Platforms Tier 1-5

xmas92 avatar Aug 18 '22 08:08 xmas92

@xmas92 this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout JDK-8291237
git fetch https://git.openjdk.org/jdk master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

openjdk[bot] avatar Aug 29 '22 08:08 openjdk[bot]

Just an update and a call for reviews.

JDK-8290025 / #9741 was merged in 030ed7e which makes this change easier to reson about as the lifetime of nmethods are simpler. The merge was tested on oracle platforms tier 1-5.

xmas92 avatar Sep 09 '22 06:09 xmas92

@xmas92 This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

bridgekeeper[bot] avatar Oct 07 '22 07:10 bridgekeeper[bot]

@xmas92 This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the /open pull request command.

bridgekeeper[bot] avatar Nov 04 '22 18:11 bridgekeeper[bot]