jdk
jdk copied to clipboard
8291237: Encapsulate nmethod Deoptimization logic
The proposal is to encapsulate the nmethod mark for deoptimization logic in one place and only allow access to the mark_for_deoptimization from a closure object:
class DeoptimizationMarkerClosure : StackObj {
public:
virtual void marker_do(Deoptimization::MarkFn mark_fn) = 0;
};
This closure takes a MarkFn which it uses to mark which nmethods should be deoptimized. This marking can only be done through the MarkFn and a MarkFn can only be created in the following code which runs the closure.
{
NoSafepointVerifier nsv;
assert_locked_or_safepoint(Compile_lock);
marker_closure.marker_do(MarkFn());
anything_deoptimized = deoptimize_all_marked();
}
if (anything_deoptimized) {
run_deoptimize_closure();
}
This ensures that this logic is encapsulated and the NoSafepointVerifier and assert_locked_or_safepoint(Compile_lock) makes deoptimize_all_marked not having to scan the whole code cache sound.
The exception to this pattern, from InstanceKlass::unload_class, is discussed in the JBS issue, and gives reasons why not marking for deoptimization there is ok.
An effect of this encapsulation is that the deoptimization logic was moved from the CodeCache class to the Deoptimization class and the class redefinition logic was moved from the CodeCache class to the VM_RedefineClasses class/operation.
Testing: Tier 1-5
Update
Switched too using a RAII object to track the context instead of putting code in a closure. But all the encapsulation is still the same.
Testing: Tier 1-7
Update
@stefank suggested splitting out unloading klass logic change into a separate issue JDK-8291718.
Will probably also limit this PR to only encapsulation. (Skipping the linked list optimisation) And create a separate issue for that as well.
But this creates a chain of three dependent issues. JDK-8291237 depends on JDK-8291718. And the link list optimisation depend will depend on JDK-8291237.
Will mark this as a draft for now and create a PR for JDK-8291718 first.
Progress
- [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
- [x] Change must not contain extraneous whitespace
- [x] Commit message must refer to an issue
Issue
- JDK-8291237: Encapsulate nmethod Deoptimization logic
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/9655/head:pull/9655
$ git checkout pull/9655
Update a local copy of the PR:
$ git checkout pull/9655
$ git pull https://git.openjdk.org/jdk pull/9655/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 9655
View PR using the GUI difftool:
$ git pr show -t 9655
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/9655.diff
:wave: Welcome back xmas92! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.
@xmas92 The following labels will be automatically applied to this pull request:
hotspotserviceability
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.
Webrevs
- 08: Full (c35f5ed6)
- 07: Full (11d9dd24)
- 06: Full - Incremental (ab144688)
- 05: Full - Incremental (4b95396c)
- 04: Full - Incremental (bea67dea)
- 03: Full (2be7b152)
- 02: Full - Incremental (9f6f981d)
- 01: Full - Incremental (5d26ab8d)
- 00: Full (f77f44eb)
@fisk suggested using a RAII context object instead of a closure to guarantee the encapsulated invariants. Testing the patch now and will push when done. Not having to create a closure everywhere makes the code less verbose and more readable.
@stefank suggested splitting out unloading klass logic change into a separate issue JDK-8291718.
Will probably also limit this PR to only encapsulation. (Skipping the linked list optimisation) And create a separate issue for that as well.
But this creates a chain of three dependent issues. JDK-8291237 depends on JDK-8291718. And the link list optimisation depend will depend on JDK-8291237.
Will mark this as a draft for now and create a PR for JDK-8291718 first.
Now that you are building a separate list of nmethods to deoptimize, what protects that list from nmethods being recycled by other threads? Don't you need to hold the CodeCache_lock? And when an nmethod is destroyed how about asserting that _root_mark_link == NULL and _mark_link == NULL?
As my previous comment will probably back out the linked list of this PR and create a separate issue that depends on this encapsulation that only deal with swapping to a linked list instead of walking the whole code cache.
But as it will be based on this code your question is still relevant.
I am pretty new to the codebase so please correct me if I am wrong, especially with regards to different locks, their scope and usage. Here is my understanding of why the linked list is valid as is.
While working with the list we have a few things which are true:
assert_locked_or_safepoint(Compile_lock);anywhere were we modify the list,- once something is linked it cannot be linked again, as it is only linked if it is
not_markedand the status can only gonot_marked -> deoptimize | deoptimize_noupdate -> deoptimize_done.assert(extract_compiled_method(_mark_link) == nullptr, ...)for some extra sanity around this- While the
MarkForDeoptimizationStatusvalue makes linking something already linked impossible the assert above is not enough to catch if a tail of the list is linked in creating a cycle. Can either add an assert that walks the list and checks that thetail != this, or thenext_marked()code can be changed totake_next_marked()which sets the field tonullptrand thus breaks any cycles when iterating (avoids infinite loops). The second would be alright as creating a loop from the tail to the root will never drop parts of the list, even if more elements are added after the cycle is created, as theassert(extract_compiled_method(_mark_link) == nullptr, ...)assert that the list will not break.
- From the creation of a
DeoptimizationContextto until the linked list has been processed (make_not_entrantandmake_deoptimized) we haveNoSafepointVerifier(which is the time to safe point issue this change is trying to address) - And this active part of the DeoptimizationContext cannot overlap with another active part of another DeoptimizationContext, checked via a bool flag. Maybe it is not correct to have a non volatile static bool here, but I though
assert_locked_or_safepoint(Compile_lock);would be enough to guarantee synchronisation. But should probably change to load acquire atomics.
I think most of the terminology will change to make it more deoptimization specific. Having it called not_marked, _root_mark_link, _mark_link, next_mark() and take_root() is to general and can lead to confusion. Another reason to split this out to a separate issue. Also using some other terminology than mark could be used.
It's nmethod::flush() I'm worried about. It uses CodeCache_lock, not Compile_lock. It calls CodeCache::free() to destroy the nmethod.
The nmethods we link together are is_alive. There is also no safepoint. It would be impossible for such nmethods to be flushed without a safepoint. It could racingly flip to zombie, but it would need yet another thread-local handshake from the sweeper for it to flush the nmethods. That can not happen without safepoints in this code path.
Also, in DeoptimizationContext::deopt_compiled_methods, the SweeperBlocker completely blocks out the sweeper from running. But as I mentioned, even without that, without safepoint checks, we can never flush these things. It's worth mentioning that there used to be a special case for OSR nmethods that they could be flushed immediately and skip the zombie step. But I removed that a few tears ago so we wouldn't have to think about that pathological case separately.
@fisk, is there any chance that in the future we might figure out how to allow nmethods to be flushed without a safepoint? Maybe with handshakes or some other clever improvement? If no, then maybe a comment saying so would help. If yes, then are there some asserts we could add that would catch such a change?
@fisk, is there any chance that in the future we might figure out how to allow nmethods to be flushed without a safepoint? Maybe with handshakes or some other clever improvement? If no, then maybe a comment saying so would help. If yes, then are there some asserts we could add that would catch such a change?
Yes. With concurrent class unloading, nmethods are unloaded during concurrent execution. After the sweeper removal, this is done in 2 phases:
- We unlink() the nmethod
- A thread-local handshake with all JavaThreads (who might have gotten a reference to it before unlinking)
- We flush() the nmethod
If the nmethod shouldn't be on the marked list at certain state transitions, how about adding asserts that check this? Using the same "self looped" trick that the Sweeper removal PR uses might help with that.
⚠️ @xmas92 This pull request contains merges that bring in commits not present in the target repository. Since this is not a "merge style" pull request, these changes will be squashed when this pull request in integrated. If this is your intention, then please ignore this message. If you want to preserve the commit structure, you must change the title of this pull request to Merge <project>:<branch> where <project> is the name of another project in the OpenJDK organization (for example Merge jdk:master).
Merged JDK-8291718 into this as this change is dependent on that patch. Backed out the linked list enhancement. This is still a draft, and still need some work on the semantics of some enum statuses, and naming.
Changed the terminology for deoptimization to stop using mark. Because the steps for deoptimization is always the following:
- Create a list of nmethods to deoptimize (enqueueing)
- Process the list
2.1. Make
nmethodnon entrant 2.2. (With continuations) patchnmethodNativePostCallNopinstructions - Walk all frames and patch return pcs
The mark_for was renamed enqueue, this makes it easier to talk about deoptimization in a context where marking is already a thing (e.g. GCs) and is more correct with regards to what the codes actually does.
All remaining changes are with regards to deoptimize_done and that it does not really mean what it says. The comment in update_recompile_counts() describes the issue. The most correct thing would be to rename it to something like deoptimize_post_call_installed or deoptimize_post_call_patched and just leave it like that. No code currently cares if the deoptimization logic is completed (with regards to stack frame walking), only if an nmethod has enqueued deoptimization or not, and if NativePostCallNop instructions has been installed for loom.
Opening this PR for review.
Terminology for deoptimizing compiled methods has been updated and the process encapsulated.
The terminology was chosen to better reflect what is happening and not clash with other terminology used when working with compiled methods.
The encapsulation was done with a RAII object which enables more rigorous assertion of the usage and behaviour of the code.
The fix for the deviation mentioned in the issue and the comments above was address in JDK-8291718 and has been merged.
I made some comments about moving out the linked list optimisation into a separate issue, however there are two tests that currently are sensitive to (I suspect) asserts inside the critical sections of the deoptimization. The tests are:
vmTestbase/vm/mlvm/indy/stress/java/volatileCallSiteDekker/TestDescription.javavmTestbase/vm/mlvm/indy/stress/java/mutableCallSiteDekker/TestDescription.java
They will (most of the time) timeout on linux debug builds, but succeed with a larger timeoutFactor. I do not yet have access to a linux machine to debug and investigate more thoroughly, but I suspect that, as this test just keep rebinding CallSites in a loop, they trigger a lot of deoptimizations which creates a lot of contention on the Compile_lock and CodeCache_lock, and the longer critical sections (due to more asserts) increase the probability of contention slowdowns.
With the bea67de the timeout problem goes away as we are not walking the CodeCache when we do not need to. The unnecessary CodeCache walking time to safepoint issue is what motivated the encapsulation in the first place.
Testing: Oracle Platforms Tier 1-5
@xmas92 this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:
git checkout JDK-8291237
git fetch https://git.openjdk.org/jdk master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push
Just an update and a call for reviews.
JDK-8290025 / #9741 was merged in 030ed7e which makes this change easier to reson about as the lifetime of nmethods are simpler. The merge was tested on oracle platforms tier 1-5.
@xmas92 This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!
@xmas92 This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the /open pull request command.