Replace read/write lock in JarResource to avoid virtual threads pinning
This is a proposed solution for the problem reported here.
I'm avoiding blocking on the ReadWriteLock and emulating its behavior with a state machine plus an atomic readers counter, because the zipFile access may happen inside a native call (for example, one triggered by the RunnerClassLoader), and in that case it is necessary to avoid blocking.
Note that while this avoids blocking and thus prevents pinning, I'm afraid the solution could suffer from monopolization. I tried to mitigate the problem with the yields at lines 160 and 183, but I'm not sure whether this is a good idea, especially with virtual threads. Any feedback or suggestions for improvement are welcome.
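To make the idea above concrete, here is a minimal sketch of a non-blocking gate built from a state machine and an atomic readers counter. All names and states here are illustrative assumptions, not the PR's actual code; the point is only that a reader can enter or bail out without ever parking the thread.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the "state machine + atomic readers counter" idea:
// a reader either enters the open resource or fails fast, never blocking.
final class NonBlockingGate {
    static final int CLOSED = 0, OPENING = 1, OPEN = 2, CLOSING = 3;

    final AtomicInteger state = new AtomicInteger(CLOSED);
    final AtomicInteger readers = new AtomicInteger();

    // Try to enter as a reader without ever parking the thread.
    boolean tryStartRead() {
        if (state.get() != OPEN) {
            return false; // caller may retry/yield instead of blocking
        }
        readers.incrementAndGet();
        // Re-check: a concurrent close may have started meanwhile.
        if (state.get() != OPEN) {
            readers.decrementAndGet();
            return false;
        }
        return true;
    }

    void endRead() {
        readers.decrementAndGet();
    }
}
```

A caller that gets `false` would yield and retry (which is exactly where the monopolization concern mentioned above comes from).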
/cc @franz1981 @geoand @Sanne @dmlloyd @cescoffier
Thanks for your pull request!
The title of your pull request does not follow our editorial rules. Could you have a look?
- title should preferably start with an uppercase character (if it makes sense!)
This message is automatically generated by a bot.
:waning_crescent_moon: This workflow status is outdated as a new workflow run has been triggered.
Status for workflow Quarkus CI
This is the status report for running Quarkus CI on commit c2c8c189be75ce3b30c016ad0d3d8e9e7666cf7a.
Failing Jobs
| Status | Name | Step | Failures | Logs | Raw logs | Build scan |
|---|---|---|---|---|---|---|
| ✖ | Initial JDK 17 Build | Build | Failures | Logs | Raw logs | :mag: |
You can consult the Develocity build scans.
Failures
:gear: Initial JDK 17 Build
- Failing: independent-projects/bootstrap/runner
! Skipped: core/deployment core/runtime devtools/bom-descriptor-json and 844 more
:package: independent-projects/bootstrap/runner
✖ Failed to execute goal net.revelc.code.formatter:formatter-maven-plugin:2.24.0:validate (default) on project quarkus-bootstrap-runner: File '/home/runner/work/quarkus/quarkus/independent-projects/bootstrap/runner/src/main/java/io/quarkus/bootstrap/runner/JarResource.java' has not been previously formatted. Please format file (for example by invoking `mvn -f independent-projects/bootstrap/runner net.revelc.code.formatter:formatter-maven-plugin:2.24.0:format`) and commit before running validation!
✖ Failed to execute goal net.revelc.code.formatter:formatter-maven-plugin:2.24.0:validate (default) on project quarkus-bootstrap-runner: File '/home/runner/work/quarkus/quarkus/independent-projects/bootstrap/runner/src/main/java/io/quarkus/bootstrap/runner/JarResource.java' has not been previously formatted. Please format file (for example by invoking `mvn -f independent-projects/bootstrap/runner net.revelc.code.formatter:formatter-maven-plugin:2.24.0:format`) and commit before running validation!
I tried to format the code using that maven command, but when I run it, it doesn't change anything in the java source (or in any other file). Am I missing something?
I would make a separate RefCntResourceFile class which contains a volatile field pointing to the file to create and an atomic ref count to handle ownership and, eventually, the close (the release which puts the ref count back to zero gets the right to close the file).
In the JarResource I would have an AtomicReference<RefCntResourceFile> which can be in the states null <-> not null; because RefCntResourceFile is a new instance on each opening transition, this avoids ABA issues.
The algorithm can be made more cooperative and progressive by allowing contended operations to "help": e.g. concurrent opens can compete to populate the volatile field in the currently set RefCntResourceFile, avoiding spin waits at the cost of allocating more memory.
Similarly, a close doesn't need to wait until all readers are gone, but can perform a RefCntResourceFile::release which, if it returns true, will attempt to move the shared AtomicReference<RefCntResourceFile> to null; if it returns false, the close will likely be performed by the last reader when it finishes reading (each read must be surrounded by a retain and release on the current RefCntResourceFile).
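The retain/release protocol suggested above can be sketched as follows. The class and method names (RefCntResourceFile, retain, release) are taken from the suggestion but the body is an illustrative assumption, not the actual Quarkus implementation:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Illustrative sketch of the suggested reference-counting protocol.
final class RefCntResourceFile {
    // Stands in for the lazily opened ZipFile; populated cooperatively.
    volatile Object file;
    // Starts at 1: the "owner" reference held by the enclosing JarResource.
    private final AtomicInteger refCount = new AtomicInteger(1);

    // Returns false if the resource is already closed (count reached 0),
    // so a reader knows it must fetch or create a fresh instance.
    boolean retain() {
        for (;;) {
            int current = refCount.get();
            if (current == 0) {
                return false; // closed for good; cannot be revived (no ABA)
            }
            if (refCount.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    // Returns true only for the caller that moved the count to 0,
    // i.e. the one allowed to actually close the underlying file.
    boolean release() {
        return refCount.decrementAndGet() == 0;
    }
}

final class JarResourceSketch {
    // null <-> non-null; a fresh RefCntResourceFile per opening transition
    // makes the CAS transitions immune to ABA issues.
    final AtomicReference<RefCntResourceFile> jarFile = new AtomicReference<>();
}
```

Each read is bracketed by `retain()`/`release()`, and whichever party (reader or closer) drops the count to zero performs the real close.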
✖ Failed to execute goal net.revelc.code.formatter:formatter-maven-plugin:2.24.0:validate (default) on project quarkus-bootstrap-runner: File '/home/runner/work/quarkus/quarkus/independent-projects/bootstrap/runner/src/main/java/io/quarkus/bootstrap/runner/JarResource.java' has not been previously formatted. Please format file (for example by invoking `mvn -f independent-projects/bootstrap/runner net.revelc.code.formatter:formatter-maven-plugin:2.24.0:format`) and commit before running validation!
I tried to format the code using that maven command, but when I run it, it doesn't change anything in the java source (or in any other file). Am I missing something?
https://quarkusio.zulipchat.com/#narrow/stream/187038-dev/topic/source.20code.20not.20formatted seems to be relevant
I reworked this pull request following @franz1981's suggestions. Please review again.
@franz1981 I tried to follow all your suggestions, and I believe I reached a far better level of encapsulation in the JarFileReference with the introduction of a non-capturing lambda whose purpose is to consume the jarFile, hiding all the complexity of retrieving a valid, non-closed instance and keeping track of the readers count. The only thing I haven't been able to remove is the need for that boolean releasing, and I tried to explain why I believe it is still necessary. Please review again :pray:
Also, when all else is done, we'll need to squash the commits
@franz1981 I think I've finally understood what you meant (I'm getting old) and implemented the reference counter accordingly. Please review again.
Given the amount of work and discussions related to this issue, could we have a simple revert PR targeting 3.11?
Whatever gets into 3.12 wouldn't be backportable.
Thanks.
Given the amount of work and discussions related to this issue, could we have a simple revert PR targeting 3.11?
+1
Given the amount of work and discussions related to this issue, could we have a simple revert PR targeting 3.11?
Whatever gets into 3.12 wouldn't be backportable.
Do you mean we will have a 3.11.1 patch release where we simply revert the use of registerAsParallelCapable() on the RunnerClassLoader, while this fix (when it is ready, and we should be almost there) goes into 3.12.0? If so, that's ok for me.
Yeah, @gsmet is proposing reverting #40033 in the 3.11 branch - which would be done by opening a PR against that branch specifically
Also, when all else is done, we'll need to squash the commits
Bumping this so it does not get lost :)
Yeah, @gsmet is proposing reverting https://github.com/quarkusio/quarkus/pull/40033 in the 3.11 branch - which would be done by opening a PR against that branch specifically
From what I have experimented with these days, I don't think that a full revert will be necessary; it would be sufficient to just remove the statement calling registerAsParallelCapable() on the RunnerClassLoader. Anyway, up to you.
Also, when all else is done, we'll need to squash the commits
Bumping this so it does not get lost :)
That's ok, I didn't forget it, I just wanted to finish discussing all the details with @franz1981 before squashing :)
From what I have experimented with these days, I don't think that a full revert will be necessary; it would be sufficient to just remove the statement calling registerAsParallelCapable() on the RunnerClassLoader. Anyway, up to you.
Yeah, that's almost certainly the case, but I would say that it's safest to go back to the exact previous state if we are not going to include this
:waning_crescent_moon: This workflow status is outdated as a new workflow run has been triggered.
Status for workflow Quarkus CI
This is the status report for running Quarkus CI on commit 198ff4941fccc82d793fe7829a6c7609bbfe9bc5.
Failing Jobs
| Status | Name | Step | Failures | Logs | Raw logs | Build scan |
|---|---|---|---|---|---|---|
| ✖ | Kubernetes Tests - JDK 17 | Build | :warning: Check → | Logs | Raw logs | :construction: |
| :heavy_check_mark: | Kubernetes Tests - JDK 21 | | | Logs | Raw logs | :mag: |
| :heavy_check_mark: | Kubernetes Tests - JDK 17 Windows | | | Logs | Raw logs | :mag: |
You can consult the Develocity build scans.
Flaky tests - Develocity
:gear: JVM Tests - JDK 17
:package: integration-tests/reactive-messaging-kafka
✖ io.quarkus.it.kafka.KafkaConnectorTest.testFruits - History
Assertion condition defined as a Lambda expression in io.quarkus.it.kafka.KafkaConnectorTest expected: <6> but was: <5> within 10 seconds. - org.awaitility.core.ConditionTimeoutException
org.awaitility.core.ConditionTimeoutException: Assertion condition defined as a Lambda expression in io.quarkus.it.kafka.KafkaConnectorTest expected: <6> but was: <5> within 10 seconds.
at org.awaitility.core.ConditionAwaiter.await(ConditionAwaiter.java:167)
at org.awaitility.core.AssertionCondition.await(AssertionCondition.java:119)
at org.awaitility.core.AssertionCondition.await(AssertionCondition.java:31)
at org.awaitility.core.ConditionFactory.until(ConditionFactory.java:1006)
at org.awaitility.core.ConditionFactory.untilAsserted(ConditionFactory.java:790)
at io.quarkus.it.kafka.KafkaConnectorTest.testFruits(KafkaConnectorTest.java:63)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
:gear: JVM Tests - JDK 21
:package: integration-tests/reactive-messaging-kafka
✖ io.quarkus.it.kafka.KafkaConnectorTest.testFruits - History
Assertion condition defined as a Lambda expression in io.quarkus.it.kafka.KafkaConnectorTest expected: <6> but was: <5> within 10 seconds. - org.awaitility.core.ConditionTimeoutException
org.awaitility.core.ConditionTimeoutException: Assertion condition defined as a Lambda expression in io.quarkus.it.kafka.KafkaConnectorTest expected: <6> but was: <5> within 10 seconds.
at org.awaitility.core.ConditionAwaiter.await(ConditionAwaiter.java:167)
at org.awaitility.core.AssertionCondition.await(AssertionCondition.java:119)
at org.awaitility.core.AssertionCondition.await(AssertionCondition.java:31)
at org.awaitility.core.ConditionFactory.until(ConditionFactory.java:1006)
at org.awaitility.core.ConditionFactory.untilAsserted(ConditionFactory.java:790)
at io.quarkus.it.kafka.KafkaConnectorTest.testFruits(KafkaConnectorTest.java:63)
at java.base/java.lang.reflect.Method.invoke(Method.java:580)
Yeah, @gsmet is proposing reverting #40033 in the 3.11 branch - which would be done by opening a PR against that branch specifically
For what I have experimented these days, I don't think that a full revert will be necessary, but it would be sufficient to just remove the statement calling registerAsParallelCapable() on the RunnerClassLoader, anyway up to you.
Also, when all else is done, we'll need to squash the commits
Bumping this so it does not get lost :)
That's ok, I didn't forget it, I just wanted to finish discussing all the details with @franz1981 before squashing :)
I don't think that'd be sufficient; in addition to making the CL parallel-capable, it also removes locking on most of the class loading paths, which will continue to lead to the same issue. You'd have to also re-add the synchronization statements.
You can do it in a slightly more revertible way by doing things like this:
Class<?> loadClass(String name) throws ClassNotFoundException {
    // todo: remove
    if (! Thread.holdsLock(getClassLoadingLock(name))) {
        synchronized (getClassLoadingLock(name)) {
            return loadClass(name);
        }
    }
    // end todo
    // ... rest of the original loadClass body continues here ...
}
Then once the locking situation is fixed, just that block can be removed (instead of re-whitespacing the whole method).
I just created https://github.com/quarkusio/quarkus/pull/41120 for main.
Can you please squash the commits in this PR?
@franz1981 and I did some pair programming today, and we heavily reworked and simplified the lock-free algorithm. Now we are much more confident about both correctness and performance. I also rebased and squashed all commits as requested.
@geoand @dmlloyd @gsmet Please take a look. At this point I believe this is a better solution than reverting the original commit that introduced the parallel classloader.
Given that this is a non-trivial change, I think it would be best to merge the revert so we can ship the old behavior in 3.12 (the CR of which will be cut tomorrow), then review this further and merge it into main after 3.12 has been branched.
:waning_crescent_moon: This workflow status is outdated as a new workflow run has been triggered.
Status for workflow Quarkus CI
This is the status report for running Quarkus CI on commit cfe603048a1229dd409724f1dd236504193e88f3.
Failing Jobs
| Status | Name | Step | Failures | Logs | Raw logs | Build scan |
|---|---|---|---|---|---|---|
| ✖ | Native Tests - Data1 | Build | Failures | Logs | Raw logs | :construction: |
Failures
:gear: Native Tests - Data1
- Failing: integration-tests/jpa-mssql
:package: integration-tests/jpa-mssql
✖ Failed to execute goal io.fabric8:docker-maven-plugin:0.44.0:start (docker-start) on project quarkus-integration-test-jpa-mssql: I/O Error
This is now a super progressive algorithm... so progressive that it doesn't spin at all, but pays some memory footprint to avoid contention: i.e. two racing readers both make progress, but only one of the two wins the race to cache the opened jar file, while the other closes its own copy when it has finished reading.
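The "both make progress, loser closes its copy" pattern described above can be sketched roughly like this. This is a simplified illustration under my own assumptions (the loser here discards its copy immediately rather than after reading, and all names are hypothetical), not the PR's actual code:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Illustrative sketch of a no-spin racy open: racing readers each open their
// own instance; exactly one wins the CAS to cache it, and the loser closes
// its private copy instead of waiting.
final class RacyOpenSketch {
    static final AtomicInteger closedCount = new AtomicInteger();

    static final class Handle {
        // Stands in for closing a real ZipFile.
        void close() { closedCount.incrementAndGet(); }
    }

    static final AtomicReference<Handle> cached = new AtomicReference<>();

    static Handle acquire() {
        Handle existing = cached.get();
        if (existing != null) {
            return existing; // fast path: already cached
        }
        Handle fresh = new Handle(); // both racers may pay this allocation
        if (cached.compareAndSet(null, fresh)) {
            return fresh; // winner: its handle becomes the shared one
        }
        fresh.close(); // loser: discard the private copy, use the winner's
        return cached.get();
    }
}
```

No thread ever parks or spins; the worst case under contention is an extra open/close pair, which is the memory/allocation penalty discussed below.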
That sounds brilliant - what about allocations?
That sounds brilliant - what about allocations?
@Sanne that's why I asked you :)
If the problem of using locks (either synchronized or ReentrantLock ones) within "critical code paths" like this one (critical according to virtual threads, that is) needs to be fixed, there are not many ways to fix it here:
- always spin/yield-wait for whoever is in charge of reading the file, but that can cause some very bad spikes in CPU usage, which is not nice in the context of containers either
- have specialized behaviours based on the caller thread, i.e. virtual threads can waste more memory during a racy read, while platform threads can just "join" the current race and wait - but this is not what the current PR does
Let's remember that if the current thread is a virtual thread, we're not allowed, within a call stack containing defineClass, to perform any form of park/unpark (which is what locks would do - and the same goes for CompletableFuture, too).
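The second option above (specializing by caller thread type) could be dispatched on with something like the following. This is a hedged sketch of that alternative, not what the PR does; the method name is hypothetical:

```java
// Sketch: route platform threads to a blocking path and virtual threads to
// the allocating, non-blocking path, since a virtual thread must not park
// inside a defineClass call stack.
final class CallerAwarePolicy {
    static boolean mayBlockHere() {
        // Thread.isVirtual() is available since JDK 21 (preview in 19).
        return !Thread.currentThread().isVirtual();
    }
}
```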
Being pragmatic, the expectation is that there shouldn't be many parallel class loading events (fingers crossed).
@franz1981 did you check that it doesn't have any unexpected adverse effect on startup performance?
Being pragmatic, the expectation is that there shouldn't be many parallel class loading events (fingers crossed).
I agree with that - still we want to be able to load a burst of classes sequentially without requiring a lot of additional memory: think of bootstrap and the challenge to not run out of memory during the initial bootstrap, which could be memory intensive for other reasons, in a constrained environment.
It does us no good to be able to perform well with 50MB of ram if we need significantly more to initialize.
did you check that it doesn't have any unexpected adverse effect on startup performance?
Exactly what I'm getting at. The parallel-collaborative algorithm sounds great in theory, but we need to check if it fits our simple needs here.
I'm wondering even if the "loom problem" should be relegated to a different path: we don't boot on Loom, perhaps we could switch approach later. But of course no need to complicate things further if the current proposal works well.
Exactly what I'm getting at. The parallel-collaborative algorithm sounds great in theory, but we need to check if it fits our simple needs here.
Let's note that even the move to a parallel class loader here (right before this PR) was merged without being proved free of adverse startup effects, so we would need to check that one first, too.
We know the algorithm of this PR, and if there is a storm of racing loads of big jars, there will be a memory penalty, period (which is good). We could find a "workaround" to wait in a way that keeps JFR happy too (e.g. using synchronized while loading the jar, if racy?), but this makes me wonder: was there a legitimate problem to start with?
The previous behaviour (non-parallel class loader) just made the JVM wrap every potential I/O call with its own locks - like a synchronized block (which basically pins the carrier(s) and passes undetected); see https://github.com/openjdk/jdk/blob/jdk-21%2B35/src/hotspot/share/classfile/systemDictionary.cpp#L606, which creates an ObjectLocker out of the class loader instance. The fact that the JVM wasn't considering this to be pinning behaviour is a riddle to me - and probably that's the real problem here:
- it would be good to reach out to some Loom folks and understand why non-parallel class loaders are allowed to pin threads using synchronized
- this is especially true because there has recently been an improvement to synchronized which allows virtual threads to behave with it similarly as with reentrant locks: which makes me wonder whether the non-parallel class loader was indeed doing something that will break in the future, or whether the definition of "pinning" will simply disappear once that change lands - making the current issue just a temporary problem
So personally I would like to clarify why the JVM thinks the current behaviour (with ReentrantLock) is worse than the synchronized one used by the JVM itself; it could be a JFR bug (!) too.
but this make me think, there was a legit problem or not, to start with?
I'm wondering the same. Loading a class is inherently a blocking I/O operation which needs to complete before the program can proceed, so I'd argue that it's fine to block/pin in this case, especially as there are strong guarantees that any particular class is loaded only once. I suspect that's why it was classified as "OK" within the JVM, but I don't know that for sure.
We know the algorithm of this pr and if there is a storm of big jars loads which race, there will be a memory penalty, period (which is good).
Why would you say it's good? Personally I'd rather constrain its throughput and have it reuse memory regions rather than trigger a spike.
I meant that it is "good" to know this upfront: it means the algorithm has predictable performance effects, which can be fixed or simply accepted. I think it is very feasible to use synchronized to block while racy (per in-flight read); I just wonder whether it would still make JFR complain for no reason, as it was doing before this change, or in the future, when synchronized will work like reentrant locks.