Replace read/write lock in JarResource to avoid virtual threads pinning
This is a proposed solution for the problem reported here.
I'm avoiding blocking on the ReadWriteLock and emulating its behavior with a state machine plus an atomic readers counter, because the zipFile access may happen inside a native call (for example, one triggered by the RunnerClassLoader), and in that case it is necessary to avoid blocking.
Note that while this avoids blocking and thus prevents pinning, I'm afraid the solution could suffer from monopolization. I tried to mitigate the problem with the yields at lines 160 and 183, but I'm not sure whether this is a good idea, especially with virtual threads. Any feedback or suggestions for improvement are welcome.
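To make the idea above concrete, here is a minimal sketch of a non-blocking gate built from a state machine and an atomic readers counter. All names and states here are illustrative assumptions, not the PR's actual code; the point is only that a reader can enter or bail out without ever parking the thread.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the "state machine + atomic readers counter" idea:
// a reader either enters the open resource or fails fast, never blocking.
final class NonBlockingGate {
    static final int CLOSED = 0, OPENING = 1, OPEN = 2, CLOSING = 3;

    final AtomicInteger state = new AtomicInteger(CLOSED);
    final AtomicInteger readers = new AtomicInteger();

    // Try to enter as a reader without ever parking the thread.
    boolean tryStartRead() {
        if (state.get() != OPEN) {
            return false; // caller may retry/yield instead of blocking
        }
        readers.incrementAndGet();
        // Re-check: a concurrent close may have started meanwhile.
        if (state.get() != OPEN) {
            readers.decrementAndGet();
            return false;
        }
        return true;
    }

    void endRead() {
        readers.decrementAndGet();
    }
}
```

A caller that gets `false` would yield and retry (which is exactly where the monopolization concern mentioned above comes from).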
/cc @franz1981 @geoand @Sanne @dmlloyd @cescoffier
Thanks for your pull request!
The title of your pull request does not follow our editorial rules. Could you have a look?
- title should preferably start with an uppercase character (if it makes sense!)
This message is automatically generated by a bot.
:waning_crescent_moon: This workflow status is outdated as a new workflow run has been triggered.
Status for workflow Quarkus CI
This is the status report for running Quarkus CI on commit c2c8c189be75ce3b30c016ad0d3d8e9e7666cf7a.
Failing Jobs
| Status | Name | Step | Failures | Logs | Raw logs | Build scan |
|---|---|---|---|---|---|---|
| ✖ | Initial JDK 17 Build | Build | Failures | Logs | Raw logs | :mag: |
You can consult the Develocity build scans.
Failures
:gear: Initial JDK 17 Build
- Failing: independent-projects/bootstrap/runner
! Skipped: core/deployment core/runtime devtools/bom-descriptor-json and 844 more
:package: independent-projects/bootstrap/runner
✖ Failed to execute goal net.revelc.code.formatter:formatter-maven-plugin:2.24.0:validate (default) on project quarkus-bootstrap-runner: File '/home/runner/work/quarkus/quarkus/independent-projects/bootstrap/runner/src/main/java/io/quarkus/bootstrap/runner/JarResource.java' has not been previously formatted. Please format file (for example by invoking `mvn -f independent-projects/bootstrap/runner net.revelc.code.formatter:formatter-maven-plugin:2.24.0:format`) and commit before running validation!
✖ Failed to execute goal net.revelc.code.formatter:formatter-maven-plugin:2.24.0:validate (default) on project quarkus-bootstrap-runner: File '/home/runner/work/quarkus/quarkus/independent-projects/bootstrap/runner/src/main/java/io/quarkus/bootstrap/runner/JarResource.java' has not been previously formatted. Please format file (for example by invoking `mvn -f independent-projects/bootstrap/runner net.revelc.code.formatter:formatter-maven-plugin:2.24.0:format`) and commit before running validation!
I tried to format the code using that maven command, but when I run it, it doesn't change anything in the java source (or in any other file). Am I missing something?
I would make a separate RefCntResourceFile class which contains a volatile field pointing to the file to create and an atomic ref count to handle ownership and, eventually, the close (the release which puts the ref count back to zero gets the right to close the file).
In the JarResource I would have an AtomicReference<RefCntResourceFile> which can be in the states null <-> not null; because RefCntResourceFile is a new instance on each opening transition, this avoids ABA issues.
The algorithm can be made more cooperative and progressive by allowing contended operations to "help": e.g. concurrent opens can compete to populate the volatile field in the currently set RefCntResourceFile, avoiding spin waits at the cost of allocating more memory.
Similarly, a close doesn't need to wait until all readers are gone, but can perform a RefCntResourceFile::release which, if it returns true, will attempt to move the shared AtomicReference<RefCntResourceFile> to null; if it returns false, the close will likely be performed by the last reader when it finishes reading (each read must be surrounded by a retain and release on the current RefCntResourceFile).
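The retain/release protocol suggested above can be sketched as follows. The class and method names (RefCntResourceFile, retain, release) are taken from the suggestion but the body is an illustrative assumption, not the actual Quarkus implementation:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Illustrative sketch of the suggested reference-counting protocol.
final class RefCntResourceFile {
    // Stands in for the lazily opened ZipFile; populated cooperatively.
    volatile Object file;
    // Starts at 1: the "owner" reference held by the enclosing JarResource.
    private final AtomicInteger refCount = new AtomicInteger(1);

    // Returns false if the resource is already closed (count reached 0),
    // so a reader knows it must fetch or create a fresh instance.
    boolean retain() {
        for (;;) {
            int current = refCount.get();
            if (current == 0) {
                return false; // closed for good; cannot be revived (no ABA)
            }
            if (refCount.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    // Returns true only for the caller that moved the count to 0,
    // i.e. the one allowed to actually close the underlying file.
    boolean release() {
        return refCount.decrementAndGet() == 0;
    }
}

final class JarResourceSketch {
    // null <-> non-null; a fresh RefCntResourceFile per opening transition
    // makes the CAS transitions immune to ABA issues.
    final AtomicReference<RefCntResourceFile> jarFile = new AtomicReference<>();
}
```

Each read is bracketed by `retain()`/`release()`, and whichever party (reader or closer) drops the count to zero performs the real close.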
✖ Failed to execute goal net.revelc.code.formatter:formatter-maven-plugin:2.24.0:validate (default) on project quarkus-bootstrap-runner: File '/home/runner/work/quarkus/quarkus/independent-projects/bootstrap/runner/src/main/java/io/quarkus/bootstrap/runner/JarResource.java' has not been previously formatted. Please format file (for example by invoking `mvn -f independent-projects/bootstrap/runner net.revelc.code.formatter:formatter-maven-plugin:2.24.0:format`) and commit before running validation!
I tried to format the code using that maven command, but when I run it, it doesn't change anything in the java source (or in any other file). Am I missing something?
https://quarkusio.zulipchat.com/#narrow/stream/187038-dev/topic/source.20code.20not.20formatted seems to be relevant
I reworked this pull request following @franz1981's suggestions. Please review again.
@franz1981 I tried to follow all your suggestions, and I believe I reached a far better level of encapsulation in the JarFileReference with the introduction of a non-capturing lambda whose purpose is to consume the jarFile, hiding all the complexity of retrieving a valid, non-closed instance and keeping track of the readers count. The only thing I haven't been able to remove is the need for that boolean releasing, and I tried to explain why I believe it is still necessary. Please review again :pray:
Also, when all else is done, we'll need to squash the commits
@franz1981 I think I've finally understood what you meant (I'm getting old) and implemented the reference counter accordingly. Please review again.
Given the amount of work and discussions related to this issue, could we have a simple revert PR targeting 3.11?
Whatever gets into 3.12 wouldn't be backportable.
Thanks.
Given the amount of work and discussions related to this issue, could we have a simple revert PR targeting 3.11?
+1
Given the amount of work and discussions related to this issue, could we have a simple revert PR targeting 3.11?
Whatever gets into 3.12 wouldn't be backportable.
Do you mean we will have a 3.11.1 patch release where we simply revert the use of registerAsParallelCapable() on the RunnerClassLoader, while this fix (when it is ready, and we should be almost there) goes into 3.12.0? If so, that's ok for me.
Yeah, @gsmet is proposing reverting #40033 in the 3.11 branch - which would be done by opening a PR against that branch specifically
Also, when all else is done, we'll need to squash the commits
Bumping this so it does not get lost :)
Yeah, @gsmet is proposing reverting https://github.com/quarkusio/quarkus/pull/40033 in the 3.11 branch - which would be done by opening a PR against that branch specifically
From what I have experimented with these days, I don't think that a full revert will be necessary; it would be sufficient to just remove the statement calling registerAsParallelCapable() on the RunnerClassLoader. Anyway, up to you.
Also, when all else is done, we'll need to squash the commits
Bumping this so it does not get lost :)
That's ok, I didn't forget it, I just wanted to finish discussing all the details with @franz1981 before squashing :)
From what I have experimented with these days, I don't think that a full revert will be necessary; it would be sufficient to just remove the statement calling registerAsParallelCapable() on the RunnerClassLoader. Anyway, up to you.
Yeah, that's almost certainly the case, but I would say that it's safest to go back to the exact previous state if we are not going to include this
:waning_crescent_moon: This workflow status is outdated as a new workflow run has been triggered.
Status for workflow Quarkus CI
This is the status report for running Quarkus CI on commit 198ff4941fccc82d793fe7829a6c7609bbfe9bc5.
Failing Jobs
| Status | Name | Step | Failures | Logs | Raw logs | Build scan |
|---|---|---|---|---|---|---|
| ✖ | Kubernetes Tests - JDK 17 | Build | :warning: Check → | Logs | Raw logs | :construction: |
| :heavy_check_mark: | Kubernetes Tests - JDK 21 | | | Logs | Raw logs | :mag: |
| :heavy_check_mark: | Kubernetes Tests - JDK 17 Windows | | | Logs | Raw logs | :mag: |
You can consult the Develocity build scans.
Flaky tests - Develocity
:gear: JVM Tests - JDK 17
:package: integration-tests/reactive-messaging-kafka
✖ io.quarkus.it.kafka.KafkaConnectorTest.testFruits - History
Assertion condition defined as a Lambda expression in io.quarkus.it.kafka.KafkaConnectorTest expected: <6> but was: <5> within 10 seconds. - org.awaitility.core.ConditionTimeoutException
org.awaitility.core.ConditionTimeoutException: Assertion condition defined as a Lambda expression in io.quarkus.it.kafka.KafkaConnectorTest expected: <6> but was: <5> within 10 seconds.
at org.awaitility.core.ConditionAwaiter.await(ConditionAwaiter.java:167)
at org.awaitility.core.AssertionCondition.await(AssertionCondition.java:119)
at org.awaitility.core.AssertionCondition.await(AssertionCondition.java:31)
at org.awaitility.core.ConditionFactory.until(ConditionFactory.java:1006)
at org.awaitility.core.ConditionFactory.untilAsserted(ConditionFactory.java:790)
at io.quarkus.it.kafka.KafkaConnectorTest.testFruits(KafkaConnectorTest.java:63)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
:gear: JVM Tests - JDK 21
:package: integration-tests/reactive-messaging-kafka
✖ io.quarkus.it.kafka.KafkaConnectorTest.testFruits - History
Assertion condition defined as a Lambda expression in io.quarkus.it.kafka.KafkaConnectorTest expected: <6> but was: <5> within 10 seconds. - org.awaitility.core.ConditionTimeoutException
org.awaitility.core.ConditionTimeoutException: Assertion condition defined as a Lambda expression in io.quarkus.it.kafka.KafkaConnectorTest expected: <6> but was: <5> within 10 seconds.
at org.awaitility.core.ConditionAwaiter.await(ConditionAwaiter.java:167)
at org.awaitility.core.AssertionCondition.await(AssertionCondition.java:119)
at org.awaitility.core.AssertionCondition.await(AssertionCondition.java:31)
at org.awaitility.core.ConditionFactory.until(ConditionFactory.java:1006)
at org.awaitility.core.ConditionFactory.untilAsserted(ConditionFactory.java:790)
at io.quarkus.it.kafka.KafkaConnectorTest.testFruits(KafkaConnectorTest.java:63)
at java.base/java.lang.reflect.Method.invoke(Method.java:580)
Yeah, @gsmet is proposing reverting #40033 in the 3.11 branch - which would be done by opening a PR against that branch specifically
For what I have experimented these days, I don't think that a full revert will be necessary, but it would be sufficient to just remove the statement calling registerAsParallelCapable() on the RunnerClassLoader, anyway up to you.
Also, when all else is done, we'll need to squash the commits
Bumping this so it does not get lost :)
That's ok, I didn't forget it, I just wanted to finish discussing all the details with @franz1981 before squashing :)
I don't think that'd be sufficient; in addition to making the CL parallel-capable, it also removes locking on most of the class loading paths, which will continue to lead to the same issue. You'd have to also re-add the synchronization statements.
You can do it in a slightly more revertible way by doing things like this:
Class<?> loadClass(String name) throws ClassNotFoundException {
    // todo: remove
    if (! Thread.holdsLock(getClassLoadingLock(name))) {
        synchronized (getClassLoadingLock(name)) {
            return loadClass(name);
        }
    }
    // end todo
    // ... rest of the original loadClass body continues here ...
}
Then once the locking situation is fixed, just that block can be removed (instead of re-whitespacing the whole method).
I just created https://github.com/quarkusio/quarkus/pull/41120 for main.
Can you please squash the commits in this PR?
@franz1981 and I did some pair programming today, and we heavily reworked and simplified the lock-free algorithm. Now we are much more confident about both correctness and performance. I also rebased and squashed all commits as requested.
@geoand @dmlloyd @gsmet Please take a look. At this point I believe this is a better solution than reverting the original commit that introduced the parallel classloader.
Given that this is a non-trivial change, I think it would be best to merge the revert so we can ship the old behavior in 3.12 (the CR of which will be cut tomorrow), then review this further and merge it into main after 3.12 has been branched.
:waning_crescent_moon: This workflow status is outdated as a new workflow run has been triggered.
Status for workflow Quarkus CI
This is the status report for running Quarkus CI on commit cfe603048a1229dd409724f1dd236504193e88f3.
Failing Jobs
| Status | Name | Step | Failures | Logs | Raw logs | Build scan |
|---|---|---|---|---|---|---|
| ✖ | Native Tests - Data1 | Build | Failures | Logs | Raw logs | :construction: |
Failures
:gear: Native Tests - Data1
- Failing: integration-tests/jpa-mssql
:package: integration-tests/jpa-mssql
✖ Failed to execute goal io.fabric8:docker-maven-plugin:0.44.0:start (docker-start) on project quarkus-integration-test-jpa-mssql: I/O Error
This is now a super progressive algorithm... so progressive that it doesn't spin at all, but pays some memory footprint to avoid contention: i.e. two racing readers both make progress, but only one of the two wins the race to cache the opened jar file, while the other closes its own copy when it has finished reading.
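The "both make progress, loser closes its copy" pattern described above can be sketched roughly like this. This is a simplified illustration under my own assumptions (the loser here discards its copy immediately rather than after reading, and all names are hypothetical), not the PR's actual code:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Illustrative sketch of a no-spin racy open: racing readers each open their
// own instance; exactly one wins the CAS to cache it, and the loser closes
// its private copy instead of waiting.
final class RacyOpenSketch {
    static final AtomicInteger closedCount = new AtomicInteger();

    static final class Handle {
        // Stands in for closing a real ZipFile.
        void close() { closedCount.incrementAndGet(); }
    }

    static final AtomicReference<Handle> cached = new AtomicReference<>();

    static Handle acquire() {
        Handle existing = cached.get();
        if (existing != null) {
            return existing; // fast path: already cached
        }
        Handle fresh = new Handle(); // both racers may pay this allocation
        if (cached.compareAndSet(null, fresh)) {
            return fresh; // winner: its handle becomes the shared one
        }
        fresh.close(); // loser: discard the private copy, use the winner's
        return cached.get();
    }
}
```

No thread ever parks or spins; the worst case under contention is an extra open/close pair, which is the memory/allocation penalty discussed below.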
That sounds brilliant - what about allocations?
That sounds brilliant - what about allocations?
@Sanne that's why I asked you :)
If the problem of using locks (either synchronized or ReentrantLock ones) within "critical code paths" like this one (critical according to virtual threads, that is) needs to be fixed, there are not many ways to fix it here:
- always spin/yield-wait for whoever is in charge of reading the file, but that can cause some very bad spikes in CPU usage, which is not nice in the context of containers either
- have specialized behaviours based on the caller thread, i.e. virtual threads can waste more memory during a racy read, while platform threads can just "join" the current race and wait - but this is not what the current PR does
Let's remember that if the current thread is a virtual thread, we're not allowed, within a call stack containing defineClass, to perform any form of park/unpark (which is what locks would do - and the same goes for CompletableFuture, too).
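The second option above (specializing by caller thread type) could be dispatched on with something like the following. This is a hedged sketch of that alternative, not what the PR does; the method name is hypothetical:

```java
// Sketch: route platform threads to a blocking path and virtual threads to
// the allocating, non-blocking path, since a virtual thread must not park
// inside a defineClass call stack.
final class CallerAwarePolicy {
    static boolean mayBlockHere() {
        // Thread.isVirtual() is available since JDK 21 (preview in 19).
        return !Thread.currentThread().isVirtual();
    }
}
```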
Being pragmatic, the expectation is that there shouldn't be many parallel class loading events (fingers crossed).
@franz1981 did you check that it doesn't have any unexpected adverse effect on startup performance?
Being pragmatic, the expectation is that there shouldn't be many parallel class loading events (fingers crossed).
I agree with that - still we want to be able to load a burst of classes sequentially without requiring a lot of additional memory: think of bootstrap and the challenge to not run out of memory during the initial bootstrap, which could be memory intensive for other reasons, in a constrained environment.
It does us no good to be able to perform well with 50MB of ram if we need significantly more to initialize.
did you check that it doesn't have any unexpected adverse effect on startup performance?
Exactly what I'm getting at. The parallel-collaborative algorithm sounds great in theory, but we need to check if it fits our simple needs here.
I'm wondering even if the "loom problem" should be relegated to a different path: we don't boot on Loom, perhaps we could switch approach later. But of course no need to complicate things further if the current proposal works well.
Exactly what I'm getting at. The parallel-collaborative algorithm sounds great in theory, but we need to check if it fits our simple needs here.
Let's note that even the move to a parallel class loader here (right before this PR) was merged without being proved free of adverse startup effects, so we would need to check that one first, too.
We know the algorithm of this PR, and if there is a storm of racing loads of big jars, there will be a memory penalty, period (which is good). We could find a "workaround" to wait in a way that keeps JFR happy too (e.g. using synchronized while loading the jar, if racy?), but this makes me wonder: was there a legitimate problem to start with?
The previous behaviour (non-parallel class loader) just made the JVM wrap every potential I/O call with its own locks - like a synchronized block (which basically pins the carrier(s) and passes undetected); see https://github.com/openjdk/jdk/blob/jdk-21%2B35/src/hotspot/share/classfile/systemDictionary.cpp#L606, which creates an ObjectLocker out of the class loader instance. The fact that the JVM wasn't considering this to be pinning behaviour is a riddle to me - and probably that's the real problem here:
- it would be good to reach out to some Loom folks and understand why non-parallel class loaders are allowed to pin threads using synchronized
- this is especially true because there has recently been an improvement to synchronized which allows virtual threads to behave with it similarly as with reentrant locks: which makes me wonder whether the non-parallel class loader was indeed doing something that will break in the future, or whether the definition of "pinning" will simply disappear once that change lands - making the current issue just a temporary problem
So personally I would like to clarify why the JVM thinks the current behaviour (with ReentrantLock) is worse than the synchronized one used by the JVM itself; it could be a JFR bug (!) too.
but this make me think, there was a legit problem or not, to start with?
I'm wondering the same. Loading a class is inherently a blocking I/O operation which needs to complete before the program can proceed, so I'd argue that it's fine to block/pin in this case, especially as there are strong guarantees that any particular class is loaded only once. I suspect that's why it was classified as "OK" within the JVM, but I don't know that for sure.
We know the algorithm of this pr and if there is a storm of big jars loads which race, there will be a memory penalty, period (which is good).
Why would you say it's good? Personally I'd rather constrain its throughput and have it reuse memory regions rather than trigger a spike.
I meant that it is "good" to know this upfront: it means the algorithm has predictable performance effects, which can be fixed or simply accepted. I think it is very feasible to use synchronized to block while racy (per in-flight read); I just wonder whether it would still make JFR complain for no reason, as it was doing before this change, or in the future, when synchronized will work like reentrant locks.