quarkus icon indicating copy to clipboard operation
quarkus copied to clipboard

Fix OpenTelemetry context resolution when HTTP Vert.x proactive authentication is enabled

Open michalvavrik opened this issue 2 months ago • 38 comments

What this PR does

  • fixes https://github.com/quarkusio/quarkus/issues/49468
  • tiny refactoring: I mentioned that anonymous identity is not memoized and when we use deferred identity elsewhere in a code, this code every time calls identity manager, resolves the anonymous identity providers and apply augmentors. Now, the resolved anonymous identity is used instead of calling the identity manager multiple times.

Tests (or a reproducer?)

You can use the reproducer attached to the linked issue to test this PR, because it only happens with certain level of concurrency and takes time. What I used is this:

git clone [email protected]:michalvavrik/eager-security-slow-context-activation-reproducer.git
cd eager-security-slow-context-activation-reproducer
mvn clean install
java -jar target/quarkus-app/quarkus-run.jar  2>&1 | tee log
# other window
seq 1 100000 | xargs -n1 -P1000 -I{} curl -v http://localhost:8080/reproducer
# look for "Baggage is missing first value" log message

Explanation of the change

In https://github.com/quarkusio/quarkus/issues/49468, OpenTelemetry contextual data are lost because previous context is incorrectly determined. There is a race, caused by the fact that https://github.com/quarkusio/quarkus/blob/2ce3255df44644a66ffad4b5786d69282bcedbd0/extensions/opentelemetry/runtime/src/main/java/io/quarkus/opentelemetry/runtime/propagation/OpenTelemetryMpContextPropagationProvider.java#L32 invoked by io.smallrye.context.impl.SlowActiveContextState.close(SlowActiveContextState.java:41) is called when other code is using the same duplicated context (you can see in in the text file key-diff that I put into my reproducer). I think it happens in the moment when threads are switched to the worker thread, but it is very hard to debug, I am not sure where in Quarkus REST it happens. However this PR fixes principle, and not one scenario:

  1. https://github.com/quarkusio/quarkus/blob/2ce3255df44644a66ffad4b5786d69282bcedbd0/extensions/vertx-http/runtime/src/main/java/io/quarkus/vertx/http/runtime/security/HttpSecurityRecorder.java#L406 is called from a subscriber onItem wrapper by SR Context Mutiny interceptor
  2. other route handlers run until synchronous code ends and there is a context switch
  3. Uni subscriber callbacks like onItem, OnFailure (onCompletion for publisher etc.) are wrapped by Mutiny interceptor https://github.com/smallrye/smallrye-mutiny/blob/main/context-propagation/src/main/java/io/smallrye/mutiny/context/DefaultContextPropagationInterceptor.java
  4. SR Context propagation provides https://github.com/smallrye/smallrye-context-propagation/blob/main/core/src/main/java/io/smallrye/context/SmallRyeThreadContext.java
  5. SR Ctx propagation is transitive dep of quarkus-mutiny which is transitive dep of quarkus-vertx
  6. Otel defines its own ThreadContextProvider called https://github.com/quarkusio/quarkus/blob/main/extensions/opentelemetry/runtime/src/main/java/io/quarkus/opentelemetry/runtime/propagation/OpenTelemetryMpContextPropagationProvider.java
  7. Then OpenTelemetryMpContextPropagationProvider endContext can be called while Quarkus REST endpoint blocking method is executed and it is all about timing, when it happens, it switches to "prior" OTel context stored on the Vert.x duplicated context is used concurrently

My conclusion is that the fact that the same Vert.x duplicated context can be used concurrently is a bug, but we can easily avoid that as far as Vert.x HTTP Security goes by scheduling routingContext.next() once the thread local is not used. But I'd like to find more efficient way to avoid the "last" callback which needs to happen without SR Ctx propagation, something like switching context for one subscriber.

Test added in this PR to reproduce the issue

Without the fix, the test fails on my laptop most of the time (only way how to make it always fail would be to raise concurrency and number of requests, which however would stretch CI). However if the added fix, following command succeeds:

for i in {1..100}; do
  echo "--- Run $i of 100 ---"
  mvn clean verify -Dtest=OpenTelemetryContextPropagationTest
  if [ $? -ne 0 ]; then
    echo "Command failed on run $i. Stopping loop."
    break
  fi
done

Which should verify the fix added in this PR.

michalvavrik avatar Sep 27 '25 19:09 michalvavrik


:waning_crescent_moon: This workflow status is outdated as a new workflow run has been triggered.


Status for workflow Quarkus CI

This is the status report for running Quarkus CI on commit c7b1533dfc2e499a7f1ccdd61a7f644e20f12665.

:white_check_mark: The latest workflow run for the pull request has completed successfully.

It should be safe to merge provided you have a look at the other checks in the summary.

You can consult the Develocity build scans.


Flaky tests - Develocity

:gear: JVM Tests - JDK 21

:package: extensions/smallrye-openapi/deployment

:x: io.quarkus.smallrye.openapi.test.vertx.OpenApiHttpRootPathCorsTestCase.testCorsFilterProperties - History

  • 1 expectation failed. Expected status code <200> but was <500>. - java.lang.AssertionError
java.lang.AssertionError: 
1 expectation failed.
Expected status code <200> but was <500>.

	at java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:62)
	at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:502)
	at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:486)
	at org.codehaus.groovy.reflection.CachedConstructor.invoke(CachedConstructor.java:73)

quarkus-bot[bot] avatar Sep 27 '25 22:09 quarkus-bot[bot]

I actually have fix proposal (not sure if acceptable) that touches same code, I'll just re-use this PR and switch reviewers. It will require Julian or Clement.

michalvavrik avatar Sep 28 '25 09:09 michalvavrik

Recap: this definitely fixes the issue (cannot reproduce it), but I doubt this will get accepted. Nevertheless, I am not sure what else to do, so I'll wait for CI and then point to this PR in the linked issue and open discussion.

I will disable one test because can't figure what is expected behavior, the fact that multi operator demand has risen seems like a legit issue, but I lack context and I'd like to see if there are other tests failing.

michalvavrik avatar Sep 28 '25 13:09 michalvavrik

CI looked fine but it starts from scratch now that I marked this PR as ready :-).

Hello @jponge and @cescoffier,

with changes in this PR, the https://github.com/quarkusio/quarkus/issues/49468 cannot be reproducer. I think that the moment when onItem Mutiny interceptors (decorators wrapping the onItem callback) continue executing while the same duplicated context is used by other thread is when there is a switch to the worker thread here:

  • https://github.com/quarkusio/quarkus/blob/2ce3255df44644a66ffad4b5786d69282bcedbd0/independent-projects/resteasy-reactive/server/runtime/src/main/java/org/jboss/resteasy/reactive/server/handlers/BlockingHandler.java#L28

I tried to verify this theory by replacing following code:

  • https://github.com/quarkusio/quarkus/blob/6de1108de66a528328c0b4de6bf1f30f825880c4/extensions/resteasy-reactive/rest/runtime/src/main/java/io/quarkus/resteasy/reactive/server/runtime/ResteasyReactiveRecorder.java#L95

with something like this:

public static final Supplier<Executor> EXECUTOR_SUPPLIER = new Supplier<>() {
        @Override
        public Executor get() {
            return new Executor() {
                @Override
                public void execute(@NotNull Runnable command) {
                    if (!BlockingOperationControl.isBlockingAllowed() && VertxContext.isOnDuplicatedContext()) {
                        var ctx = Vertx.currentContext();
                        ctx.runOnContext(new Handler<Void>() {
                            @Override
                            public void handle(Void unused) {
                                ctx.executeBlocking(new Callable<Object>() {
                                    @Override
                                    public Object call() throws Exception {
                                        command.run();
                                        return null;
                                    }
                                });
                            }
                        });
                    } else {
                        ExecutorRecorder.getCurrent().execute(command);
                    }
                }
            };
        }
    };

and the linked issue was gone, but it had other problems (the io.quarkus.resteasy.reactive.server.test.RequestLeakDetectionTest#testWithConcurrentBlockingCalls failed and I did not investigate further).

Could you kindly look at the code of this PR, linked description and suggest if you can, what is core issue, how far off is this PR and what would be proper way to deal with this situation? I understand it could take a long time, just put me in the right direction. Thanks

Also cc @geoand as it concerns Quarkus REST.

michalvavrik avatar Sep 28 '25 16:09 michalvavrik


:waning_crescent_moon: This workflow status is outdated as a new workflow run has been triggered.


Status for workflow Quarkus CI

This is the status report for running Quarkus CI on commit e69022faab0018a07c33f4f41865abedca55034e.

:white_check_mark: The latest workflow run for the pull request has completed successfully.

It should be safe to merge provided you have a look at the other checks in the summary.

You can consult the Develocity build scans.


Flaky tests - Develocity

:gear: JVM Tests - JDK 17

:package: extensions/smallrye-reactive-messaging/deployment

:x: io.quarkus.smallrye.reactivemessaging.hotreload.ConnectorChangeTest.testUpdatingConnector - History

  • Expecting actual: ["-6","-7","-9","-10","-11","-12","-13","-14"] to start with: ["-6", "-7", "-8", "-9"] - java.lang.AssertionError
java.lang.AssertionError: 

Expecting actual:
  ["-6","-7","-9","-10","-11","-12","-13","-14"]
to start with:
  ["-6", "-7", "-8", "-9"]

	at io.quarkus.smallrye.reactivemessaging.hotreload.ConnectorChangeTest.testUpdatingConnector(ConnectorChangeTest.java:41)

quarkus-bot[bot] avatar Sep 28 '25 19:09 quarkus-bot[bot]

I actually think this one is also related: https://github.com/quarkusio/quarkus/issues/40918

Because I use Quarkus Security as well and have seen both the error mentioned in this ticket and the other one

melloware avatar Sep 28 '25 19:09 melloware

It would be nice if we had a test for the reported condition.

brunobat avatar Sep 29 '25 10:09 brunobat

It would be nice if we had a test for the reported condition.

You are right. I don't think it will be a problem to create a test that reproduces the behavior for my workstation, I just wasn't sure how reliable it will be for slower machines etc. Anyway, no worry, once I have signal that this could be even remotely acceptable, I'll add test.

My guess is that re-scheduling same HTTP request when the Quarkus Security is present just because of this will not be acceptable. I kinda hope @jponge will put me in a direction how to combine contexts (one empty without SR Ctx propagation and other with it)....

michalvavrik avatar Sep 29 '25 10:09 michalvavrik

@michalvavrik you might confuse me with @vietj who is the Vert.x maintainer

jponge avatar Sep 30 '25 19:09 jponge

@michalvavrik you might confuse me with @vietj who is the Vert.x maintainer

Not really, but maybe I evaluated wrongly what I need? I thought the easiest way to fix this concurrency issue is to simply call routingContext.next() after Mutiny subscriber has completed because if the callback only does rc.next() I don't really need a propagation context for it, just to run on the right duplicated context. But the Mutiny subscriber callbacks are being intercepted by Mutiny interceptor provided by a SmallRye Context Propagation.

To put it simple, I don't see a way to avoid io.smallrye.mutiny.infrastructure.Infrastructure#UNI_INTERCEPTORS being applied on a Mutiny completion event. I thought you may have a thoughts because what we do in HttpSecurityRecorder must be a common use case. If not, no problem and thank you, I'll find another way how to fix it.

michalvavrik avatar Sep 30 '25 19:09 michalvavrik

I'm not the best expert around that particular area (SR context propagation) but I'll try to have a look next week

jponge avatar Sep 30 '25 21:09 jponge

I'm not the best expert around that particular area (SR context propagation) but I'll try to have a look next week

Thanks, I'd really appreciate to hear your thoughts, you don't need to solve it, just put me in the right direction :milky_way:

michalvavrik avatar Oct 02 '25 10:10 michalvavrik

I eventually found this PR from @michalvavrik, while trying to find out if there are some security issues opened related to closing the OpenTelemetry scope warnings, so this PR looks relevant :-)

sberyozkin avatar Oct 08 '25 15:10 sberyozkin

I'm not the best expert around that particular area (SR context propagation) but I'll try to have a look next week

I checked today how SR Ctx propagation works and I think we could apply wrappers that only use correct context if some flag is set to true and leverage try(CleanAutoCloseable ac = SmallRyeThreadContext.withThreadContext(threadContext)) and ThreadContext.ALL_REMAINING, but this situation seem to me like a Mutiny limitation:

  1. we need the context propagation for normal callbacks like onItem transformation (that is, on the same Uni)
  2. but if the only thing that is happening in the subscriber onItem (I am referring to the last onItem) is routingContext.next(), I need no CDI or Otel for that callback, the only thing that I need is that routingContext.next() is invoked on the correct duplicated context.

But if you think it is the context propagation issue, I can ask around. @FroMage and @manovotn worked on it, maybe you can give me a hint on what should I look into if I want to avoid the context propagation only for certain mutiny callbacks (those that use routingContext.next()), please?

michalvavrik avatar Oct 11 '25 14:10 michalvavrik

It looks like you're saying that the problem is that we can call routingContext.next() from the wrong duplicated Vert.x context?

If yes, then indeed this has nothing to do with MP-CP, which does not deal with Vert.x context switching. This is a Vert.x question. Probably @cescoffier will have your answer.

FroMage avatar Oct 13 '25 13:10 FroMage

It looks like you're saying that the problem is that we can call routingContext.next() from the wrong duplicated Vert.x context?

@FroMage I know there is a lot of description in this PR so let me just highlight what I personally think is an issue:

  1. Vert.x HTTP authentication handler needs to subscribe to Uni to resolve SecurityIdentity and then call routingContext.next() in the subscribed onItem
  2. the onItem callback is wrapper by MutinyInterceptors which wraps all callbacks with SmallRye Context propagation, so org.eclipse.microprofile.context.spi.ThreadContextSnapshot#begin is called before onItem is invoked and org.eclipse.microprofile.context.spi.ThreadContextController#endContext is called after the onItem callbck is executed
  3. the onItem callback calls routingContext.next() (so that is inside, before the endContext() invocation)
  4. when there is switch between an IO thread and the worker thread [this part I am least sure about, it is really hard to determine the exact point], the routingContext.next() invocation ends and the org.eclipse.microprofile.context.spi.ThreadContextController#endContext is executed
  5. hence, there are 2 codes reusing the same local data, one from the org.eclipse.microprofile.context.spi.ThreadContextController#endContext and one from the REST endpoint user code

As far as the context propagation contexts is concerned, the org.eclipse.microprofile.context.spi.ThreadContextController#endContext cannot be executed while other code is executed because it will cause random behavior.

Hope it is bit clearer, let me know if it isn't.

michalvavrik avatar Oct 13 '25 13:10 michalvavrik

And yeah, re-scheduling routingContext.next() fixes the issue, but I don't think it is the right thing to do as it probably has performance implications? I just wanted to show it so that we can find proper solution.

michalvavrik avatar Oct 13 '25 13:10 michalvavrik

I'm not entirely sure to be honest. I suppose routingContext.next() actually continues the current request, rather than switch to a new request, so they should both have the same vertx duplicated context, no ? Unless the rest of the request is fired on a different thread pool?

Here's another lead: the way it works is that Mutiny, when you call onItem(Callback) will invoke BaseContextPropagationInterceptor.decorate(Callback) which will call context.contextualXXX(Callback) if the lambda is not already contextualised. This would capture the current context right now, and restore it when the callback is called.

If the lambda you pass to onItem() implements Contextualized then it won't be contextualised, it won't capture anything, and nothing will be restored. Which means it will inherit whatever thread locals are lying around. This could be what you're after if the next thread is new, or horrible if it has random left-over thread locals.

You can try that, see how things work.

Or (probably better) you can build a lambda to pass to onItem() that is properly contextualised for all cleared contexts (clean thread locals):

// you can save this one around for every time you need it
ThreadContext cleanThreadContext = ThreadContext.builder().propagated(ThreadContext.NONE).cleared(ThreadContext.ALL_REMAINING).build();
uni.onItem(cleanThreadContext.contextualCallable(() -> { rc.next(); }));

FroMage avatar Oct 13 '25 14:10 FroMage

I'm not entirely sure to be honest. I suppose routingContext.next() actually continues the current request, rather than switch to a new request, so they should both have the same vertx duplicated context, no ? Unless the rest of the request is fired on a different thread pool?

This is indeed how it works, it goes to the next handler in the chain, and there's not thread hop.

jponge avatar Oct 13 '25 15:10 jponge

I'm not entirely sure to be honest. I suppose routingContext.next() actually continues the current request, rather than switch to a new request, so they should both have the same vertx duplicated context, no ? Unless the rest of the request is fired on a different thread pool?

This is indeed how it works, it goes to the next handler in the chain, and there's not thread hop.

I spend whole weekend debugging it, I realize that, but maybe I didn't stress out that Quarkus REST endpoint is executed on a worker thread and there is a switch between IO thread and the worker thread. Maybe I am just not able explain this easily, sorry :man_shrugging:

If the lambda you pass to onItem() implements Contextualized then it won't be contextualised, it won't capture anything, and nothing will be restored. Which means it will inherit whatever thread locals are lying around. This could be what you're after if the next thread is new, or horrible if it has random left-over thread locals.

That is super helpful @FroMage , thanks, I am going to re-read and try it tomorrow. That sounds exactly like what I need.

michalvavrik avatar Oct 13 '25 17:10 michalvavrik

It definitely fixed the reproducer, thank you @FroMage (and @jponge )! I'll have to write a test as @brunobat reminded me and re-run CI if something else broken, but I'll ask you for a review if CI passes.

michalvavrik avatar Oct 14 '25 08:10 michalvavrik

I have to put this PR away for a while because I actually need to get something done. I think the issue is serious in a sense that whenever is OpenTelemetry and Security used together and there is for example switch to the blocking executor like in blocking Quarkus REST endpoints, it can randomly switch to a wrong context. I'll contribute few other things to Quarkus and come back to it unless someone solves it sooner (I'd very much welcome that!).

I have pushed test @brunobat , read comments for more context.

Regarding fix. Yes, using ThreadContext cleanThreadContext = ThreadContext.builder().propagated(ThreadContext.NONE) .cleared(ThreadContext.ALL_REMAINING).build(); or ThreadContext.builder().propagated() .unchanged().cleared().build() fixes this issue, but the issue is that in this scenario, propagated contexts are CDI and OpenTelemetry. And whenever CDI is cleared and there was null CDI context previously, it activates a new CDI request context: https://github.com/quarkusio/quarkus/blob/cfe160745abf7bfcdc06b2e5b0e6e6465a8626c0/extensions/arc/runtime/src/main/java/io/quarkus/arc/runtime/context/ArcContextProvider.java#L85

Apart of the fact it fails most of the scenarios for current impl. (which could be fixed), we just don't want to activate it so early, it causes issue and is not necessary. I have experimented with unchanged and there are 2 other callbacks that also switch to the wrong context "randomly". It is hard to identify them as stacktrace is useless in this case, it is all callbacks of callbacks.

I've tweaked security code a little today, which as a side-effect would allow anyone to easily experiment with different thread contexts. These are 2 consumers you may want to experiment with:

  • https://github.com/quarkusio/quarkus/blob/cfe160745abf7bfcdc06b2e5b0e6e6465a8626c0/extensions/vertx-http/runtime/src/main/java/io/quarkus/vertx/http/runtime/security/HttpSecurityRecorder.java#L395
  • https://github.com/quarkusio/quarkus/blob/cfe160745abf7bfcdc06b2e5b0e6e6465a8626c0/extensions/vertx-http/runtime/src/main/java/io/quarkus/vertx/http/runtime/security/AbstractHttpAuthorizer.java#L68

The second one (the authorizer) is irrelevant if in your app are not HTTP permissions like in the reproducer (but unlike in the test I added in this PR).

michalvavrik avatar Oct 19 '25 22:10 michalvavrik

@michalvavrik, I'll setup a meeting about this, if you don't mind.

brunobat avatar Oct 20 '25 08:10 brunobat

@michalvavrik, I'll setup a meeting about this, if you don't mind.

I have a lot of release tasks this week, but 25 minutes is fine, thanks for your time.

michalvavrik avatar Oct 20 '25 08:10 michalvavrik

Is there any progress on this?

melloware avatar Nov 18 '25 12:11 melloware

Is there any progress on this?

Me and @brunobat met. This PR fixes the issue and fails at least one test (I mentioned extensions/resteasy-reactive/rest-jackson/deployment/src/test/java/io/quarkus/resteasy/reactive/jackson/deployment/test/streams/StreamTestCase.java before), but I cannot say if the test assumptions are correct. Personally I think they are wrong, that's why I asked @jponge about that test, because I have never seen AbstractMultiOperator and MultiOperatorProcessor. I didn't find if that is how demands should work.

Anyway, this PR in current form is basically workaround, it leverages the fact that the workload is scheduled after this local thread and duplicated context instance are not used. I cannot say if there are performance implications, if I have to guess I think they are, which is why I hoped @cescoffier will give me a hint if I am right.

As for the clean thread context suggestions, due to what I explained before - https://github.com/quarkusio/quarkus/pull/50310#issuecomment-3420010887 - I think that alone is not enough.

If I don't have any new input, I'll go back and give it one more try.

michalvavrik avatar Nov 25 '25 09:11 michalvavrik

Switching randomly to a wrong context is definitely not good. The wrong context is used in Quarkus REST?

Probably @mkouba and @geoand should also be looking at this.

FroMage avatar Nov 26 '25 11:11 FroMage

Switching randomly to a wrong context is definitely not good

Which context are we talking about?

geoand avatar Nov 26 '25 11:11 geoand

Vert.x, I think. But honestly this issue has confused me in the past 😓

FroMage avatar Nov 26 '25 11:11 FroMage

This is a summary from what Michal explained to me earlier:

  • The same Vert.x duplicated context can be used concurrently
  • Reproducer shows that at one moment OpenTelemetry context is A and on the next line of synchronous code inside Quarkus REST endpoint, it is another OpenTelemetry context B. OpenTelemetry wrapps our mutiny onItem where security calls routingContext.next() and that wrapper code switches the context into a wrong one. The issue is that OpenTelemetry context cannot count with the fact that someone else is using the same context.

sberyozkin avatar Dec 01 '25 09:12 sberyozkin