Fix OpenTelemetry context resolution when HTTP Vert.x proactive authentication is enabled
What this PR does
- fixes https://github.com/quarkusio/quarkus/issues/49468
- tiny refactoring: I noticed that the anonymous identity is not memoized, so whenever we use the deferred identity elsewhere in the code, it calls the identity manager every time, resolves the anonymous identity providers and applies the augmentors. Now the resolved anonymous identity is reused instead of calling the identity manager repeatedly (a sketch of the idea is below).
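One way the memoization idea could look, purely as a sketch (the class and field names here are illustrative, not the PR's actual change):

```java
import io.quarkus.security.identity.IdentityProviderManager;
import io.quarkus.security.identity.SecurityIdentity;
import io.quarkus.security.identity.request.AnonymousAuthenticationRequest;
import io.smallrye.mutiny.Uni;

final class AnonymousIdentityHolder {

    private final Uni<SecurityIdentity> anonymousIdentity;

    AnonymousIdentityHolder(IdentityProviderManager identityProviderManager) {
        // resolve the anonymous identity (providers + augmentors) once and cache the result
        this.anonymousIdentity = identityProviderManager
                .authenticate(AnonymousAuthenticationRequest.INSTANCE)
                .memoize().indefinitely();
    }

    Uni<SecurityIdentity> get() {
        // every deferred access reuses the memoized result instead of re-resolving it
        return anonymousIdentity;
    }
}
```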
Tests (or a reproducer?)
You can use the reproducer attached to the linked issue to test this PR; the problem only shows up with a certain level of concurrency and takes time to hit. What I used is this:
```shell
git clone [email protected]:michalvavrik/eager-security-slow-context-activation-reproducer.git
cd eager-security-slow-context-activation-reproducer
mvn clean install
java -jar target/quarkus-app/quarkus-run.jar 2>&1 | tee log
# other window
seq 1 100000 | xargs -n1 -P1000 -I{} curl -v http://localhost:8080/reproducer
# look for "Baggage is missing first value" log message
```
Explanation of the change
In https://github.com/quarkusio/quarkus/issues/49468, OpenTelemetry contextual data is lost because the previous context is incorrectly determined. There is a race, caused by the fact that https://github.com/quarkusio/quarkus/blob/2ce3255df44644a66ffad4b5786d69282bcedbd0/extensions/opentelemetry/runtime/src/main/java/io/quarkus/opentelemetry/runtime/propagation/OpenTelemetryMpContextPropagationProvider.java#L32, invoked by io.smallrye.context.impl.SlowActiveContextState.close(SlowActiveContextState.java:41), is called while other code is using the same duplicated context (you can see it in the key-diff text file that I put into my reproducer). I think it happens at the moment when threads are switched to the worker thread, but it is very hard to debug and I am not sure where exactly in Quarkus REST it happens. However, this PR fixes the principle, not just one scenario:
- https://github.com/quarkusio/quarkus/blob/2ce3255df44644a66ffad4b5786d69282bcedbd0/extensions/vertx-http/runtime/src/main/java/io/quarkus/vertx/http/runtime/security/HttpSecurityRecorder.java#L406 is called from a subscriber `onItem` wrapped by the SmallRye Context Propagation Mutiny interceptor
- other route handlers run until the synchronous code ends and there is a context switch
- `Uni` subscriber callbacks like `onItem`, `onFailure` (`onCompletion` for a publisher, etc.) are wrapped by the Mutiny interceptor https://github.com/smallrye/smallrye-mutiny/blob/main/context-propagation/src/main/java/io/smallrye/mutiny/context/DefaultContextPropagationInterceptor.java
- SmallRye Context Propagation provides https://github.com/smallrye/smallrye-context-propagation/blob/main/core/src/main/java/io/smallrye/context/SmallRyeThreadContext.java
- SmallRye Context Propagation is a transitive dependency of `quarkus-mutiny`, which is a transitive dependency of `quarkus-vertx`
- OTel defines its own `ThreadContextProvider` called https://github.com/quarkusio/quarkus/blob/main/extensions/opentelemetry/runtime/src/main/java/io/quarkus/opentelemetry/runtime/propagation/OpenTelemetryMpContextPropagationProvider.java
- Then `OpenTelemetryMpContextPropagationProvider#endContext` can be called while the Quarkus REST endpoint's blocking method is executed; it is all about timing, and when it happens, it switches to the "prior" OTel context stored on the Vert.x duplicated context that is being used concurrently (see the sketch below)
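For reference, a simplified sketch of the MicroProfile Context Propagation SPI that is in play here; this is not the actual OpenTelemetry provider code, only an illustration of the `begin()`/`endContext()` pairing whose `endContext()` races with other code using the same duplicated context:

```java
import java.util.Map;

import org.eclipse.microprofile.context.spi.ThreadContextProvider;
import org.eclipse.microprofile.context.spi.ThreadContextSnapshot;

import io.opentelemetry.context.Context;
import io.opentelemetry.context.Scope;

// Sketch only: a ThreadContextProvider whose snapshot installs a captured OTel context
// via begin() and restores the *prior* context via endContext(). The race described
// above is endContext() running while other code on the same Vert.x duplicated context
// still expects the current OTel context.
public class SketchOtelContextProvider implements ThreadContextProvider {

    @Override
    public ThreadContextSnapshot currentContext(Map<String, String> props) {
        Context captured = Context.current();
        return () -> {
            Scope scope = captured.makeCurrent();
            // the returned ThreadContextController's endContext() closes the scope,
            // i.e. restores whatever OTel context was current before begin()
            return scope::close;
        };
    }

    @Override
    public ThreadContextSnapshot clearedContext(Map<String, String> props) {
        // "cleared" installs the root (empty) OTel context instead
        Context root = Context.root();
        return () -> {
            Scope scope = root.makeCurrent();
            return scope::close;
        };
    }

    @Override
    public String getThreadContextType() {
        return "OpenTelemetry-sketch";
    }
}
```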
My conclusion is that the fact that the same Vert.x duplicated context can be used concurrently is a bug, but we can easily avoid it, as far as Vert.x HTTP security is concerned, by scheduling `routingContext.next()` once the thread locals are no longer in use (see the sketch below). But I'd like to find a more efficient way to avoid the "last" callback, which needs to happen without SR Context Propagation, something like switching the context for one subscriber.
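A minimal sketch of that re-scheduling idea (illustrative only, not the PR's exact code):

```java
import io.vertx.core.Context;
import io.vertx.core.Vertx;
import io.vertx.ext.web.RoutingContext;

final class RescheduledNext {

    static void next(RoutingContext routingContext) {
        // stay on the same duplicated context, but run next() as a separately scheduled
        // task, i.e. only after the current callback (and its endContext()) has finished
        Context duplicatedContext = Vertx.currentContext();
        duplicatedContext.runOnContext(ignored -> routingContext.next());
    }
}
```

Because runOnContext() queues the task on the same duplicated context, next() only runs after the current callback chain, including the restoring endContext() call, has completed.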
Test added in this PR to reproduce the issue
Without the fix, the test fails on my laptop most of the time (the only way to make it always fail would be to raise the concurrency and the number of requests, which would however stretch CI). With the fix added, the following command succeeds:
```shell
for i in {1..100}; do
  echo "--- Run $i of 100 ---"
  mvn clean verify -Dtest=OpenTelemetryContextPropagationTest
  if [ $? -ne 0 ]; then
    echo "Command failed on run $i. Stopping loop."
    break
  fi
done
```
This loop should verify the fix added in this PR.
:waning_crescent_moon: This workflow status is outdated as a new workflow run has been triggered.
Status for workflow Quarkus CI
This is the status report for running Quarkus CI on commit c7b1533dfc2e499a7f1ccdd61a7f644e20f12665.
:white_check_mark: The latest workflow run for the pull request has completed successfully.
It should be safe to merge provided you have a look at the other checks in the summary.
You can consult the Develocity build scans.
Flaky tests - Develocity
:gear: JVM Tests - JDK 21
:package: extensions/smallrye-openapi/deployment
:x: io.quarkus.smallrye.openapi.test.vertx.OpenApiHttpRootPathCorsTestCase.testCorsFilterProperties - History
1 expectation failed. Expected status code <200> but was <500>.-java.lang.AssertionError
java.lang.AssertionError:
1 expectation failed.
Expected status code <200> but was <500>.
at java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:62)
at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:502)
at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:486)
at org.codehaus.groovy.reflection.CachedConstructor.invoke(CachedConstructor.java:73)
I actually have a fix proposal (not sure if it is acceptable) that touches the same code, so I'll just re-use this PR and switch reviewers. It will require Julien or Clement.
Recap: this definitely fixes the issue (I cannot reproduce it anymore), but I doubt this will get accepted. Nevertheless, I am not sure what else to do, so I'll wait for CI, then point to this PR in the linked issue and open a discussion.
I will disable one test because I can't figure out what the expected behavior is; the fact that the Multi operator demand has risen seems like a legit issue, but I lack context and I'd like to see whether other tests fail.
CI looked fine but it starts from scratch now that I marked this PR as ready :-).
Hello @jponge and @cescoffier,
with the changes in this PR, the issue https://github.com/quarkusio/quarkus/issues/49468 cannot be reproduced anymore. I think the moment when the `onItem` Mutiny interceptors (decorators wrapping the `onItem` callback) continue executing while the same duplicated context is used by another thread is the switch to the worker thread here:
- https://github.com/quarkusio/quarkus/blob/2ce3255df44644a66ffad4b5786d69282bcedbd0/independent-projects/resteasy-reactive/server/runtime/src/main/java/org/jboss/resteasy/reactive/server/handlers/BlockingHandler.java#L28
I tried to verify this theory by replacing the following code:
- https://github.com/quarkusio/quarkus/blob/6de1108de66a528328c0b4de6bf1f30f825880c4/extensions/resteasy-reactive/rest/runtime/src/main/java/io/quarkus/resteasy/reactive/server/runtime/ResteasyReactiveRecorder.java#L95
with something like this:
```java
public static final Supplier<Executor> EXECUTOR_SUPPLIER = new Supplier<>() {
    @Override
    public Executor get() {
        return new Executor() {
            @Override
            public void execute(@NotNull Runnable command) {
                if (!BlockingOperationControl.isBlockingAllowed() && VertxContext.isOnDuplicatedContext()) {
                    var ctx = Vertx.currentContext();
                    ctx.runOnContext(new Handler<Void>() {
                        @Override
                        public void handle(Void unused) {
                            ctx.executeBlocking(new Callable<Object>() {
                                @Override
                                public Object call() throws Exception {
                                    command.run();
                                    return null;
                                }
                            });
                        }
                    });
                } else {
                    ExecutorRecorder.getCurrent().execute(command);
                }
            }
        };
    }
};
```
and the linked issue was gone, but this approach had other problems (the io.quarkus.resteasy.reactive.server.test.RequestLeakDetectionTest#testWithConcurrentBlockingCalls test failed and I did not investigate further).
Could you kindly look at the code of this PR and the linked description and, if you can, suggest what the core issue is, how far off this PR is, and what the proper way to deal with this situation would be? I understand it could take a long time; just point me in the right direction. Thanks.
Also cc @geoand as it concerns Quarkus REST.
:waning_crescent_moon: This workflow status is outdated as a new workflow run has been triggered.
Status for workflow Quarkus CI
This is the status report for running Quarkus CI on commit e69022faab0018a07c33f4f41865abedca55034e.
:white_check_mark: The latest workflow run for the pull request has completed successfully.
It should be safe to merge provided you have a look at the other checks in the summary.
You can consult the Develocity build scans.
Flaky tests - Develocity
:gear: JVM Tests - JDK 17
:package: extensions/smallrye-reactive-messaging/deployment
:x: io.quarkus.smallrye.reactivemessaging.hotreload.ConnectorChangeTest.testUpdatingConnector - History
Expecting actual: ["-6","-7","-9","-10","-11","-12","-13","-14"] to start with: ["-6", "-7", "-8", "-9"]-java.lang.AssertionError
java.lang.AssertionError:
Expecting actual:
["-6","-7","-9","-10","-11","-12","-13","-14"]
to start with:
["-6", "-7", "-8", "-9"]
at io.quarkus.smallrye.reactivemessaging.hotreload.ConnectorChangeTest.testUpdatingConnector(ConnectorChangeTest.java:41)
I actually think this one is also related: https://github.com/quarkusio/quarkus/issues/40918
Because I use Quarkus Security as well and have seen both the error mentioned in this ticket and the other one
It would be nice if we had a test for the reported condition.
It would be nice if we had a test for the reported condition.
You are right. I don't think it will be a problem to create a test that reproduces the behavior on my workstation; I just wasn't sure how reliable it would be on slower machines etc. Anyway, no worries, once I have a signal that this could be even remotely acceptable, I'll add a test.
My guess is that re-scheduling the same HTTP request whenever Quarkus Security is present, just because of this, will not be acceptable. I kinda hope @jponge will point me in a direction for how to combine contexts (one empty, without SR Context Propagation, and the other with it)...
@michalvavrik you might confuse me with @vietj who is the Vert.x maintainer
@michalvavrik you might confuse me with @vietj who is the Vert.x maintainer
Not really, but maybe I evaluated wrongly what I need? I thought the easiest way to fix this concurrency issue is to simply call routingContext.next() after the Mutiny subscriber has completed, because if the callback only does rc.next() I don't really need a propagation context for it, just to run on the right duplicated context. But the Mutiny subscriber callbacks are intercepted by the Mutiny interceptor provided by SmallRye Context Propagation.
To put it simply, I don't see a way to avoid io.smallrye.mutiny.infrastructure.Infrastructure#UNI_INTERCEPTORS being applied to a Mutiny completion event. I thought you might have thoughts on this because what we do in HttpSecurityRecorder must be a common use case. If not, no problem and thank you, I'll find another way to fix it.
I'm not the best expert around that particular area (SR context propagation) but I'll try to have a look next week
I'm not the best expert around that particular area (SR context propagation) but I'll try to have a look next week
Thanks, I'd really appreciate hearing your thoughts; you don't need to solve it, just point me in the right direction :milky_way:
I eventually found this PR from @michalvavrik while trying to find out whether there are any open security issues related to the warnings about closing the OpenTelemetry scope, so this PR looks relevant :-)
I'm not the best expert around that particular area (SR context propagation) but I'll try to have a look next week
I checked today how SR Context Propagation works and I think we could apply wrappers that only use the correct context if some flag is set to true and leverage `try (CleanAutoCloseable ac = SmallRyeThreadContext.withThreadContext(threadContext))` and `ThreadContext.ALL_REMAINING`, but this situation seems to me like a Mutiny limitation:
- we need the context propagation for normal callbacks like the `onItem` transformation (that is, on the same `Uni`)
- but if the only thing that happens in the subscriber's `onItem` (I am referring to the last `onItem`) is `routingContext.next()`, I need no CDI or OTel for that callback; the only thing I need is that `routingContext.next()` is invoked on the correct duplicated context
But if you think it is a context propagation issue, I can ask around. @FroMage and @manovotn worked on it; maybe you can give me a hint on what I should look into if I want to avoid the context propagation only for certain Mutiny callbacks (those that call `routingContext.next()`), please?
It looks like you're saying that the problem is that we can call routingContext.next() from the wrong duplicated Vert.x context?
If yes, then indeed this has nothing to do with MP-CP, which does not deal with Vert.x context switching. This is a Vert.x question. Probably @cescoffier will have your answer.
It looks like you're saying that the problem is that we can call `routingContext.next()` from the wrong duplicated Vert.x context?
@FroMage I know there is a lot of description in this PR, so let me just highlight what I personally think the issue is:
- the Vert.x HTTP authentication handler needs to subscribe to a `Uni` to resolve the `SecurityIdentity` and then call `routingContext.next()` in the subscribed `onItem`
- the `onItem` callback is wrapped by the Mutiny interceptor, which wraps all callbacks with SmallRye Context Propagation, so `org.eclipse.microprofile.context.spi.ThreadContextSnapshot#begin` is called before `onItem` is invoked and `org.eclipse.microprofile.context.spi.ThreadContextController#endContext` is called after the `onItem` callback is executed
- the `onItem` callback calls `routingContext.next()` (so that happens inside, before the `endContext()` invocation)
- when there is a switch between an I/O thread and the worker thread [this part I am least sure about, it is really hard to determine the exact point], the `routingContext.next()` invocation ends and `org.eclipse.microprofile.context.spi.ThreadContextController#endContext` is executed
- hence, there are two pieces of code reusing the same thread-local data, one from `org.eclipse.microprofile.context.spi.ThreadContextController#endContext` and one from the REST endpoint user code
As far as the context propagation contexts are concerned, org.eclipse.microprofile.context.spi.ThreadContextController#endContext must not be executed while other code is running, because it causes random behavior.
Hope it is a bit clearer; let me know if it isn't.
And yeah, re-scheduling routingContext.next() fixes the issue, but I don't think it is the right thing to do, as it probably has performance implications. I just wanted to show it so that we can find a proper solution.
I'm not entirely sure to be honest. I suppose routingContext.next() actually continues the current request, rather than switching to a new request, so they should both have the same Vert.x duplicated context, no? Unless the rest of the request is fired on a different thread pool?
Here's another lead: the way it works is that Mutiny, when you call `onItem(Callback)`, will invoke `BaseContextPropagationInterceptor.decorate(Callback)`, which will call `context.contextualXXX(Callback)` if the lambda is not already contextualised. This captures the current context right now and restores it when the callback is called.
If the lambda you pass to `onItem()` implements `Contextualized` then it won't be contextualised, it won't capture anything, and nothing will be restored. Which means it will inherit whatever thread locals are lying around. This could be what you're after if the next thread is new, or horrible if it has random left-over thread locals.
You can try that, see how things work.
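For illustration, a minimal sketch of such a non-contextualised callback (the class name is hypothetical, and I am assuming the marker interface is io.smallrye.context.Contextualized from SmallRye Context Propagation):

```java
import java.util.function.Consumer;

import io.smallrye.context.Contextualized;
import io.vertx.ext.web.RoutingContext;

// Implements the Contextualized marker, so the Mutiny context-propagation decorator
// leaves it alone: nothing is captured and nothing is restored around the callback.
final class PlainNextCallback<T> implements Consumer<T>, Contextualized {

    private final RoutingContext routingContext;

    PlainNextCallback(RoutingContext routingContext) {
        this.routingContext = routingContext;
    }

    @Override
    public void accept(T item) {
        // simply continue the router chain; inherits whatever thread locals are present
        routingContext.next();
    }
}
```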
Or (probably better) you can build a lambda to pass to onItem() that is properly contextualised for all cleared contexts (clean thread locals):
```java
// you can save this one around for every time you need it
ThreadContext cleanThreadContext = ThreadContext.builder()
        .propagated(ThreadContext.NONE)
        .cleared(ThreadContext.ALL_REMAINING)
        .build();
uni.onItem(cleanThreadContext.contextualCallable(() -> { rc.next(); }));
```
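For completeness, a slightly more self-contained sketch of that suggestion applied to an identity subscription (CleanContextSubscription, continueChain and identityUni are hypothetical names, not code from this PR):

```java
import io.quarkus.security.identity.SecurityIdentity;
import io.smallrye.mutiny.Uni;
import io.vertx.ext.web.RoutingContext;
import org.eclipse.microprofile.context.ThreadContext;

final class CleanContextSubscription {

    // propagate nothing and clear everything -> clean thread locals inside the callbacks
    private static final ThreadContext CLEAN = ThreadContext.builder()
            .propagated(ThreadContext.NONE)
            .cleared(ThreadContext.ALL_REMAINING)
            .build();

    static void continueChain(Uni<SecurityIdentity> identityUni, RoutingContext routingContext) {
        identityUni.subscribe().with(
                CLEAN.contextualConsumer((SecurityIdentity identity) -> routingContext.next()),
                CLEAN.contextualConsumer((Throwable failure) -> routingContext.fail(failure)));
    }
}
```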
I'm not entirely sure to be honest. I suppose routingContext.next() actually continues the current request, rather than switching to a new request, so they should both have the same Vert.x duplicated context, no? Unless the rest of the request is fired on a different thread pool?
This is indeed how it works, it goes to the next handler in the chain, and there's no thread hop.
I'm not entirely sure to be honest. I suppose routingContext.next() actually continues the current request, rather than switching to a new request, so they should both have the same Vert.x duplicated context, no? Unless the rest of the request is fired on a different thread pool?
This is indeed how it works, it goes to the next handler in the chain, and there's no thread hop.
I spent the whole weekend debugging it, so I realize that, but maybe I didn't stress enough that the Quarkus REST endpoint is executed on a worker thread and there is a switch between the I/O thread and the worker thread. Maybe I am just not able to explain this easily, sorry :man_shrugging:
If the lambda you pass to onItem() implements Contextualized then it won't be contextualised, it won't capture anything, and nothing will be restored. Which means it will inherit whatever thread locals are lying around. This could be what you're after if the next thread is new, or horrible if it has random left-over thread locals.
That is super helpful @FroMage, thanks, I am going to re-read it and try it tomorrow. That sounds exactly like what I need.
It definitely fixed the reproducer, thank you @FroMage (and @jponge)! I'll have to write a test as @brunobat reminded me and re-run CI to see whether something else is broken, but I'll ask you for a review if CI passes.
I have to put this PR away for a while because I actually need to get something done. I think the issue is serious in the sense that whenever OpenTelemetry and Security are used together and there is, for example, a switch to the blocking executor like in blocking Quarkus REST endpoints, it can randomly switch to a wrong context. I'll contribute a few other things to Quarkus and come back to it unless someone solves it sooner (I'd very much welcome that!).
I have pushed the test @brunobat, read the comments for more context.
Regarding the fix: yes, using `ThreadContext.builder().propagated(ThreadContext.NONE).cleared(ThreadContext.ALL_REMAINING).build()` or `ThreadContext.builder().propagated().unchanged().cleared().build()` fixes this issue, but the problem is that in this scenario the propagated contexts are CDI and OpenTelemetry. And whenever CDI is cleared and the CDI context was previously null, it activates a new CDI request context: https://github.com/quarkusio/quarkus/blob/cfe160745abf7bfcdc06b2e5b0e6e6465a8626c0/extensions/arc/runtime/src/main/java/io/quarkus/arc/runtime/context/ArcContextProvider.java#L85
Apart from the fact that it fails most of the scenarios with the current implementation (which could be fixed), we just don't want to activate it so early; it causes issues and is not necessary. I have experimented with `unchanged` and there are 2 other callbacks that also switch to the wrong context "randomly". It is hard to identify them as the stacktrace is useless in this case; it is all callbacks of callbacks.
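For context, one shape such an `unchanged` experiment could take, purely as a sketch (the class and field names are hypothetical; `ThreadContext.CDI` is the standard MicroProfile constant):

```java
import org.eclipse.microprofile.context.ThreadContext;

final class ContextExperiments {

    // Sketch: propagate nothing, leave the CDI context untouched, clear everything else,
    // so the cleared CDI provider does not activate a new request context too early.
    static final ThreadContext NO_CDI_CLEARING = ThreadContext.builder()
            .propagated(ThreadContext.NONE)
            .unchanged(ThreadContext.CDI)
            .cleared(ThreadContext.ALL_REMAINING)
            .build();
}
```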
I've tweaked the security code a little today, which as a side effect allows anyone to easily experiment with different thread contexts. These are the 2 consumers you may want to experiment with:
- https://github.com/quarkusio/quarkus/blob/cfe160745abf7bfcdc06b2e5b0e6e6465a8626c0/extensions/vertx-http/runtime/src/main/java/io/quarkus/vertx/http/runtime/security/HttpSecurityRecorder.java#L395
- https://github.com/quarkusio/quarkus/blob/cfe160745abf7bfcdc06b2e5b0e6e6465a8626c0/extensions/vertx-http/runtime/src/main/java/io/quarkus/vertx/http/runtime/security/AbstractHttpAuthorizer.java#L68
The second one (the authorizer) is irrelevant if your app has no HTTP permissions, as in the reproducer (but unlike the test I added in this PR).
@michalvavrik, I'll setup a meeting about this, if you don't mind.
@michalvavrik, I'll setup a meeting about this, if you don't mind.
I have a lot of release tasks this week, but 25 minutes is fine, thanks for your time.
Is there any progress on this?
Is there any progress on this?
@brunobat and I met. This PR fixes the issue but fails at least one test (I mentioned extensions/resteasy-reactive/rest-jackson/deployment/src/test/java/io/quarkus/resteasy/reactive/jackson/deployment/test/streams/StreamTestCase.java before), but I cannot say whether the test's assumptions are correct. Personally I think they are wrong; that's why I asked @jponge about that test, because I had never seen AbstractMultiOperator and MultiOperatorProcessor before. I couldn't determine whether that is how demand should work.
Anyway, this PR in its current form is basically a workaround; it leverages the fact that the workload is scheduled only after the thread locals and the duplicated context instance are no longer in use. I cannot say whether there are performance implications; if I have to guess, I think there are, which is why I hoped @cescoffier would give me a hint on whether I am right.
As for the clean thread context suggestions, due to what I explained before - https://github.com/quarkusio/quarkus/pull/50310#issuecomment-3420010887 - I think that alone is not enough.
If I don't have any new input, I'll go back and give it one more try.
Switching randomly to a wrong context is definitely not good. Is the wrong context used in Quarkus REST?
Probably @mkouba and @geoand should also be looking at this.
Switching randomly to a wrong context is definitely not good
Which context are we talking about?
Vert.x, I think. But honestly this issue has confused me in the past 😓
This is a summary from what Michal explained to me earlier:
- The same Vert.x duplicated context can be used concurrently
- The reproducer shows that at one moment the OpenTelemetry context is A and on the next line of synchronous code inside the Quarkus REST endpoint it is another OpenTelemetry context B. OpenTelemetry wraps our Mutiny `onItem` where security calls `routingContext.next()`, and that wrapper code switches the context to a wrong one. The issue is that the OpenTelemetry context handling cannot account for the fact that someone else is using the same context.