quarkus icon indicating copy to clipboard operation
quarkus copied to clipboard

Quarkus Scheduler, when calling a token-protected endpoint (keycloak) fails to refresh the token

Open spirostz opened this issue 1 year ago • 6 comments

Describe the bug

Using Quarkus Scheduler, (@Scheduled). Used Latest Quarkus LTS 3.8.5

Inside scheduler method there is a RestClient calling an Token Protected API (eg. keycloak) When calling a token-protected endpoint (keycloak), fails to refresh the token after the token expiration see image.

I have FULLY reproduced the problem here, just git clone, START the application and wait 10 minutes: https://github.com/spirostz/quarkus_scheduler_token_problem

There is no problem when calling the same endpoint outside Quarkus Scheduler. Also reproduced in a test in the above repo.

image

Expected behavior

Inside scheduler method, all calls to a token-protected endpoint should never fail with 401 error

Actual behavior

Inside scheduler method, all calls to a token-protected endpoint should respond without error when

How to Reproduce?

1 Clone the https://github.com/spirostz/quarkus_scheduler_token_problem 2 Start the Application 3 Wait 10 minutes and you will see the error in the logs

Output of uname -a or ver

MINGW64_NT-10.0-22631 spirospc 3.3.3-341.x86_64 2022-01-17 11:45 UTC x86_64 Msys

Output of java -version

java version "17.0.8" 2023-07-18 LTS Java(TM) SE Runtime Environment (build 17.0.8+9-LTS-211) Java HotSpot(TM) 64-Bit Server VM (build 17.0.8+9-LTS-211, mixed mode, sharing)

Quarkus version or git rev

3.8.5

Build tool (ie. output of mvnw --version or gradlew --version)

Maven home: C:\tools\apache-maven-3.8.5-bin\apache-maven-3.8.5 Java version: 17.0.8, vendor: Oracle Corporation, runtime: C:\Program Files\Java\jdk-17 Default locale: en_US, platform encoding: Cp1252 OS name: "windows 11", version: "10.0", arch: "amd64", family: "windows"

Additional information

### logs containing the failure:

2024-06-26 13:07:48,052 INFO [org.tes.doc.DockerClientProviderStrategy] (build-42) Loaded org.testcontainers.dockerclient.NpipeSocketClientProviderStrategy from ~/.testcontainers.properties, will try it first 2024-06-26 13:07:48,262 INFO [org.tes.doc.DockerClientProviderStrategy] (build-42) Found Docker environment with local Npipe socket (npipe:////./pipe/docker_engine) 2024-06-26 13:07:48,264 INFO [org.tes.DockerClientFactory] (build-42) Docker host IP address is localhost 2024-06-26 13:07:48,279 INFO [org.tes.DockerClientFactory] (build-42) Connected to docker: Server Version: 26.1.4 API Version: 1.45 Operating System: Docker Desktop Total Memory: 11964 MB 2024-06-26 13:07:48,293 INFO [org.tes.ima.PullPolicy] (build-42) Image pull policy will be performed by: DefaultPullPolicy() 2024-06-26 13:07:48,295 INFO [org.tes.uti.ImageNameSubstitutor] (build-42) Image name substitution will be performed by: DefaultImageNameSubstitutor (composite of 'ConfigurationFileImageNameSubstitutor' and 'PrefixingImageNameSubstitutor') 2024-06-26 13:07:48,299 INFO [org.tes.DockerClientFactory] (build-42) Checking the system... 2024-06-26 13:07:48,300 INFO [org.tes.DockerClientFactory] (build-42) ?? Docker server version should be at least 1.6.0 2024-06-26 13:07:48,323 INFO [io.qua.oid.dep.dev.key.KeycloakDevServicesProcessor] (build-42) Using Quarkus powered Keycloak distribution 2024-06-26 13:07:48,343 INFO [tc.qua.io/.0.4] (build-42) Creating container for image: quay.io/keycloak/keycloak:24.0.4 2024-06-26 13:07:48,540 INFO [org.tes.uti.RegistryAuthLocator] (build-42) Credential helper/store (docker-credential-desktop) does not have credentials for quay.io 2024-06-26 13:07:48,548 INFO [tc.tes.6.0] (build-42) Creating container for image: testcontainers/ryuk:0.6.0 2024-06-26 13:07:48,620 INFO [org.tes.uti.RegistryAuthLocator] (build-42) Credential helper/store (docker-credential-desktop) does not have credentials for https://index.docker.io/v1/ 2024-06-26 13:07:48,741 INFO [tc.tes.6.0] (build-42) Container testcontainers/ryuk:0.6.0 is starting: 05f3bb0109b1548459d17c5443e581751ea5b3d17e9e79abc461924137459be0 2024-06-26 13:07:49,326 INFO [tc.tes.6.0] (build-42) Container testcontainers/ryuk:0.6.0 started in PT0.777672S 2024-06-26 13:07:49,404 INFO [tc.qua.io/.0.4] (build-42) Container quay.io/keycloak/keycloak:24.0.4 is starting: 621159e8996b1bfc69a27f226783bd47dad769e01496c13be5ed26e790461524 2024-06-26 13:08:11,118 INFO [tc.qua.io/.0.4] (build-42) Container quay.io/keycloak/keycloak:24.0.4 started in PT22.7750323S 2024-06-26 13:08:13,555 INFO [io.qua.oid.dep.dev.key.KeycloakDevServicesProcessor] (build-42) Dev Services for Keycloak started.


--/ __ / / / / _ | / _ / /// / / / __/ -/ // / // / __ |/ , / ,< / // /\ \
--___
// |//|//||_//
2024-06-26 13:08:14,874 INFO [io.quarkus] (Quarkus Main Thread) code-with-quarkus 1.0.0-SNAPSHOT on JVM (powered by Quarkus 3.8.5) started in 28.167s. Listening on: http://localhost:8080 2024-06-26 13:08:14,876 INFO [io.quarkus] (Quarkus Main Thread) Profile dev activated. Live Coding activated. 2024-06-26 13:08:14,877 INFO [io.quarkus] (Quarkus Main Thread) Installed features: [cdi, keycloak-authorization, oidc, oidc-client, oidc-client-filter, resteasy, resteasy-client, resteasy-client-jackson, scheduler, security, smallrye-context-propagation, vertx] 2024-06-26 13:08:30,200 INFO [org.acm.Schedule] (vert.x-worker-thread-1) Calling client 2024-06-26 13:09:00,004 INFO [org.acm.Schedule] (vert.x-worker-thread-1) Calling client 2024-06-26 13:09:30,008 INFO [org.acm.Schedule] (vert.x-worker-thread-1) Calling client 2024-06-26 13:10:00,000 INFO [org.acm.Schedule] (vert.x-worker-thread-1) Calling client 2024-06-26 13:10:30,006 INFO [org.acm.Schedule] (vert.x-worker-thread-1) Calling client 2024-06-26 13:11:00,014 INFO [org.acm.Schedule] (vert.x-worker-thread-1) Calling client 2024-06-26 13:11:30,008 INFO [org.acm.Schedule] (vert.x-worker-thread-1) Calling client 2024-06-26 13:12:00,007 INFO [org.acm.Schedule] (vert.x-worker-thread-1) Calling client 2024-06-26 13:12:30,001 INFO [org.acm.Schedule] (vert.x-worker-thread-1) Calling client 2024-06-26 13:13:00,011 INFO [org.acm.Schedule] (vert.x-worker-thread-1) Calling client 2024-06-26 13:13:30,012 INFO [org.acm.Schedule] (vert.x-worker-thread-1) Calling client 2024-06-26 13:14:00,005 INFO [org.acm.Schedule] (vert.x-worker-thread-1) Calling client 2024-06-26 13:14:30,009 INFO [org.acm.Schedule] (vert.x-worker-thread-1) Calling client 2024-06-26 13:15:00,010 INFO [org.acm.Schedule] (vert.x-worker-thread-1) Calling client 2024-06-26 13:15:30,014 INFO [org.acm.Schedule] (vert.x-worker-thread-1) Calling client 2024-06-26 13:16:00,003 INFO [org.acm.Schedule] (vert.x-worker-thread-1) Calling client 2024-06-26 13:16:30,002 INFO [org.acm.Schedule] (vert.x-worker-thread-1) Calling client 2024-06-26 13:17:00,002 INFO [org.acm.Schedule] (vert.x-worker-thread-1) Calling client 2024-06-26 13:17:30,000 INFO [org.acm.Schedule] (vert.x-worker-thread-1) Calling client 2024-06-26 13:18:00,000 INFO [org.acm.Schedule] (vert.x-worker-thread-1) Calling client 2024-06-26 13:18:30,004 INFO [org.acm.Schedule] (vert.x-worker-thread-1) Calling client 2024-06-26 13:18:30,022 ERROR [io.qua.sch.com.run.StatusEmitterInvoker] (vert.x-worker-thread-1) Error occurred while executing task for trigger CronTrigger [id=1_org.acme.Schedule#schedule, cron=0/30 * * * * ?, gracePeriod=PT1S, timeZone=null]: java.util.concurrent.CompletionException: jakarta.ws.rs.WebApplicationException: Unknown error, status code 401 at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:332) at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:347) at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:874) at java.base/java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:887) at java.base/java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:2325) at java.base/java.util.concurrent.CompletableFuture$MinimalStage.whenComplete(CompletableFuture.java:2902) at io.quarkus.scheduler.common.runtime.DefaultInvoker.invoke(DefaultInvoker.java:24) at io.quarkus.scheduler.common.runtime.StatusEmitterInvoker.invoke(StatusEmitterInvoker.java:35) at io.quarkus.scheduler.runtime.SimpleScheduler$ScheduledTask.doInvoke(SimpleScheduler.java:443) at io.quarkus.scheduler.runtime.SimpleScheduler$ScheduledTask$2.call(SimpleScheduler.java:425) at io.quarkus.scheduler.runtime.SimpleScheduler$ScheduledTask$2.call(SimpleScheduler.java:422) at io.vertx.core.impl.ContextImpl.lambda$executeBlocking$0(ContextImpl.java:178) at io.vertx.core.impl.ContextInternal.dispatch(ContextInternal.java:279) at io.vertx.core.impl.ContextImpl.lambda$internalExecuteBlocking$2(ContextImpl.java:210) at org.jboss.threads.ContextHandler$1.runWith(ContextHandler.java:18) at org.jboss.threads.EnhancedQueueExecutor$Task.run(EnhancedQueueExecutor.java:2513) at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1538) at org.jboss.threads.DelegatingRunnable.run(DelegatingRunnable.java:29) at org.jboss.threads.ThreadLocalResettingRunnable.run(ThreadLocalResettingRunnable.java:29) at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) at java.base/java.lang.Thread.run(Thread.java:833) Caused by: jakarta.ws.rs.WebApplicationException: Unknown error, status code 401 at org.jboss.resteasy.microprofile.client.DefaultResponseExceptionMapper.toThrowable(DefaultResponseExceptionMapper.java:42) at org.jboss.resteasy.microprofile.client.ExceptionMapping$HandlerException.mapException(ExceptionMapping.java:60) at io.quarkus.restclient.runtime.QuarkusProxyInvocationHandler.invoke(QuarkusProxyInvocationHandler.java:172) at jdk.proxy5/jdk.proxy5.$Proxy90.call(Unknown Source) at org.acme.Schedule.schedule(Schedule.java:17) at org.acme.Schedule_ClientProxy.schedule(Unknown Source) at org.acme.Schedule_ScheduledInvoker_schedule_dce0524c3f1de653f83719900424446459b7c471.invokeBean(Unknown Source) ... 15 more

2024-06-26 13:19:00,005 INFO [org.acm.Schedule] (vert.x-worker-thread-1) Calling client 2024-06-26 13:19:30,013 INFO [org.acm.Schedule] (vert.x-worker-thread-1) Calling client 2024-06-26 13:20:00,009 INFO [org.acm.Schedule] (vert.x-worker-thread-1) Calling client

spirostz avatar Jun 26 '24 10:06 spirostz

/cc @manovotn (scheduler), @mkouba (scheduler), @pedroigor (keycloak), @sberyozkin (keycloak)

quarkus-bot[bot] avatar Jun 26 '24 10:06 quarkus-bot[bot]

@spirostz I see you use Keycloak Dev Service, with default realm properties, and I can only guess that after 10 mins, a limit on a number of times a refresh token can be used is reached, or the refresh token's lifetime may be set equal to the access token's lifetime (FYI, I don't think it is possible to directly control a RT lifetime using Keycloak admin API).

I think the best option where it can be kept under control for a long time, while using the Dev Service, is to create a custom realm with the required properties and use quarkus.keycloak.devservice.realm-path property to get it loaded.

Can you please try to update https://github.com/quarkusio/quarkus/blob/main/extensions/oidc/deployment/src/main/java/io/quarkus/oidc/deployment/devservices/keycloak/KeycloakDevServicesProcessor.java#L777, to say 100 and see if it can help, we can set it to a high enough value by default if it makes a difference. I can work with your reproducer, but I have quite a few other issues in front of me right now

sberyozkin avatar Jun 26 '24 15:06 sberyozkin

@sberyozkin Thank you for you response, I did the change through the GUI and the problem is still there.

Network data exchanged: image

Keep in mind that I used Keycloak Dev Service just to replicate the problem and keep the reproduction setup minimal (plug n play approach). The problem identified on a production environment with production ready keycloak server.

There is no problem when calling the same endpoint outside Quarkus Scheduler, so I believe is something related to scheduler.

If you find sometime to reproduce it with my repo:

  • Starting the app and wait 10 min will reproduce the problem (Scheduler).
  • Just running the test there, confirms that there is no problem if you do the same calls outside Scheduler.

spirostz avatar Jun 26 '24 15:06 spirostz

@spirostz OK, I see, Can you please add quarkus.keycloak.devservice.show-logs=true and check what error is reported at the Keycloak end ?

sberyozkin avatar Jun 26 '24 16:06 sberyozkin

@sberyozkin I enabled the logs quarkus.keycloak.devservices.show-logs=true but nothing logged before or after the error.

I confirmed that logs worked since I saw them during the keycloak startup eg [org.keycloak.services] (main) KC-SERVICES0050: Initializing master realm

I pushed the new property to my repo.

image

spirostz avatar Jun 26 '24 16:06 spirostz

@spirostz Thanks, I'll have a closer look next week. Let me ask you to check something else in the meantime...

Can you confirm, that when Scheduler runs the client, the access token is reaching the protected endpoint, before it expires ?

Can you add a refresh token time skew property, quarkus.oidc-client.refresh-token-time-skew, and set it to a value larger than the scheduler interval, for ex, if it runs every 30 secs, set the skew to 70 secs... One possibility, is that a nearly expired token is sent to the target and by the time it arrives it has expired. So this property will ensure a nearly expired token gets refreshed

sberyozkin avatar Jun 27 '24 16:06 sberyozkin

@sberyozkin By setting the quarkus.oidc-client.refresh-token-time-skew=70s there was no error.

spirostz avatar Jul 01 '24 11:07 spirostz

@spirostz Ok, thanks, in that case, do you agree we can close this issue ? OIDC client does not refresh the token which is still valid, while the refresh token time skew property is taking care of the edge cases where the token might expire immediately after it was propagated. OidcClient itself can not rewind the request chain if the error is returned, which I guess can be indirectly resolved by using Fault tolerance annotations only...

sberyozkin avatar Jul 01 '24 14:07 sberyozkin

Thank you @sberyozkin, even the quarkus.oidc-client.refresh-token-time-skew=1s (1 second) fixes the problem. So, as far as my problem, I can use this workaround and will consider the issue closed.

Now, as far as the Quarkus bug perspective, I made some tests and it is not an edge case. If you are inside a scheduler method the call will always fail after the token expires, even if the expiration happened long time ago.

So, as you understand, everyone that is implementing calls against token-protected endpoints through a short period scheduler method (less that token's lifecycle) will experience this problem.

Now, considering that this never happened outside scheduler methods, I believe there is a scheduler related problem there.

spirostz avatar Jul 02 '24 09:07 spirostz

I don't think scheduler does any token related handling here and so far I didn't spot anything in the code that would indicate that. But maybe I am missing something; @mkouba thoughts?

manovotn avatar Jul 02 '24 09:07 manovotn

Hi @spirostz,

Sure, sorry, I meant to say, that the situation where a token is about to expire very shortly is that kind of an edge case where you can't do anything useful but hope for the best. It has been reported before, for cases not even involving the scheduler, which is why we brought this option of the proactive refreshment which brings the probability of OidcClient sending a valid token which becomes invalid a few milliseconds/secs later to nearly 0.

I'm certain it is a coincidence that your issue is reproducible from the scheduler, it runs every 30 seconds, so, when the timer has run for 10 minutes, it gets a valid token which will exp in 1 millisec or something like that... If you set it to something like 37, such that it rarely if ever gets to an even number like 600 seconds, then you should see a different picture though the chance of missing out on refreshing the token would still be there.

So configuring a refresh time skew is a good idea in general.

sberyozkin avatar Jul 02 '24 10:07 sberyozkin

Hi @sberyozkin,

I tried the 37 sec approach and failed again. Whatever cron I setup that is less than a 10 min period replicates the error.

image

spirostz avatar Jul 02 '24 12:07 spirostz

@spirostz Thanks, but at the same time, having a proactive refreshment resolves the problem. It looks like it is related to the mechanics of how Scheduler works, meaning that this situation with a nearly expired token being forwarded to the target is very likely to happen...

But we'll get to the bottom of it, can you please give me a favor and enable the debug logging for io.quarkus.oidc.runtime, let's check what error is reported when OIDC client sends a token to GreeterResource

Thanks

sberyozkin avatar Jul 02 '24 12:07 sberyozkin

@sberyozkin You are correct.

The way I setup the cron 0/37 * * * * ? always the makes sure that the token will expire the exact same time as the the schedule will run. Check the times on my previous image start: 14:46:37, error: 14:56:37. Also, it is not every 37sec if you check.

To actually be able to setup a 37 interval, I used the @Scheduled(every = "37s"). By doing this, the error disappeared. To be honest I did not find a way to setup an actual cron to run every 37 sec, I tried this */37 * * * * ? but did not work.

Also explains why the 1s skew worked.

So, @sberyozkin really thank you for the time spend to help me identify the problem!

spirostz avatar Jul 02 '24 12:07 spirostz

Hey @spirostz Np at all, glad you've figured it out... I'll add some extra logging though related to tracing how OIDC client evaluates the expiry time, to make it easier to identify such issues easier in the future, and then I'll have the issue resolved with that PR, thanks

sberyozkin avatar Jul 02 '24 13:07 sberyozkin

To be honest I did not find a way to setup an actual cron to run every 37 sec, I tried this */37 * * * * ? but did not work.

@mkouba @manovotn, Hi Martin, Matej, if you can recommend something, please do...

sberyozkin avatar Jul 02 '24 13:07 sberyozkin

To actually be able to setup a 37 interval, I used the @Scheduled(every = "37s").

This is how I'd set it up. I don't think you can do it with cron, the lowest it can express is every min IIRC.

I tried this */37 * * * * ? but did not work.

This should be every 37th min.

manovotn avatar Jul 02 '24 13:07 manovotn

To actually be able to setup a 37 interval, I used the @scheduled(every = "37s").

This is how I'd set it up. I don't think you can do it with cron, the lowest it can express is every min IIRC.

I tried this */37 * * * * ? but did not work.

This should be every 37th min.

I also don't think you can use a cron expr to execute something every 37 seconds. The slash / is used to define start and step value, i.e. it cannot be used to define "periods". In Quartz, 0/37 * * * * ? and */37 * * * * ? would mean that the trigger fires on the 0th and 37th second of every minute... e.g. 12:00:00, 12:00:37, 12:01:00, 12:01:37, ...

mkouba avatar Jul 02 '24 16:07 mkouba