quarkus icon indicating copy to clipboard operation
quarkus copied to clipboard

GraphQL spurious authentication failures

Open nvrs opened this issue 2 years ago • 19 comments

Describe the bug

After upgrading from 2.9.2.Final, requests to the GraphQL API for secured resources (queries / mutations) may spuriously fail with an "unauthorized" error-code.

I have created a simple Quarkus GraphQL project that uses the the https://quarkus.io/guides/microprofile-graphql as a base to exhibit / reproduce the issue. Specifically, the quarkus GraphQL example has been modified by adding an extra method to the API (allFilmsSecured) that's secured using the standard @Authenticated annotation, and leveraged the quarkus-elytron-security-properties-file extension to configure a set of valid user credentials.

Expected behavior

When querying the secured endpoint with the correct credentials (HTTP Basic Auth) concurrenctly with / without other unsecured endpoints, one should not receive unauthorized error-code responses.

Actual behavior

Given enough concurrent requests (tests are configured to 50 but failures occur with fewer too), some will fail with unauthorized error-code. What's also suprising is that the unexpected behaviour is not exactly the same for all versions > 2.9.2.Final. Specifically, for versions up to 2.10.3.Final the bug seems to occur only when there are concurrent requests with different security credentials and / or a combination of resources with different security restrictions. However on 2.10.4.Final and 2.11.1.Final failures are observed even when loading the service with concurrent requests for the same resource with the exact same credentials.

How to Reproduce?

Executing the Junit tests in ConcurrentAuthTest reproduces the issue on two machines that I have tried it out (8 cores / 16 threads - Ryzen 58000x, and 4 cores / 8 threads intel i7 kaby lake). On the test class there are two tests, one that attempts concurrent requests on both secured and unsecured resources (that fails on all versions > 2.9.2.Final), and another that only performs concurrent requests on the secured resource that fails only on 2.10.4.Final and 2.11.1.Final.

A JMeter Test Plan is also included that reproduces the issue very reliably (JSON Assertions fail when receiving unauthorized error codes).

Output of uname -a or ver

Linux fedora-desktop 5.18.13-200.fc36.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Jul 22 14:03:36 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Output of java -version

openjdk version "17.0.2" 2022-01-18 OpenJDK Runtime Environment Temurin-17.0.2+8 (build 17.0.2+8) OpenJDK 64-Bit Server VM Temurin-17.0.2+8 (build 17.0.2+8, mixed mode, sharing)

GraalVM version (if different from Java)

No response

Quarkus version or git rev

No response

Build tool (ie. output of mvnw --version or gradlew --version)

Apache Maven 3.8.4

Additional information

No response

nvrs avatar Jul 31 '22 09:07 nvrs

/cc @jmartisk, @phillip-kruger

quarkus-bot[bot] avatar Jul 31 '22 09:07 quarkus-bot[bot]

This might be related to https://github.com/quarkusio/quarkus/issues/26748

nvrs avatar Jul 31 '22 10:07 nvrs

Yeah that's definitely related to it. @geoand we might need your brain on this one, looks like we have trouble fixing this issue.

gsmet avatar Jul 31 '22 11:07 gsmet

I already discussed with @geoand on Friday he gave me a few idea to go and look for. I'll continue tomorrow on this.

phillip-kruger avatar Jul 31 '22 11:07 phillip-kruger

I also just thought of something else : maybe somewhere in the code a future / promise return value is being ignored?

geoand avatar Jul 31 '22 13:07 geoand

@geoand I'll check, why would that be an issue ?

phillip-kruger avatar Jul 31 '22 23:07 phillip-kruger

Just thinking it could create a race

geoand avatar Aug 01 '22 00:08 geoand

I pinged @stuartwdouglas , he is helping out. I am hopeful we will get this fixed soon.

phillip-kruger avatar Aug 01 '22 00:08 phillip-kruger

@nvrs could you try the latest snapshots to make sure we fixed it?

See https://github.com/quarkusio/quarkus/blob/main/CONTRIBUTING.md#using-snapshots for the instructions.

I'll release new versions once I get your feedback.

Thanks!

gsmet avatar Aug 02 '22 10:08 gsmet

The issue is still not entirely fixed. If you're exposing GraphQL services, stay with 2.9.x for now. We will follow up once we are positive the issue is entirely fixed.

gsmet avatar Aug 05 '22 07:08 gsmet

@nvrs could you try the latest snapshots to make sure we fixed it?

See https://github.com/quarkusio/quarkus/blob/main/CONTRIBUTING.md#using-snapshots for the instructions.

I'll release new versions once I get your feedback.

Thanks!

@gsmet Just saw your reply, sure I can, let me know when to give it another go.

nvrs avatar Aug 08 '22 15:08 nvrs

@nvrs I think we got the latest fixes in today. I will ping you tomorrow when the new snapshots are online.

gsmet avatar Aug 08 '22 15:08 gsmet

@nvrs the snapshots have been deployed so you can test them by following the instructions here: https://github.com/quarkusio/quarkus/blob/main/CONTRIBUTING.md#using-snapshots .

gsmet avatar Aug 09 '22 09:08 gsmet

@gsmet, will do and report back

nvrs avatar Aug 09 '22 09:08 nvrs

@gsmet I am going to build from source (main branch) instead of using snapshots.

nvrs avatar Aug 09 '22 12:08 nvrs

@nvrs as you wish. We have instructions for that too: https://github.com/quarkusio/quarkus/blob/main/CONTRIBUTING.md#building-main :)

gsmet avatar Aug 09 '22 13:08 gsmet

The jmeter test that so far was consistently reproducing the issue on the example code (100% rate within a couple of hundred requests up until 2.11.1.Final) doesn't do so with the latest build. However, I also tried 2.11.2.Final -- which according to your release notes still exhibits the issue -- and it also passes; Thus the specific test harness is not reliable.

I will try it on our staging env with real code and a more representative load test as soon as possible.

nvrs avatar Aug 09 '22 14:08 nvrs

2.11.2 might have fixed your case, the PRs that followed fix similar cases but for the Graphql context

phillip-kruger avatar Aug 09 '22 22:08 phillip-kruger

We cannot reproduce the issue as described in the ticket onward from 2.11.2.Final.

Since we do not use directly the GraphQL context we haven't observed any related issue, nor can create a reproducer. We can thus consider the issue fixed.

Should I close or is it preferable to wait / you do it?

nvrs avatar Aug 10 '22 12:08 nvrs