Deadlock in Infinispan/Keycloak when running in A/A mode
Describe the bug
Scenario
- Create 4 clients
- Login to 1 client
- Make 3 parallel login requests to the remaining clients (use the authorization code flow (ACF) and make sure all requests carry the correct cookies, so that no new user session is created and the new client sessions are added to the existing user session created in the previous step)
Note the authorization code flow consists of two requests:
- Request 1 creates a new client session, updates the user session with the new client-session relationship, and then returns a code
- Request 2 exchanges the code for tokens: based on the code, it looks up the user session and the corresponding client session
The test case implementation: https://github.com/mhajas/keycloak-benchmark/commit/19ac181408532cb453cd8a612db18a6dfe911b7c#diff-f2bc80ecb43ebc60741ae07893957eadf5ee1a552949320537fd8cb674f8efb7R98-R174
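For illustration, request 2 of the ACF (the code-to-token exchange) boils down to a single form POST against the standard OIDC token endpoint. A minimal sketch; all host, realm, and credential values are placeholders, not the ones from this setup:

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class CodeToTokenSketch {
    public static void main(String[] args) throws Exception {
        // All values below are placeholders, not the ones from this setup.
        String form = "grant_type=authorization_code"
                + "&code=placeholder-code-from-request-1"
                + "&redirect_uri=" + URLEncoder.encode("https://client.example.com/callback", StandardCharsets.UTF_8)
                + "&client_id=client-0"
                + "&client_secret=placeholder-secret";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://keycloak.example.com/realms/test-realm/protocol/openid-connect/token"))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(form))
                .build();
        // A 400 with error=invalid_grant / "User session not found" from this
        // request is exactly the failure mode described in this issue.
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```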
To reproduce the failure it is necessary to run the test several times in a row (I am using the IntelliJ run-until-failure feature). To make it more reproducible, increase the number of parallel requests; I used only three because that produces fewer logs.
When the failure occurs, one of the 3 parallel requests fails when exchanging the code for a token (request 2 of the ACF) because it finds a user session without the relationship information to the client session created in request 1.
The failure event:
{
"timestamp": "2024-03-06T14:04:58.354120738Z",
"sequence": 15934,
"loggerClassName": "org.jboss.logging.Logger",
"loggerName": "org.keycloak.events",
"level": "WARN",
"message": "type=\"CODE_TO_TOKEN_ERROR\", realmId=\"0045d523-1d52-4324-becc-12cb88d86195\", clientId=\"client-0\", userId=\"null\", ipAddress=\"10.128.2.13\", error=\"user_session_not_found\", grant_type=\"authorization_code\", code_id=\"252c47c9-b375-4f35-9ad3-e13074a63e34\", client_auth_method=\"client-secret\"",
"threadName": "executor-thread-8",
"threadId": 135,
"mdc": {
"trace_flags": "00",
"trace_id": "6d785db2978fe59a9d0e3c8854d46cfc",
"span_id": "a98a2fda4b84e881"
},
"ndc": "",
"hostName": "keycloak-0",
"processName": "QuarkusEntryPoint",
"processId": 1
}
Don't get confused by the error user_session_not_found: I can confirm the user session is found; however, it does not contain the correct client session.
Keycloak setup
Using rosa-cross-dc with the following .env
# GENERIC
TIMEOUT=1200 # 20 minutes in seconds - 300 may be too low if we need to wait for machines to be provisioned
ROSA_CLUSTER_NAME_1=gh-keycloak-a
ROSA_CLUSTER_NAME_2=gh-keycloak-b
# AURORA
AURORA_CLUSTER=mhajas-aurora
# ROUTE53
KC_CLIENT_URL=client.gh-keycloak-a-gh-keycloak-b-mze0mzik.keycloak-benchmark.com
KC_HEALTH_URL_CLUSTER_1=primary.gh-keycloak-a-gh-keycloak-b-mze0mzik.keycloak-benchmark.com
KC_HEALTH_URL_CLUSTER_2=backup.gh-keycloak-a-gh-keycloak-b-mze0mzik.keycloak-benchmark.com
# INFINISPAN
KC_ISPN_NAMESPACE=mhajas-keycloak
CROSS_DC_MODE=SYNC
# KEYCLOAK
KC_NAMESPACE_PREFIX='mhajas-'
KC_DATABASE=aurora-postgres
KC_DB_POOL_MIN_SIZE=15
KC_DB_POOL_MAX_SIZE=15
KC_DB_POOL_INITIAL_SIZE=15
KC_STORAGE=
KC_INSTANCES=1
KC_DISABLE_STICKY_SESSION='true'
KC_MEMORY_REQUESTS_MB=2048
KC_MEMORY_LIMITS_MB=3000
KC_HEAP_MAX_MB=2048
KC_HEAP_INIT_MB=1024
KC_CPU_REQUESTS=8
KC_CPU_LIMITS=8
KC_OTEL='true'
KC_CUSTOM_INFINISPAN_CONFIG_FILE=config/kcb-infinispan-cache-remote-store-config.xml
To get more Keycloak logs, I updated the Keycloak resource with the following additionalOptions:
spec:
additionalOptions:
- name: log-level
value: org.keycloak.models.sessions.infinispan:trace
Keycloak does the following when updating the remote Infinispan: https://github.com/keycloak/keycloak/blob/39299eeb38df999816b49dab82ac8afe083458c8/model/infinispan/src/main/java/org/keycloak/models/sessions/infinispan/remotestore/RemoteCacheInvoker.java#L147-L181
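In simplified form, this is an optimistic read-modify-replace loop against the remote cache. A minimal sketch under my reading of the linked code, not a verbatim copy; the retry limit and the update operator are stand-ins:

```java
import org.infinispan.client.hotrod.MetadataValue;
import org.infinispan.client.hotrod.RemoteCache;

import java.util.function.UnaryOperator;

public class OptimisticReplaceSketch {

    static final int MAX_REPLACE_RETRIES = 10; // assumed limit, for the sketch only

    // Read the entry together with its Hot Rod version, apply the update,
    // and retry if another writer changed the entry in the meantime.
    static <K, V> boolean replaceWithRetries(RemoteCache<K, V> cache, K key,
                                             UnaryOperator<V> update) {
        for (int i = 0; i < MAX_REPLACE_RETRIES; i++) {
            MetadataValue<V> versioned = cache.getWithMetadata(key);
            if (versioned == null) {
                return false; // entry no longer exists, nothing to update
            }
            V updated = update.apply(versioned.getValue());
            // replaceWithVersion is a compare-and-set: it should only succeed
            // if the server-side version still matches the one read above.
            // Per the logs below, this CAS succeeded on BOTH sites for v101.
            if (cache.replaceWithVersion(key, updated, versioned.getVersion())) {
                return true;
            }
        }
        return false;
    }
}
```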
This works well in most cases, but when the failure occurs I can see the following:
The failure scenario details
Clients used in the test and where the session was created
cross-dc-test-client - 7673f26a-bc7d-4f2b-89b8-8386466cb125 -> DC1
client-0 - 42d9cce8-4538-4b8f-a289-81945cd20118 -> DC2
client-1 - 40320f36-506e-4ef4-8db3-018cba236a1f -> DC1
client-2 - 07a5e78d-0891-401b-8b84-c73299c78811 -> DC1
User session id: 252c47c9-b375-4f35-9ad3-e13074a63e34
Failed login is for client-0 - 42d9cce8-4538-4b8f-a289-81945cd20118 -> DC2
Logs
There are a lot of logs produced in both DCs, so I won't paste them all here. We can go through them together if anyone is interested. Here is the most suspicious thing I can see:
DC1
{"timestamp":"2024-03-06T14:04:57.462915125Z","sequence":17385,"loggerClassName":"org.jboss.logging.Logger","loggerName":"org.keycloak.models.sessions.infinispan.remotestore.RemoteCacheInvoker","level":"TRACE","message":"Node name: keycloak-0-31646, Site name: gh-keycloak-a, replaceIteration: 1: Replaced entity version 101 in remote cache: UserSessionEntity [id=252c47c9-b375-4f35-9ad3-e13074a63e34, realm=0045d523-1d52-4324-becc-12cb88d86195, lastSessionRefresh=1709733893, clients=[07a5e78d-0891-401b-8b84-c73299c78811, 40320f36-506e-4ef4-8db3-018cba236a1f, 7673f26a-bc7d-4f2b-89b8-8386466cb125]]","threadName":"executor-thread-8","threadId":135,"mdc":{"trace_flags":"00","trace_id":"05970012dfed3d757bcdc8f2c9bc23b8","span_id":"1435f66b59bd830b"},"ndc":"","hostName":"keycloak-0","processName":"QuarkusEntryPoint","processId":1}
DC2
{"timestamp":"2024-03-06T14:04:57.460878962Z","sequence":15897,"loggerClassName":"org.jboss.logging.Logger","loggerName":"org.keycloak.models.sessions.infinispan.remotestore.RemoteCacheInvoker","level":"TRACE","message":"Node name: keycloak-0-65362, Site name: gh-keycloak-b, replaceIteration: 2: Replaced entity version 101 in remote cache: UserSessionEntity [id=252c47c9-b375-4f35-9ad3-e13074a63e34, realm=0045d523-1d52-4324-becc-12cb88d86195, lastSessionRefresh=1709733893, clients=[40320f36-506e-4ef4-8db3-018cba236a1f, 42d9cce8-4538-4b8f-a289-81945cd20118, 7673f26a-bc7d-4f2b-89b8-8386466cb125]]","threadName":"executor-thread-8","threadId":135,"mdc":{"trace_flags":"00","trace_id":"678d828f31beaf4c6e2f39d93e2a9f22","span_id":"82b845e2170c9980"},"ndc":"","hostName":"keycloak-0","processName":"QuarkusEntryPoint","processId":1}
From the above, it is visible that both sites succeed in replacing the same entity at version 101. Note:
- The time difference between these two log messages is only 2 ms
- Also notice the difference in the clients list between the two messages
Then both DCs submit an event:
CLIENT_CACHE_ENTRY_MODIFIED, key=252c47c9-b375-4f35-9ad3-e13074a63e34, version=102
I can see the processing of the event in DC2 (This is printed here):
{"timestamp":"2024-03-06T14:04:57.502877648Z","sequence":15909,"loggerClassName":"org.jboss.logging.Logger","loggerName":"org.keycloak.models.sessions.infinispan.remotestore.RemoteCacheSessionListener","level":"DEBUG","message":"Read session entity from the remote cache: UserSessionEntity [id=252c47c9-b375-4f35-9ad3-e13074a63e34, realm=0045d523-1d52-4324-becc-12cb88d86195, lastSessionRefresh=1709733893, clients=[07a5e78d-0891-401b-8b84-c73299c78811, 40320f36-506e-4ef4-8db3-018cba236a1f, 7673f26a-bc7d-4f2b-89b8-8386466cb125]] . replaceRetries=1","threadName":"kc-client-listener-sessions-924-3","threadId":120,"mdc":{},"ndc":"","hostName":"keycloak-0","processName":"QuarkusEntryPoint","processId":1}
The missing client id is the one whose login request fails: 42d9cce8-4538-4b8f-a289-81945cd20118
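For context, a minimal sketch of how a Hot Rod client listener consumes these CLIENT_CACHE_ENTRY_MODIFIED events; this is an illustration of the mechanism, not Keycloak's actual RemoteCacheSessionListener code:

```java
import org.infinispan.client.hotrod.RemoteCache;
import org.infinispan.client.hotrod.annotation.ClientCacheEntryModified;
import org.infinispan.client.hotrod.annotation.ClientListener;
import org.infinispan.client.hotrod.event.ClientCacheEntryModifiedEvent;

@ClientListener
public class SessionModifiedListenerSketch {

    private final RemoteCache<Object, Object> remoteCache;

    public SessionModifiedListenerSketch(RemoteCache<Object, Object> remoteCache) {
        this.remoteCache = remoteCache;
    }

    @ClientCacheEntryModified
    public void entryModified(ClientCacheEntryModifiedEvent<Object> event) {
        // The modified event carries only the key and the new version; the
        // value must be re-read from the remote cache, and that read can
        // observe the other site's concurrent update, as in the DC2 log above.
        Object entity = remoteCache.get(event.getKey());
        // ... update the embedded (local) cache with the re-read entity ...
    }
}
```

A listener like this would be registered via remoteCache.addClientListener(...); the important point is the re-read, which is where the stale clients list shows up.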
I also tried pulling some info from the Infinispan logs, but I don't see anything that would tell me more about what happened. We probably need to enable some additional logging levels.
DC1
March 6th 2024, 15:04:57.458 infinispan-1 14:04:57,457 ERROR (non-blocking-thread--p2-t11) [org.infinispan.interceptors.impl.InvocationContextInterceptor] ISPN000136: Error executing command PutKeyValueCommand on Cache 'sessions', writing keys [WrappedByteArray[03\>\$\2\5\2\c\4\7\c\9\-\b\3\7\5\-\4\f\3\5\-\9\a\d\3\-\e\1\3\0\7\4\a\6\3\e\3\4 (39 bytes)]] org.infinispan.util.concurrent.TimeoutException: ISPN000299: Unable to acquire lock after 4 seconds for key WrappedByteArray[03\>\$\2\5\2\c\4\7\c\9\-\b\3\7\5\-\4\f\3\5\-\9\a\d\3\-\e\1\3\0\7\4\a\6\3\e\3\4 (39 bytes)] and requestor CommandInvocation:infinispan-1-5398:9129. Lock is held by CommandInvocation:infinispan-1-5398:9127
March 6th 2024, 15:04:57.458 infinispan-1 at org.infinispan.util.concurrent.locks.impl.DefaultLockManager$KeyAwareExtendedLockPromise.get(DefaultLockManager.java:299)
March 6th 2024, 15:04:57.458 infinispan-1 at org.infinispan.util.concurrent.locks.impl.DefaultLockManager$KeyAwareExtendedLockPromise.get(DefaultLockManager.java:229)
March 6th 2024, 15:04:57.458 infinispan-1 at org.infinispan.util.concurrent.locks.impl.InfinispanLock$LockPlaceHolder.checkState(InfinispanLock.java:440)
March 6th 2024, 15:04:57.458 infinispan-1 at org.infinispan.util.concurrent.locks.impl.InfinispanLock$LockPlaceHolder.lambda$toInvocationStage$3(InfinispanLock.java:416)
March 6th 2024, 15:04:57.458 infinispan-1 at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:646)
March 6th 2024, 15:04:57.458 infinispan-1 at java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:482)
March 6th 2024, 15:04:57.458 infinispan-1 at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
March 6th 2024, 15:04:57.458 infinispan-1 at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
March 6th 2024, 15:04:57.458 infinispan-1 at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
March 6th 2024, 15:04:57.458 infinispan-1 at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:413)
March 6th 2024, 15:04:57.458 infinispan-1 at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
March 6th 2024, 15:04:57.458 infinispan-1 at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
March 6th 2024, 15:04:57.458 infinispan-1 at java.base/java.lang.Thread.run(Thread.java:840)
March 6th 2024, 15:04:57.458 infinispan-1
March 6th 2024, 15:04:57.458 infinispan-1 14:04:57,458 WARN (non-blocking-thread--p2-t11) [org.infinispan.CLUSTER] ISPN000071: Caught exception when handling command SingleXSiteRpcCommand{command=PutKeyValueCommand{key=WrappedByteArray[03\>\$\2\5\2\c\4\7\c\9\-\b\3\7\5\-\4\f\3\5\-\9\a\d\3\-\e\1\3\0\7\4\a\6\3\e\3\4 (39 bytes)], value=WrappedByteArray[03040B000000\D\o\r\g\.\k\e\y\c\l\o\a\k\.\m\o\d\e\l\s\.\s\e\s\s\i\o\n\s\.\i\n\f\i\n\i\s\p\a\n\.\c\h\a\n\g\e\s\.\S\e\s\s\i\o\n\E\n... (1429 bytes)], flags=[], commandInvocationId=CommandInvocation:bb71181d-ab9b-419b-a642-db5dea8ce432:6613, putIfAbsent=false, returnEntry=false, valueMatcher=MATCH_ALWAYS, metadata=EmbeddedExpirableMetadata{version=NumericVersion{version=71}, lifespan=35999612, maxIdle=37800000}, internalMetadata=null, successful=true, topologyId=-1}} org.infinispan.util.concurrent.TimeoutException: ISPN000299: Unable to acquire lock after 4 seconds for key WrappedByteArray[03\>\$\2\5\2\c\4\7\c\9\-\b\3\7\5\-\4\f\3\5\-\9\a\d\3\-\e\1\3\0\7\4\a\6\3\e\3\4 (39 bytes)] and requestor CommandInvocation:infinispan-1-5398:9129. Lock is held by CommandInvocation:infinispan-1-5398:9127
March 6th 2024, 15:04:57.458 infinispan-1 at org.infinispan.util.concurrent.locks.impl.DefaultLockManager$KeyAwareExtendedLockPromise.get(DefaultLockManager.java:299)
March 6th 2024, 15:04:57.458 infinispan-1 at org.infinispan.util.concurrent.locks.impl.DefaultLockManager$KeyAwareExtendedLockPromise.get(DefaultLockManager.java:229)
March 6th 2024, 15:04:57.458 infinispan-1 at org.infinispan.util.concurrent.locks.impl.InfinispanLock$LockPlaceHolder.checkState(InfinispanLock.java:440)
March 6th 2024, 15:04:57.458 infinispan-1 at org.infinispan.util.concurrent.locks.impl.InfinispanLock$LockPlaceHolder.lambda$toInvocationStage$3(InfinispanLock.java:416)
March 6th 2024, 15:04:57.458 infinispan-1 at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:646)
March 6th 2024, 15:04:57.458 infinispan-1 at java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:482)
March 6th 2024, 15:04:57.458 infinispan-1 at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
March 6th 2024, 15:04:57.458 infinispan-1 at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
March 6th 2024, 15:04:57.458 infinispan-1 at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
March 6th 2024, 15:04:57.458 infinispan-1 at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:413)
March 6th 2024, 15:04:57.458 infinispan-1 at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
March 6th 2024, 15:04:57.458 infinispan-1 at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
March 6th 2024, 15:04:57.458 infinispan-1 at java.base/java.lang.Thread.run(Thread.java:840)
March 6th 2024, 15:04:57.458 infinispan-1
DC2
March 6th 2024, 15:04:57.460 infinispan-0 14:04:57,460 WARN (jgroups-160,infinispan-0-58913) [org.infinispan.xsite.BackupSenderImpl] ISPN000202: Encountered issues while backing up data for cache sessions to site gh-keycloak-a org.infinispan.remoting.RemoteException: ISPN000217: Received exception from infinispan-1-5398:gh-keycloak-a, see cause for remote stack trace
March 6th 2024, 15:04:57.460 infinispan-0 at org.infinispan.remoting.transport.ResponseCollectors.wrapRemoteException(ResponseCollectors.java:26)
March 6th 2024, 15:04:57.460 infinispan-0 at org.infinispan.remoting.transport.ValidSingleResponseCollector.withException(ValidSingleResponseCollector.java:37)
March 6th 2024, 15:04:57.460 infinispan-0 at org.infinispan.remoting.transport.ValidSingleResponseCollector.addResponse(ValidSingleResponseCollector.java:21)
March 6th 2024, 15:04:57.460 infinispan-0 at org.infinispan.remoting.transport.jgroups.SingleSiteRequest.receiveResponse(SingleSiteRequest.java:50)
March 6th 2024, 15:04:57.460 infinispan-0 at org.infinispan.remoting.transport.jgroups.SingleSiteRequest.onResponse(SingleSiteRequest.java:32)
March 6th 2024, 15:04:57.460 infinispan-0 at org.infinispan.remoting.transport.impl.RequestRepository.addResponse(RequestRepository.java:51)
March 6th 2024, 15:04:57.460 infinispan-0 at org.infinispan.remoting.transport.jgroups.JGroupsTransport.processResponse(JGroupsTransport.java:1570)
March 6th 2024, 15:04:57.460 infinispan-0 at org.infinispan.remoting.transport.jgroups.JGroupsTransport.processMessage(JGroupsTransport.java:1471)
March 6th 2024, 15:04:57.460 infinispan-0 at org.infinispan.remoting.transport.jgroups.JGroupsTransport$ChannelCallbacks.up(JGroupsTransport.java:1663)
March 6th 2024, 15:04:57.460 infinispan-0 at org.jgroups.JChannel.up(JChannel.java:734)
March 6th 2024, 15:04:57.460 infinispan-0 at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:936)
March 6th 2024, 15:04:57.460 infinispan-0 at org.jgroups.protocols.relay.RELAY2.deliver(RELAY2.java:460)
March 6th 2024, 15:04:57.460 infinispan-0 at org.jgroups.protocols.relay.RELAY2.route(RELAY2.java:352)
March 6th 2024, 15:04:57.460 infinispan-0 at org.jgroups.protocols.relay.RELAY2.handleMessage(RELAY2.java:332)
March 6th 2024, 15:04:57.460 infinispan-0 at org.jgroups.protocols.relay.RELAY2.up(RELAY2.java:160)
March 6th 2024, 15:04:57.460 infinispan-0 at org.jgroups.protocols.FRAG2.up(FRAG2.java:139)
March 6th 2024, 15:04:57.460 infinispan-0 at org.jgroups.protocols.FlowControl.up(FlowControl.java:245)
March 6th 2024, 15:04:57.460 infinispan-0 at org.jgroups.protocols.FlowControl.up(FlowControl.java:245)
March 6th 2024, 15:04:57.460 infinispan-0 at org.jgroups.protocols.pbcast.GMS.up(GMS.java:846)
March 6th 2024, 15:04:57.460 infinispan-0 at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:226)
March 6th 2024, 15:04:57.460 infinispan-0 at org.jgroups.protocols.UNICAST3.deliverMessage(UNICAST3.java:1086)
March 6th 2024, 15:04:57.460 infinispan-0 at org.jgroups.protocols.UNICAST3.addMessage(UNICAST3.java:825)
March 6th 2024, 15:04:57.460 infinispan-0 at org.jgroups.protocols.UNICAST3.handleDataReceived(UNICAST3.java:807)
March 6th 2024, 15:04:57.460 infinispan-0 at org.jgroups.protocols.UNICAST3.up(UNICAST3.java:456)
March 6th 2024, 15:04:57.460 infinispan-0 at org.jgroups.protocols.pbcast.NAKACK2.up(NAKACK2.java:680)
March 6th 2024, 15:04:57.460 infinispan-0 at org.jgroups.protocols.VERIFY_SUSPECT2.up(VERIFY_SUSPECT2.java:105)
March 6th 2024, 15:04:57.460 infinispan-0 at org.jgroups.protocols.FailureDetection.up(FailureDetection.java:180)
March 6th 2024, 15:04:57.460 infinispan-0 at org.jgroups.protocols.FD_SOCK2.up(FD_SOCK2.java:188)
March 6th 2024, 15:04:57.460 infinispan-0 at org.jgroups.protocols.MERGE3.up(MERGE3.java:274)
March 6th 2024, 15:04:57.460 infinispan-0 at org.jgroups.protocols.Discovery.up(Discovery.java:294)
March 6th 2024, 15:04:57.460 infinispan-0 at org.jgroups.stack.Protocol.up(Protocol.java:340)
March 6th 2024, 15:04:57.460 infinispan-0 at org.jgroups.protocols.TP.passMessageUp(TP.java:1184)
March 6th 2024, 15:04:57.460 infinispan-0 at org.jgroups.util.SubmitToThreadPool$SingleMessageHandler.run(SubmitToThreadPool.java:107)
March 6th 2024, 15:04:57.460 infinispan-0 at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
March 6th 2024, 15:04:57.460 infinispan-0 at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
March 6th 2024, 15:04:57.460 infinispan-0 at java.base/java.lang.Thread.run(Thread.java:840)
March 6th 2024, 15:04:57.460 infinispan-0 Caused by: org.infinispan.util.concurrent.TimeoutException: ISPN000299: Unable to acquire lock after 4 seconds for key WrappedByteArray[03\>\$\2\5\2\c\4\7\c\9\-\b\3\7\5\-\4\f\3\5\-\9\a\d\3\-\e\1\3\0\7\4\a\6\3\e\3\4 (39 bytes)] and requestor CommandInvocation:infinispan-1-5398:9129. Lock is held by CommandInvocation:infinispan-1-5398:9127
March 6th 2024, 15:04:57.460 infinispan-0 at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
March 6th 2024, 15:04:57.460 infinispan-0 at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)
March 6th 2024, 15:04:57.460 infinispan-0 at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
March 6th 2024, 15:04:57.460 infinispan-0 at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:499)
March 6th 2024, 15:04:57.460 infinispan-0 at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:480)
March 6th 2024, 15:04:57.460 infinispan-0 at org.infinispan.marshall.exts.ThrowableExternalizer.newThrowableInstance(ThrowableExternalizer.java:287)
March 6th 2024, 15:04:57.460 infinispan-0 at org.infinispan.marshall.exts.ThrowableExternalizer.readGenericThrowable(ThrowableExternalizer.java:265)
March 6th 2024, 15:04:57.460 infinispan-0 at org.infinispan.marshall.exts.ThrowableExternalizer.readObject(ThrowableExternalizer.java:240)
March 6th 2024, 15:04:57.460 infinispan-0 at org.infinispan.marshall.exts.ThrowableExternalizer.readObject(ThrowableExternalizer.java:44)
March 6th 2024, 15:04:57.460 infinispan-0 at org.infinispan.marshall.core.GlobalMarshaller.readWithExternalizer(GlobalMarshaller.java:727)
March 6th 2024, 15:04:57.460 infinispan-0 at org.infinispan.marshall.core.GlobalMarshaller.readNonNullableObject(GlobalMarshaller.java:708)
March 6th 2024, 15:04:57.460 infinispan-0 at org.infinispan.marshall.core.GlobalMarshaller.readNullableObject(GlobalMarshaller.java:357)
March 6th 2024, 15:04:57.460 infinispan-0 at org.infinispan.marshall.core.BytesObjectInput.readObject(BytesObjectInput.java:32)
March 6th 2024, 15:04:57.460 infinispan-0 at org.infinispan.remoting.responses.ExceptionResponse$Externalizer.readObject(ExceptionResponse.java:49)
March 6th 2024, 15:04:57.460 infinispan-0 at org.infinispan.remoting.responses.ExceptionResponse$Externalizer.readObject(ExceptionResponse.java:41)
March 6th 2024, 15:04:57.460 infinispan-0 at org.infinispan.marshall.core.GlobalMarshaller.readWithExternalizer(GlobalMarshaller.java:727)
March 6th 2024, 15:04:57.460 infinispan-0 at org.infinispan.marshall.core.GlobalMarshaller.readNonNullableObject(GlobalMarshaller.java:708)
March 6th 2024, 15:04:57.460 infinispan-0 at org.infinispan.marshall.core.GlobalMarshaller.readNullableObject(GlobalMarshaller.java:357)
March 6th 2024, 15:04:57.460 infinispan-0 at org.infinispan.marshall.core.GlobalMarshaller.objectFromObjectInput(GlobalMarshaller.java:191)
March 6th 2024, 15:04:57.460 infinispan-0 at org.infinispan.marshall.core.GlobalMarshaller.objectFromByteBuffer(GlobalMarshaller.java:220)
March 6th 2024, 15:04:57.460 infinispan-0 at org.infinispan.remoting.transport.jgroups.JGroupsTransport.processResponse(JGroupsTransport.java:1562)
March 6th 2024, 15:04:57.460 infinispan-0 ... 29 more
March 6th 2024, 15:04:57.460 infinispan-0
Version
main
Expected behavior
(todo)
Actual behavior
(todo)
How to Reproduce?
Deploy the rosa-cross-dc setup and run https://github.com/mhajas/keycloak-benchmark/commit/19ac181408532cb453cd8a612db18a6dfe911b7c#diff-f2bc80ecb43ebc60741ae07893957eadf5ee1a552949320537fd8cb674f8efb7R98-R174 against it. I am using the IntelliJ run-until-failure feature because it does not fail every time.
The logs in case of failure look like this:
Starting with the concurrent requests
client-0 - 42d9cce8-4538-4b8f-a289-81945cd20118
client-1 - 40320f36-506e-4ef4-8db3-018cba236a1f
client-2 - 07a5e78d-0891-401b-8b84-c73299c78811
dc1 - client-1 - eyJhbGciOiJSUzI1.......
---------------------------------------------------------------
org.opentest4j.AssertionFailedError: Expected return code was 200 but was 400 with response body {"error":"invalid_grant","error_description":"User session not found"} ==> expected: <200> but was: <400>
at org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
at org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
at org.junit.jupiter.api.AssertEquals.failNotEqual(AssertEquals.java:197)
at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:150)
.....
dc2 - client-0 -
---------------------------------------------------------------
dc1 - client-2 - eyJhbGciOiJSUz.....
870
5080
5142
Failure count: 1
Copy and paste any of the JWT tokens into jwt.io; the user session id is present in the session_state JSON field.
Now search Kibana in both DCs for:
kubernetes.pod_name:"keycloak-0" AND kubernetes.namespace_name:"mhajas-keycloak" AND kubernetes.container_name.raw:"keycloak" AND "252c47c9-b375-4f35-9ad3-e13074a63e34"
Note the PR contains some additional logging in the dataset provider which is not strictly necessary, but I added it to print additional info about the cache content in the embedded and remote Infinispan. It is not enabled by default; you can trigger it by calling DC1/2.kc().printSessions() at any moment in the test. The output is then visible in the Keycloak logs.
The numbers at the end of the log denote the time needed to finish the requests. You can see that in case of retried transactions the login takes more time.
Anything else?
No response
This can be relaxed/made much less probable if we execute the updates with backoff on the Keycloak side.
But we should probably first investigate how both sites may succeed with the replaceWithVersion operation.
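To illustrate the backoff idea from the previous sentence, a minimal sketch with randomized (jittered) exponential delays between retries; the bounds are illustrative, not tuned values:

```java
import java.util.concurrent.ThreadLocalRandom;

public final class RetryBackoff {

    private static final long BASE_DELAY_MS = 10;  // illustrative
    private static final long MAX_DELAY_MS = 500;  // illustrative

    // Full jitter: sleep a random time in [0, min(max, base * 2^attempt)].
    // The random component makes it unlikely that the two sites retry the
    // same key at exactly the same moment again.
    static void backoff(int attempt) throws InterruptedException {
        long cap = Math.min(MAX_DELAY_MS, BASE_DELAY_MS << Math.min(attempt, 16));
        Thread.sleep(ThreadLocalRandom.current().nextLong(cap + 1));
    }
}
```

Calling backoff(i) between iterations of the replace loop would spread out the contending writers instead of having them hammer the same lock in lockstep.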
@mhajas - thank you for the great write-up above!
@pruivo: AFAIK we are not using explicit transactions when calling the replace. Thinking out loud: How does Infinispan handle this with regards to transactions when it updates both the current site and the other site in xsite mode? Where does Infinispan check that the version passed in as a parameter matches the state of the cache - only on the local site, but not on the remote site?
I've started reading (again) https://infinispan.org/docs/stable/titles/xsite/xsite.html#cross-site-backup-strategies_cross-site-replication and I see that the default is "WARN" if "a backup operation fails". I wonder if the log entries Michal captured above are those warnings, and if we should set them to "FAIL" to get the behavior we want to see: "Throw exceptions when backup operations fail and attempt to stop writes to the local cache."
WDYT?
None of the caches are transactional. The primary owner makes all decisions and, after updating the backup owner(s), replicates to the other site. Changing to FAIL will notify Keycloak about the error but will not revert the local cache value, which is the up-to-date value.
The only way cross-site replication can roll back a value is when it is configured to be part of a distributed transaction. Performance and response time can be significantly worse.
Summary of discussion with @pruivo:
- in SYNC mode, we can have a cross-site two-phase transaction. This can be configured in the external Infinispan (see the config sketch below). It is probably slow. Deadlocks can happen when acquiring the locks.
- in ASYNC mode, there are no transactions. Still, we could have a conflict, and the conflict resolution could either show a warning, merge, or drop the entry. Spoiler: if we had the sessions in the database, conflict resolution would drop the entry from the in-memory cache and reload it from the database the next time it is needed.
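For illustration, a sketch (untested here) of how the SYNC-mode two-phase option could be expressed in the external Infinispan cache XML; the transaction and locking modes are assumptions, not values from this issue:

```xml
<!-- Sketch only: make the sessions cache transactional so the SYNC backup
     can participate in a two-phase commit across sites. Transaction mode
     and locking mode are assumptions, not tested in this issue. -->
<distributed-cache name="sessions">
  <transaction mode="NON_XA" locking="PESSIMISTIC"/>
  <backups>
    <backup site="gh-keycloak-a" strategy="SYNC" two-phase-commit="true"/>
  </backups>
</distributed-cache>
```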
TODOs:
- Change the synchronous strategy failure policy from WARN to FAIL so that Keycloak will retry (this change is the prerequisite; we still need to test whether Keycloak retries writing to the local external Infinispan and then finally updates the remote external Infinispan). When doing this, Keycloak will then always throw an error if writing to the remote site fails, even if it is permanently unavailable. The remedy for this scenario is a policy that marks the remote site as offline when too many errors happen. From that moment on, the other site is offline and needs to be brought online manually: https://infinispan.org/docs/stable/titles/xsite/xsite.html#cross-site-automatic-offline_cross-site-replication. Let's start with a min-wait of 30 seconds. @ahus1 would like to see that for A/P in KC24, as we currently never mark a remote site offline automatically no matter how many writes fail. Current behavior: fail fast once the cross-site view has changed, and continue to log warnings. @pruivo - please prepare a PR for this that we would backport to KC24, and run a test with the remote site failing and the gossip router failing. A config sketch for this follows this list.
- Run A/P with transactions and SYNC mode to see how much it affects the performance in a Multi-AZ single-region setup (with transactions enabled in all caches). @pruivo - please prepare a second PR for this and compare the numbers.
- To check: backoff for retries with a random component in the waiting time. This would decrease the deadlock probability on retries (see the backoff sketch in the comment above).
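A hedged sketch of what the first TODO could look like in the external Infinispan XML (same format as the kcb-infinispan-cache-remote-store-config.xml referenced above); the after-failures value is an assumption, and min-wait="30000" matches the 30 seconds proposed:

```xml
<!-- Sketch only: switch the sessions cache backup from WARN to FAIL and
     enable automatic offline handling. after-failures is an assumed value;
     min-wait of 30000 ms matches the proposal above. -->
<distributed-cache name="sessions">
  <backups>
    <backup site="gh-keycloak-a" strategy="SYNC" failure-policy="FAIL">
      <take-offline after-failures="3" min-wait="30000"/>
    </backup>
  </backups>
</distributed-cache>
```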
Did a quick test of transactional vs. non-transactional with A/P mode. The results are pretty much the same, with a minor slowdown.
Non-transactional: (results chart not included here)
Transactional: (results chart not included here)
This needs to be revisited for A/A mode with persistent sessions. A test case should be added to the functional testsuite.