Offline User Session deadlock caused by PersisterLastSessionRefreshStore
Describe the bug
A deadlock can occur under some circumstances with the latest Keycloak.
It happens when there are two cluster nodes and both update the DB from the persister at the same time:
Node1 issues something like (real IDs replaced with placeholders):
update OFFLINE_USER_SESSION set LAST_SESSION_REFRESH=1656332405 where REALM_ID='X' and
OFFLINE_FLAG='1' and (USER_SESSION_ID in ('a' , 'b' , 'c' , 'd'))
Node2 issues something like (real IDs replaced with placeholders):
update OFFLINE_USER_SESSION set LAST_SESSION_REFRESH=1656332405 where REALM_ID='Y' and
OFFLINE_FLAG='1' and (USER_SESSION_ID in ('aa' , 'bb' , 'cc' , 'dd'))
According to the lock monitor, both statements come from the named JPQL query "updateUserSessionLastSessionRefresh", called from different RH-SSO nodes.
JPQL "updateUserSessionLastSessionRefresh" definition:
update PersistentUserSessionEntity sess set lastSessionRefresh = :lastSessionRefresh where sess.realmId = :realmId AND
sess.offline = :offline AND sess.userSessionId IN (:userSessionIds)
https://github.com/keycloak/keycloak/blob/19.0.1/model/jpa/src/main/java/org/keycloak/models/jpa/session/PersistentUserSessionEntity.java#L38-L39
When a token refresh or token introspection is requested for an offline session, RH-SSO needs to update the LAST_SESSION_REFRESH column of the corresponding OFFLINE_USER_SESSION record. Instead of issuing the update query immediately, RH-SSO enqueues the update request in AbstractLastSessionRefreshStore and applies the queued updates periodically. JPQL "updateUserSessionLastSessionRefresh" is the query used for this bulk update.
Because token refresh and token introspection requests from clients are distributed across RH-SSO nodes, different nodes may issue LAST_SESSION_REFRESH update queries at the same time. If those queries touch the same sessions, a deadlock can result, because each RH-SSO node wraps its multiple update requests in one transaction and the nodes may update the sessions in different orders.
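To make the enqueue-then-flush behaviour described above concrete, here is a minimal sketch. It is not the actual AbstractLastSessionRefreshStore code; the class name `RefreshQueueSketch` and its methods are hypothetical and only illustrate how refreshes can accumulate in memory and then be drained as one batch, which a node would then write in a single transaction.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration only; not a Keycloak class.
final class RefreshQueueSketch {

    // userSessionId -> most recent lastSessionRefresh value (epoch seconds)
    private final ConcurrentHashMap<String, Integer> pending = new ConcurrentHashMap<>();

    // Called on token refresh / introspection: record the update, do not touch the DB yet.
    void enqueue(String userSessionId, int lastSessionRefresh) {
        pending.merge(userSessionId, lastSessionRefresh, Integer::max);
    }

    // Called periodically: take everything queued so far as one batch.
    // The caller would then issue the bulk update(s) for this batch in one transaction.
    Map<String, Integer> drain() {
        Map<String, Integer> batch = new HashMap<>();
        for (String key : pending.keySet()) {
            Integer value = pending.remove(key);
            if (value != null) {
                batch.put(key, value);
            }
        }
        return batch;
    }
}
```

Because each node drains its own queue independently, two nodes can end up holding batches that touch the same sessions, which is the precondition for the deadlock described in this report.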
Deadlock flow example
If clients send requests for offline sessions X and Y to different RH-SSO nodes (A and B), Node A and Node B each issue the update queries for both sessions within one transaction, but in opposite orders. This causes a deadlock.
RH-SSO Node A ----> Begin Tx ----> Update Realm001 session X ---------------> Update Realm002 session Y -----------> Deadlock! (Waiting for Realm002 session Y)
RH-SSO Node B --------> Begin Tx ------------> Update Realm002 session Y -------------> Update Realm001 session X -> Deadlock! (Waiting for Realm001 session X)
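The flow above is a lock-order inversion: the two transactions lock the same two rows in opposite orders. As a small illustration (a hypothetical helper, not part of Keycloak), the following sketch checks whether two update sequences contain such an inverted pair.

```java
import java.util.List;

// Hypothetical illustration only; not a Keycloak class.
final class LockOrderCheck {

    // Returns true if some pair of resources appears in one order in txA
    // and in the opposite order in txB, which is the precondition for a deadlock.
    static boolean hasInvertedPair(List<String> txA, List<String> txB) {
        for (int i = 0; i < txA.size(); i++) {
            for (int j = i + 1; j < txA.size(); j++) {
                int posFirst = txB.indexOf(txA.get(i));
                int posSecond = txB.indexOf(txA.get(j));
                if (posFirst >= 0 && posSecond >= 0 && posFirst > posSecond) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

With Node A updating `Realm001/X` then `Realm002/Y` and Node B updating `Realm002/Y` then `Realm001/X`, the check reports an inverted pair. If both nodes use the same order, it does not.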
Solution proposal
Issuing the session updates in the same sorted order on every node will resolve the deadlock.
I think sorting the items by key at the following locations will work:
https://github.com/keycloak/keycloak/blob/19.0.1/model/infinispan/src/main/java/org/keycloak/models/sessions/infinispan/changes/sessions/PersisterLastSessionRefreshStore.java#L64
https://github.com/keycloak/keycloak/blob/19.0.1/model/infinispan/src/main/java/org/keycloak/models/sessions/infinispan/changes/sessions/PersisterLastSessionRefreshStore.java#L72
RH-SSO Node A ----> Begin Tx ----> Update Realm001 session X ---------------> Update Realm002 session Y ---> Commit ---> Finish!
RH-SSO Node B --------> Begin Tx ------------> Update Realm001 session X ------> (Wait for Realm001 session X lock) -----> Update Realm002 session Y --> Commit --> Finish!
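A minimal sketch of the proposed ordering follows. It is not the actual PersisterLastSessionRefreshStore code; `SortedRefreshOrder`, `PendingRefresh`, and `orderForUpdate` are hypothetical names used only for illustration. The idea is to group the queued updates by realm and iterate realms in sorted order, so every node issues its per-realm update statements in the same sequence.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration only; not a Keycloak class.
final class SortedRefreshOrder {

    // Stand-in for one queued update (assumed shape, not a Keycloak type).
    static final class PendingRefresh {
        final String realmId;
        final String userSessionId;

        PendingRefresh(String realmId, String userSessionId) {
            this.realmId = realmId;
            this.userSessionId = userSessionId;
        }
    }

    // Groups session IDs by realm. Realms and session IDs both come out in
    // natural String order, so the result does not depend on arrival order.
    static Map<String, List<String>> orderForUpdate(List<PendingRefresh> pending) {
        TreeMap<String, TreeSet<String>> byRealm = new TreeMap<>();
        for (PendingRefresh p : pending) {
            byRealm.computeIfAbsent(p.realmId, k -> new TreeSet<>()).add(p.userSessionId);
        }
        // LinkedHashMap preserves the sorted realm order for the caller's loop.
        Map<String, List<String>> ordered = new LinkedHashMap<>();
        for (Map.Entry<String, TreeSet<String>> e : byRealm.entrySet()) {
            ordered.put(e.getKey(), new ArrayList<>(e.getValue()));
        }
        return ordered;
    }
}
```

A caller would loop over the returned map and issue one bulk update per realm in that order. Note that this only fixes the order of the statements within the transaction. The order in which the database locks rows inside a single `IN (...)` update is decided by the database, so sorting the session IDs within one statement may not be sufficient on its own.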
Could you check my analysis? If it is valid, could you provide a fix and include it in the next RH-SSO 7.6 patch update?
Version
19.0.1
Expected behavior
No response
Actual behavior
No response
How to Reproduce?
No response
Anything else?
This is related to ticket 293.
Originally reported by @ynojima