azure-ad-plugin

Intermittent group lookup failure will lock you out of Jenkins for 1 hour (Do not have sufficient privileges to fetch your belonging groups' authorities.)

Open glorang opened this issue 9 months ago • 2 comments

Jenkins and plugins versions report

Environment

Jenkins 2.479.3
Microsoft Entra ID (previously Azure AD) Plugin 531.v13107da_f2635

What Operating System are you using (both controller, and any agents involved in the problem)?

Debian 11

Reproduction steps

N/A

Expected Results

See below

Actual Results

See below

Anything else?

The sysadmin speaking here, not the Java dev so bear with me :-)

Over the past two days we have been getting intermittent failures during SAML authentication. E.g. when you go from Jenkins -> Microsoft -> Jenkins, it "hangs" for 10-20 seconds on the Microsoft part/URL.

After those 10-20 seconds you are authenticated (your name is shown top-right) and redirected to Jenkins again, but you get an "Access Denied - <UUID> is missing the Overall/Read permission" error and cannot do anything.

At the same time the following stack trace is thrown in the Jenkins log:

Jan 30 11:12:05 jenkins-server jenkins[1337699]: 2025-01-30 10:12:05.638+0000 [id=180804]        WARNING        c.m.j.azuread.AzureCachePool#lambda$getBelongingGroupsByOid$1: Do not have sufficient privileges to fetch your belonging groups' authorities.
Jan 30 11:12:05 jenkins-server jenkins[1337699]: java.io.IOException: Canceled
Jan 30 11:12:05 jenkins-server jenkins[1337699]:         at PluginClassLoader for okhttp-api//okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.kt:72)
Jan 30 11:12:05 jenkins-server jenkins[1337699]:         at PluginClassLoader for okhttp-api//okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
Jan 30 11:12:05 jenkins-server jenkins[1337699]:         at PluginClassLoader for azure-ad//com.microsoft.graph.httpcore.RedirectHandler.intercept(RedirectHandler.java:137)
Jan 30 11:12:05 jenkins-server jenkins[1337699]:         at PluginClassLoader for okhttp-api//okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
Jan 30 11:12:05 jenkins-server jenkins[1337699]:         at PluginClassLoader for azure-ad//com.microsoft.graph.httpcore.RetryHandler.intercept(RetryHandler.java:177)
Jan 30 11:12:05 jenkins-server jenkins[1337699]:         at PluginClassLoader for okhttp-api//okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
Jan 30 11:12:05 jenkins-server jenkins[1337699]:         at PluginClassLoader for azure-ad//com.microsoft.graph.httpcore.AuthenticationHandler.intercept(AuthenticationHandler.java:59)
Jan 30 11:12:05 jenkins-server jenkins[1337699]:         at PluginClassLoader for okhttp-api//okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
Jan 30 11:12:05 jenkins-server jenkins[1337699]:         at PluginClassLoader for azure-ad//com.microsoft.graph.httpcore.TelemetryHandler.intercept(TelemetryHandler.java:68)
Jan 30 11:12:05 jenkins-server jenkins[1337699]:         at PluginClassLoader for okhttp-api//okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
Jan 30 11:12:05 jenkins-server jenkins[1337699]:         at PluginClassLoader for okhttp-api//okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:201)
Jan 30 11:12:05 jenkins-server jenkins[1337699]: Caused: java.io.InterruptedIOException: timeout
Jan 30 11:12:05 jenkins-server jenkins[1337699]:         at PluginClassLoader for okhttp-api//okhttp3.internal.connection.RealCall.timeoutExit(RealCall.kt:398)
Jan 30 11:12:05 jenkins-server jenkins[1337699]:         at PluginClassLoader for okhttp-api//okhttp3.internal.connection.RealCall.callDone(RealCall.kt:360)
Jan 30 11:12:05 jenkins-server jenkins[1337699]:         at PluginClassLoader for okhttp-api//okhttp3.internal.connection.RealCall.noMoreExchanges$okhttp(RealCall.kt:325)
Jan 30 11:12:05 jenkins-server jenkins[1337699]:         at PluginClassLoader for okhttp-api//okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:209)
Jan 30 11:12:05 jenkins-server jenkins[1337699]:         at PluginClassLoader for okhttp-api//okhttp3.internal.connection.RealCall.execute(RealCall.kt:154)
Jan 30 11:12:05 jenkins-server jenkins[1337699]:         at PluginClassLoader for azure-ad//com.microsoft.graph.http.CoreHttpProvider.sendRequestInternal(CoreHttpProvider.java:407)
Jan 30 11:12:05 jenkins-server jenkins[1337699]: Caused: com.microsoft.graph.core.ClientException: Error executing the request

So it looks like, for whatever reason (our network, something in Azure, something in Microsoft's network, ...), there is some hiccup in looking up the group membership, and you are then locked out of Jenkins for 1 hour. Note that the Graph API permissions are fine; in general everything is working very well for us.

Looking at the code here: https://github.com/jenkinsci/azure-ad-plugin/blob/1f7a456edadba6178cedb277317d8cebb0ff21cc/src/main/java/com/microsoft/jenkins/azuread/AzureCachePool.java#L24

It looks like the result will be cached for 1 hour, regardless of whether the lookup was successful or not.

And because of the https://github.com/jenkinsci/azure-ad-plugin/blob/1f7a456edadba6178cedb277317d8cebb0ff21cc/src/main/java/com/microsoft/jenkins/azuread/AzureCachePool.java#L76 we'll be locked out of our Jenkins for 1 hour.

After 1 hour access is OK again (sometimes we need to log out / log in). It also looks like multiple users are affected at the same time: e.g. if one user of a group gets this timeout, all users of that group are locked out. Or vice versa: if it still works for one user of e.g. our admin group, it still works for all users in the admin group.

To give you an idea of the occurrences:

# zgrep "Do not have sufficient privileges to fetch your belonging groups" /var/log/syslog*
/var/log/syslog:Jan 28 14:56:16 jenkins-server jenkins[1337699]: 2025-01-28 13:56:16.568+0000 [id=75285]#011WARNING#011c.m.j.azuread.AzureCachePool#lambda$getBelongingGroupsByOid$1: Do not have sufficient privileges to fetch your belonging groups' authorities.
/var/log/syslog:Jan 28 14:57:13 jenkins-server jenkins[1337699]: 2025-01-28 13:57:13.064+0000 [id=76013]#011WARNING#011c.m.j.azuread.AzureCachePool#lambda$getBelongingGroupsByOid$1: Do not have sufficient privileges to fetch your belonging groups' authorities.
/var/log/syslog:Jan 28 14:59:07 jenkins-server jenkins[1337699]: 2025-01-28 13:59:07.089+0000 [id=74501]#011WARNING#011c.m.j.azuread.AzureCachePool#lambda$getBelongingGroupsByOid$1: Do not have sufficient privileges to fetch your belonging groups' authorities.
/var/log/syslog:Jan 28 15:02:04 jenkins-server jenkins[1337699]: 2025-01-28 14:02:04.745+0000 [id=76833]#011WARNING#011c.m.j.azuread.AzureCachePool#lambda$getBelongingGroupsByOid$1: Do not have sufficient privileges to fetch your belonging groups' authorities.
/var/log/syslog:Jan 28 15:04:45 jenkins-server jenkins[1337699]: 2025-01-28 14:04:45.979+0000 [id=77112]#011WARNING#011c.m.j.azuread.AzureCachePool#lambda$getBelongingGroupsByOid$1: Do not have sufficient privileges to fetch your belonging groups' authorities.
/var/log/syslog:Jan 28 15:06:04 jenkins-server jenkins[1337699]: 2025-01-28 14:06:04.633+0000 [id=76833]#011WARNING#011c.m.j.azuread.AzureCachePool#lambda$getBelongingGroupsByOid$1: Do not have sufficient privileges to fetch your belonging groups' authorities.
/var/log/syslog:Jan 28 15:08:07 jenkins-server jenkins[1337699]: 2025-01-28 14:08:07.814+0000 [id=77106]#011WARNING#011c.m.j.azuread.AzureCachePool#lambda$getBelongingGroupsByOid$1: Do not have sufficient privileges to fetch your belonging groups' authorities.
/var/log/syslog:Jan 30 11:05:59 jenkins-server jenkins[1337699]: 2025-01-30 10:05:59.163+0000 [id=180349]#011WARNING#011c.m.j.azuread.AzureCachePool#lambda$getBelongingGroupsByOid$1: Do not have sufficient privileges to fetch your belonging groups' authorities.
/var/log/syslog:Jan 30 11:12:05 jenkins-server jenkins[1337699]: 2025-01-30 10:12:05.638+0000 [id=180804]#011WARNING#011c.m.j.azuread.AzureCachePool#lambda$getBelongingGroupsByOid$1: Do not have sufficient privileges to fetch your belonging groups' authorities.
/var/log/syslog:Jan 30 11:14:11 jenkins-server jenkins[1337699]: 2025-01-30 10:14:11.388+0000 [id=180803]#011WARNING#011c.m.j.azuread.AzureCachePool#lambda$getBelongingGroupsByOid$1: Do not have sufficient privileges to fetch your belonging groups' authorities.

I'm not a real coder anymore, but I guess the preferred behavior would be to invalidate belongingGroupsByOid immediately when you reach that exception block.
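To illustrate the behavior suggested above, here is a minimal sketch of a cache that refuses to store the result of a failed lookup. It is not the plugin's code: `GroupCacheSketch`, `GroupLookup`, and `fetchGroups` are hypothetical names, and a plain `ConcurrentHashMap` with a manual expiry stands in for the Caffeine cache the plugin uses. It assumes the current lock-out happens because the empty result returned from the exception block ends up in the cache.

```java
import java.io.IOException;
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the plugin's group cache; not the plugin's actual code. */
public class GroupCacheSketch {

    /** Hypothetical lookup standing in for the Microsoft Graph call. */
    public interface GroupLookup {
        List<String> fetchGroups(String oid) throws IOException;
    }

    /** A cached group list together with its expiry time. */
    private static final class Entry {
        final List<String> groups;
        final Instant expiresAt;

        Entry(List<String> groups, Instant expiresAt) {
            this.groups = groups;
            this.expiresAt = expiresAt;
        }
    }

    private final Map<String, Entry> cache = new ConcurrentHashMap<>();
    private final Duration ttl;
    private final GroupLookup lookup;

    public GroupCacheSketch(Duration ttl, GroupLookup lookup) {
        this.ttl = ttl;
        this.lookup = lookup;
    }

    public List<String> getBelongingGroupsByOid(String oid) {
        Entry cached = cache.get(oid);
        if (cached != null && Instant.now().isBefore(cached.expiresAt)) {
            return cached.groups;
        }
        try {
            List<String> groups = List.copyOf(lookup.fetchGroups(oid));
            // Only successful lookups are stored, for the full TTL.
            cache.put(oid, new Entry(groups, Instant.now().plus(ttl)));
            return groups;
        } catch (IOException e) {
            // Failed lookup: drop any expired entry and do NOT cache the empty
            // result, so the next request retries instead of waiting out the TTL.
            cache.remove(oid);
            return List.of();
        }
    }
}
```

With this shape, a single timeout still denies the current request, but the next login attempt triggers a fresh lookup rather than reusing an empty group list for an hour. The sketch does not deduplicate concurrent lookups for the same user, which a real cache library would normally handle.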

full-stack-trace.txt

Are you interested in contributing a fix?

No response

glorang, Jan 30 '25 11:01

Thanks for the report and the analysis. I'd say yes: if group integration is enabled and the lookup fails, a bad result shouldn't be cached. 👍

timja, Jan 30 '25 12:01

Environment Versions

Jenkins core: 2.462.1
ExtendedEmailPublisher: 2.97
EntraID: 385.v5d9f88612dd2
Caffeine API: 3.1.8-133.v17b_1ff2e0599

Thanks for opening the bug and investigating. We are experiencing this issue with the EntraID plugin but also with the ExtendedEmailPublisher plugin. Maybe it is related to how Caffeine is used for group resolution?

EntraID Issue

All our authenticated users get read access, so our users aren't locked out of Jenkins, but they don't have the permissions to trigger jobs since the memberships aren't resolved.

ExtendedEmailPublisher Issue

When send mail is called, it times out connecting to Microsoft Graph for group resolution and fails the job. The stack trace is attached.

I saw that there is a new version of Caffeine API, 3.2.0, but couldn't tell whether a potential fix is included.

ext-mail-sendmail-failure.txt

ibidani, Apr 06 '25 16:04