microsoft-authentication-library-for-dotnet icon indicating copy to clipboard operation
microsoft-authentication-library-for-dotnet copied to clipboard

[Bug] Frequent MsalUiRequiredException from refresh tokens not found in cache, tokens cached under incorrect account ID

Open spongessuck opened this issue 2 years ago • 25 comments

Which version of MSAL.NET are you using? Microsoft.Identity.Client 4.42.0 Microsoft.Identity.Client.Extensions.Msal 2.16.5 Microsoft.Identity.Web.TokenCache 1.23.1

Overview

  1. Use a confidential client app in a web site scenario (S2S should repro this too). The app must be a singleton
  2. Configure some distributed cache (can be MemoryDistributedCache, but I recommend a file based DistributedCache)
  3. Make many requests in parallel

Actual: race conditions occur which make 2 accounts land in a single cache node. This causes cache misses later.

Possible cause: cache accessor objects are set per application, and not per request. Multiple requests share the same accessor.

Workaround: create the application object for each request.

Repro We have a .NET Framework MVC app using hybrid flow with OWIN to authenticate users and then retrieve access tokens via authorization code/refresh token. The tokens are fetched for each of 3 downstream APIs with separate app registrations in Azure. The app used to use ADAL, and at that time it had a naive token cache that was serialized and written to the auth cookie, but after adding a third API endpoint the cookie would end up too large and cause a 400 error, so we moved to MSAL and a distributed token cache using Azure Cache for Redis with the default L1/L2 setup. The tokens are fetched with a ConfidentialClientApplication that is singleton-scoped. We have no session state provider; "state" is light, and values are stored as claims in the identity and persist in the auth cookie.

We've had persistent issues with exceptions related to refresh tokens not being found in the token cache. Because of this, the app is set up to ensure a token for each of these APIs is able to be acquired silently at the beginning of each (MVC) request, and if it is not (via caught MsalUiRequiredExceptions), the user is redirected through the MS auth endpoint to get a new auth code and attempt to fetch and re-cache new tokens that can then be acquired silently. We've also implemented an eager refresh to check the expiration of the token and force refresh to get a new one if it will expire in less than 30 minutes. Even with these measures in place, frequent MsalUiRequiredExceptions persist and, rarely, more pernicious issues, like auth loops and "cannot log you in" pages if the tokens are repeatedly unable to be acquired silently.

One trend that we've seen is that the issues seem to be exacerbated by successive cache hits in a short duration. A theory I've had is that somehow there's contention while writing to the cache if, say, 2 requests happen at the same time for the same user, the token cannot be acquired silently, there are 2 calls for a new auth code, and a new refresh token is acquired for both, but maybe the first is invalidated because the second was provided and the first ends up in the cache. I'm grasping at straws, as I'm not sure how the underlying cache sequence works, if after the token fetched in L1 is tried and fails if it looks in L2 or goes straight to throwing an MsalUiRequiredException, if when a new token is cached if it's written to L1 first and then pushed to Redis, or what.

Another thing we've discovered is that we're seeing users' tokens cached under a key that doesn't match the user's account ID. This again leads me to think that there's some kind of thread-unsafe behavior with the cache, rescoping the CCA in DI to per-request and transient did not help prevent this from happening.

Expected behavior A valid refresh token is always available in the token cache. An MsalUiRequiredException is almost never thrown after a user has signed into the app.

Actual behavior Frequent MsalUiRequiredExceptions. Rare authentication loops and "Could not sign you in" pages after tokens for downstream APIs are repeatedly unable to be acquired silently even after auth code redemption.

spongessuck avatar Apr 15 '22 17:04 spongessuck

Here's a snippet from the OWIN middleware that ensures tokens for each downstream API can be fetched silently:

        public override async Task Invoke(IOwinContext context)
        {
            var request = new HttpRequestWrapper(HttpContext.Current.Request);
            var claimsPrincipal = context.Authentication.User;
            
            if (request.HttpMethod == "GET"
                && !request.IsAjaxRequest()
                && HttpContext.Current.CurrentHandler is MvcHandler handler
                && !controllersToAvoidChecking.Contains(handler.RequestContext.RouteData.Values["controller"].ToString())
                // Object ID claim will be null until auth'd
                && claimsPrincipal?.GetObjectId() != null
                && !await this.CanAcquireTokenSilent(claimsPrincipal.GetMsalAccountId()))
            {
                context.Authentication.Challenge(OpenIdConnectAuthenticationDefaults.AuthenticationType);
                return;
            }

            await Next.Invoke(context);
        }

        public virtual async Task<bool> CanAcquireTokenSilent(string msalAccountId)
        {
            var account = await clientApp.GetAccountAsync(msalAccountId);
            if (account == null)
                return false;

            try
            {
                foreach (var resource in this.config.Resources)
                {
                    await clientApp.AcquireTokenSilent(new[] { $"{resource.ClientId}/.default" }, account).ExecuteAsync();
                }
            }
            catch (MsalException ex)
            {
                // Ensure auth challenge
                return false;
            }

            return true;
        }

spongessuck avatar Apr 15 '22 19:04 spongessuck

Is it possible to enabled verbose logging and capture a trace for when MsalUiRequiredException happens?

Tokens are cached with key home_tenant_id.home_obj_id. For normal users (i.e, not guest users), claimsPrincipal.GetMsalAccountId(), which is really tid.oid matches. But for guest users, this no longer holds and results in cache miss. Our ASP.NET classic sample does not handle guest users.

I believe @jmprieur is looking at adding OWIN support to Microsoft.Identity.Web, might be worth following that.

bgavrilMS avatar Apr 18 '22 10:04 bgavrilMS

Also, you should not need Microsoft.Identity.Client.Extensions.Msal 2.16.5, this is for desktop apps.

bgavrilMS avatar Apr 18 '22 10:04 bgavrilMS

@bgavrilMS I've attached verbose logs.

I've also attached an example token cache value with 2 users' token info under a single user's key. This was while pointing IDistributedCache to a local redis instance.

All of our users are in our home tenant so there shouldn't be any guest users in our application, so I don't think that's the issue.

Thanks for the info about Microsoft.Identity.Client.Extensions.Msal. It's actually a dependency of another package, so I don't think it's an active part of our auth pipeline.

MSALLogs - redacted.txt tokenCacheEntry.txt

spongessuck avatar Apr 18 '22 14:04 spongessuck

How did you setup your cache? Are you using Microsoft.Identity.Web.TokenCache ? Or did you write your own - in which case can we see the code? Are you ever clearing the cache?

Are there any eviction policies on L1 / L2 ?

The cache access is not syncronized, i.e. if 2 requests for the same user happen at the same time, there's no guarantee for the outcome. Generally this is not a problem.

bgavrilMS avatar Apr 20 '22 15:04 bgavrilMS

I am using Microsoft.Identity.Web.TokenCache, I left the defaults in terms of settings. Here is how it's set up:

        instance.RegisterSelf(() =>
        {
            var openIdConnectAuthConfig = oidcConfigGetter();

            var clientApp = ConfidentialClientApplicationBuilder
                .Create(openIdConnectAuthConfig.ClientId)
                .WithClientSecret(openIdConnectAuthConfig.ClientSecret)
                .WithRedirectUri(openIdConnectAuthConfig.RedirectUri)
                .WithAuthority(new Uri(openIdConnectAuthConfig.Authority))
                .WithLegacyCacheCompatibility(false)
                .Build();

#if !NETCOREAPP2_1_OR_GREATER
            clientApp.AddDistributedTokenCache(services =>
            {
                services.AddStackExchangeRedisCache(options =>
                {
                    options.Configuration = distributedCacheConfigGetter().ConnectionString;
                });
            });
#endif

            return clientApp;
        });

spongessuck avatar Apr 20 '22 15:04 spongessuck

Is there a better way for me to configure the cache?

Would this issue make more sense in the microsoft-identity-web repo?

spongessuck avatar Apr 25 '22 17:04 spongessuck

which flow are you using, @spongessuck ? would it be AcquireTokenOnBehalfOf?

jmprieur avatar Apr 25 '22 18:04 jmprieur

Authorization code flow. It's a web app that requests tokens for downstream APIs.

spongessuck avatar Apr 25 '22 18:04 spongessuck

Are you using Microsoft.Identity.Web? Is the cache missed only with guest accounts? @spongessuck?

jmprieur avatar Apr 25 '22 18:04 jmprieur

I'm using Microsoft.Identity.Web.TokenCache but not the main package, as it's a .NET Framework app. My auth pipeline is done via OWIN.

On Mon, Apr 25, 2022, 14:08 Jean-Marc Prieur @.***> wrote:

Are you using Microsoft.Identity.Web?

— Reply to this email directly, view it on GitHub https://github.com/AzureAD/microsoft-authentication-library-for-dotnet/issues/3285#issuecomment-1108882033, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABQTITWDN6DHJX3UUYJW7HLVG3NRNANCNFSM5TQ5YAYQ . You are receiving this because you were mentioned.Message ID: <AzureAD/microsoft-authentication-library-for-dotnet/issues/3285/1108882033 @github.com>

spongessuck avatar Apr 25 '22 18:04 spongessuck

@jmprieur there are no guest accounts, this is a single-tenant app and it's all internal users.

spongessuck avatar May 02 '22 20:05 spongessuck

Hi, we have the same problem - we use Redis cache to cache tokens, and our Redis records have different keys but the value is the same - there are tokens of all users of all AD organizations that ever logged in to our web application. We use the authorization code flow and Microsoft.Identity.Client library (so, not the Microsoft.Identity.Web.TokenCache) and this is our code:

            var app = ConfidentialClientApplicationBuilder
                .Create(ConfigurationSettings.ActiveDirectoryClientId)
                .WithClientSecret(ConfigurationSettings.ActiveDirectoryClientSecret)
                .WithTenantId(ConfigurationSettings.AzureSubscriptionId)
                .WithRedirectUri(redirectUri())
                .Build();

            // use Redis cache for token storage, this is needed to support instance scaling in Azure (by default tokens are stored only in memory)
            app.UserTokenCache.SetBeforeAccess(args =>
            {
                // read data from cache
                var data = CacheFactory.CreateCacheProvider().GetObject<byte[]>(args.SuggestedCacheKey);
                args.TokenCache.DeserializeMsalV3(data);
            });

            app.UserTokenCache.SetAfterAccess(args =>
            {
                // write data to cache with 3 months expiration
                if (args.HasStateChanged)
                {
                    var data = args.TokenCache.SerializeMsalV3();
                    if (data == null || !args.HasTokens)
                    {
                        CacheFactory.CreateCacheProvider().Invalidate(args.SuggestedCacheKey);
                    }
                    else
                    {
                        CacheFactory.CreateCacheProvider().AddOrUpdate(args.SuggestedCacheKey, data, a => data, expirationMode: ExpirationMode.Absolute, expireAfter: TimeSpan.FromDays(30 * 3));
                    }
                }
            });

            return app;

Currently, one record in Redis cache has around 5 MB, and this record is repeated under different cache keys (obj_id.tenant_id) for every user that logs in using AD. Our Redis cache quickly filled up and reading these records from Redis slows down our app. It seems that args.TokenCache always contains all records, not just the one of the user that is being logged in. But args.SuggestedCacheKey is two guids connected with dot (obj_id.tenant_id) What is interesting - our application is hosted in 3 cloud services (with extended support) on Azure, where 2 of the cloud services exhibit this problem, and 1 saves records into Redis correctly - one record per user, as we would expect. But the code is the same. We can't find any differences of AD app registrations in Azure configuration. App registrations are different for these 3 cloud services, but there is only one AD tenant (organization). Any ideas?

Mek7 avatar Sep 09 '22 09:09 Mek7

@spongessuck - for ASP.NET OWIN, I have a PR out in our sample that shows how to get the correct account id. See https://github.com/Azure-Samples/ms-identity-aspnet-webapp-openidconnect/pull/76 - this should fix the guest user scenarios. If you still have problems, can you try to reproduce the issue and get us some logs? AcquireTokenByAuthorizationCode should create an entry in Redis with the account_id key. And then GetAccount(account_id) should load that entry and return an IAccount object.

@Mek7 - you might have a bug in how you deserialize the cache I think. The low level APIs are quite tricky to get right. That's why it's better to use the higher-level API - https://docs.microsoft.com/en-us/azure/active-directory/develop/msal-net-token-cache-serialization?tabs=aspnet - here you just register Redis as IDistributedCache and it'll work. Just FYI, this is the adapter between low-level API and a Read / Write / Delete interface https://github.com/AzureAD/microsoft-identity-web/blob/master/src/Microsoft.Identity.Web.TokenCache/MsalAbstractTokenCacheProvider.cs. Note that the data is in json format, you should be able to inspect it to see how many tokens / account metadata objects there are in there. There should be 1 account per node. I expect a node to be around 5-7 KB.

bgavrilMS avatar Sep 09 '22 10:09 bgavrilMS

For the third time, we do not have guest users. All of our users are in our tenant.

On Fri, Sep 9, 2022, 06:52 Bogdan Gavril @.***> wrote:

@spongessuck https://github.com/spongessuck - for ASP.NET OWIN, I have a PR out in our sample that shows how to get the correct account id. See Azure-Samples/ms-identity-aspnet-webapp-openidconnect#76 https://github.com/Azure-Samples/ms-identity-aspnet-webapp-openidconnect/pull/76

  • this should fix the guest user scenarios. If you still have problems, can you try to reproduce the issue and get us some logs? AcquireTokenByAuthorizationCode should create an entry in Redis with the account_id key. And then GetAccount(account_id) should load that entry and return an IAccount object.

@Mek7 https://github.com/Mek7 - you might have a bug in how you deserialize the cache I think. The low level APIs are quite tricky to get right. That's why it's better to use the higher-level API - https://docs.microsoft.com/en-us/azure/active-directory/develop/msal-net-token-cache-serialization?tabs=aspnet

  • here you just register Redis as IDistributedCache and it'll work. Just FYI, this is the adapter between low-level API and a Read / Write / Delete interface https://github.com/AzureAD/microsoft-identity-web/blob/master/src/Microsoft.Identity.Web.TokenCache/MsalAbstractTokenCacheProvider.cs. Note that the data is in json format, you should be able to inspect it to see how many tokens / account metadata objects there are in there. There should be 1 account per node. I expect a node to be around 5-7 KB.

— Reply to this email directly, view it on GitHub https://github.com/AzureAD/microsoft-authentication-library-for-dotnet/issues/3285#issuecomment-1241819984, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABQTITXTLGXU3QF2FE2HL5LV5MJHDANCNFSM5TQ5YAYQ . You are receiving this because you were mentioned.Message ID: <AzureAD/microsoft-authentication-library-for-dotnet/issues/3285/1241819984 @github.com>

spongessuck avatar Sep 09 '22 12:09 spongessuck

@spongessuck - the cache entry is strange, because it has 2 users in it. I would expect it to have a single user. Need to dive deeper in this:

  1. Is the ConfidentialClientApplication object a singleton?
  2. Before you call AcquireTokenSilent, you need an IAccount object. How do you get / create this account? Do you create one on your own? Or do you use cca.GetAccount(accound_id) where the account_id is the oid + . + tid from the ClaimsPrincipal associated with the session?

bgavrilMS avatar Sep 09 '22 13:09 bgavrilMS

Also adding @pmaytak to help investigate.

bgavrilMS avatar Sep 09 '22 13:09 bgavrilMS

  1. Yes, the ConfidentialClientApplication is a singleton. We don't have session state in our app so we can't keep an instance per session.
  2. Yes, we get the account like so:
    var account = await clientApp.GetAccountAsync(claimsPrincipal.GetMsalAccountId());
    
    ...
    public static class ClaimsPrincipalExtensions
    {
    	public static string GetObjectId(this ClaimsPrincipal claimsPrincipal)
    		=> claimsPrincipal.FindFirst("http://schemas.microsoft.com/identity/claims/objectidentifier")?.Value;
    
    	public static string GetTenantId(this ClaimsPrincipal claimsPrincipal)
    		=> claimsPrincipal.FindFirst("http://schemas.microsoft.com/identity/claims/tenantid")?.Value;
    
    	//https://github.com/Azure-Samples/ms-identity-aspnet-webapp-openidconnect/blob/69f656e139fb033ca3a6d9815a888a15677f5c96/WebApp/Utils/MsalAppBuilder.cs#L37
    	public static string GetMsalAccountId(this ClaimsPrincipal claimsPrincipal)
    	{
    		string oid = claimsPrincipal.GetObjectId();
    		string tid = claimsPrincipal.GetTenantId();
    		return $"{oid}.{tid}";
    	}
    }
    
    

spongessuck avatar Sep 09 '22 18:09 spongessuck

@spongessuck - my hypothesis is that there could be a bug in MSAL where a race condition between several requests is messing up the cache.

While we investigate, can you please try to move away from the singleton - create a confidential client application object for each request instead? This is what Microsoft.Identity.Web is doing internally and it works.

bgavrilMS avatar Sep 09 '22 22:09 bgavrilMS

Yes I'll try changing it to one per request. I probably won't have news for a little while as my team is busy with other things at the moment.

On Fri, Sep 9, 2022, 18:44 Bogdan Gavril @.***> wrote:

@spongessuck https://github.com/spongessuck - my hypothesis is that there could be a bug in MSAL where a race condition between several requests is messing up the cache.

While we investigate, can you please try to move away from the singleton - create a confidential client application object for each request instead? This is what Microsoft.Identity.Web is doing internally and it works.

— Reply to this email directly, view it on GitHub https://github.com/AzureAD/microsoft-authentication-library-for-dotnet/issues/3285#issuecomment-1242536398, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABQTITQKED5CDVNY66Y6QF3V5O4TNANCNFSM5TQ5YAYQ . You are receiving this because you were mentioned.Message ID: <AzureAD/microsoft-authentication-library-for-dotnet/issues/3285/1242536398 @github.com>

spongessuck avatar Sep 09 '22 23:09 spongessuck

@Mek7 In your code, you should call args.TokenCache.DeserializeMsalV3(data, shouldClearExistingCache: true); like here. Is your CCA instance reused or recreated for each request?

MSAL has an internal cache structure (basically a dictionary). When cache serialization is enabled, either via callbacks or by using Microsoft.Identity.Web.TokenCaches, when MSAL wants to do a cache read operation, it calls that read callback to get a cache entry from an external cache (ex. Redis) to populate the internal dictionary, through which it then searches for a valid token. When MSAL wants to do a write operation, it serializes the contents of the dictionary and passes it to the write callback.

The case where one cache entry can have multiple accounts in it can happen if CCA instance is singleton and shouldClearExistingCache flag is false - then the internal cache is never cleared, grows with accounts and full contents are serialized to each cache entry.

@spongessuck In your case, since you use Microsoft.Identity.Web.TokenCache, it does set the flag correctly to clear the cache. However, the issue could be since the CCA is singleton, then consequently the internal cache is reused between requests causing the cache state to be inconsistent. So recreating the CCA instance for each request may fix this.

pmaytak avatar Sep 09 '22 23:09 pmaytak

@pmaytak We will pass the flag shouldClearExistingCache: true to the deserialization method and see how it goes. Yes, our CCA is a singleton. It is created in a property getter that returns Lazy<IConfidentialClientApplication>. The CCA Builder isn't initialized with any request-specific properties, so we thought it should be created once and then reused. Is it correct for a web application? Or should the CCA be built on every access to the property or every http request? Now we will try the flag, will let you know here what is the result, thanks for the suggestion :)

Mek7 avatar Sep 12 '22 06:09 Mek7

Generally we suggest new CCA for every web request.

pmaytak avatar Sep 12 '22 19:09 pmaytak

Thanks. The flag shouldClearExistingCache: true didn't make a difference, so now we will try a new CCA for every web request. Will let you know the result.

Mek7 avatar Sep 13 '22 05:09 Mek7

So far it looks that it helped. We will continue monitoring Redis cache usage. Thanks again for the hints and explanations.

Mek7 avatar Sep 14 '22 06:09 Mek7

Potential workaround 2: use WithCacheSynchronization(true) when you construct the CCA

bgavrilMS avatar Nov 02 '22 13:11 bgavrilMS

Changing the binding to per-request instead of singleton has greatly reduced the issues our users are experiencing.

The documentation should be updated to make it clear that sharing the CCA between requests will cause issues; this would have prevented a lot of headaches for us.

spongessuck avatar Jan 23 '23 15:01 spongessuck

Bumping to P1 as it affects MISE scenarios.

@jennyf19 @jmprieur @GeoK @pmaytak - let's use this work item for tracking.

bgavrilMS avatar Jul 21 '23 13:07 bgavrilMS

Update: this issue seems to be affecting more configurations. The only workaround is to create a CCA object on each request. Setting CacheSyncronization flag does not fix it.

bgavrilMS avatar Jul 21 '23 15:07 bgavrilMS

@pmaytak - is not not fixed yet?

bgavrilMS avatar Aug 17 '23 14:08 bgavrilMS