Windows-Containers Using gMSA on multiple containers simultaneously causes Domain Trust Relationship to fail

Describe the bug

When multiple containers run simultaneously with the same gMSA, on either the same host or different hosts, one or more of them lose their domain trust relationship, leading to various issues including LsaLookUp and negotiate auth failures. This is especially likely when the number of containers is equal to or greater than the number of domain controllers in the environment. However, it can also happen with fewer containers than domain controllers, provided two or more containers attempt to talk to the same domain controller.
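As a rough illustration of why the container-to-domain-controller ratio matters, the sketch below computes the chance that at least two containers end up on the same domain controller. It assumes each container picks a domain controller uniformly and independently at random, which is a simplification and not something the report confirms; the function name is hypothetical.

```python
from math import prod


def collision_probability(containers: int, domain_controllers: int) -> float:
    """Chance that at least two containers pick the same domain controller.

    ASSUMPTION: every container picks a DC uniformly and independently at
    random. Real DC location (sites, weights, caching) is more involved.
    """
    if domain_controllers < 1:
        raise ValueError("at least one domain controller is required")
    if containers > domain_controllers:
        # Pigeonhole: more containers than DCs guarantees a shared DC.
        return 1.0
    no_collision = prod(
        (domain_controllers - i) / domain_controllers for i in range(containers)
    )
    return 1.0 - no_collision


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(n, "containers, 3 DCs:", round(collision_probability(n, 3), 3))
```

Under this assumption a shared domain controller is already likely well before the container count exceeds the domain controller count, which is consistent with the observation that the issue also appears with fewer containers than domain controllers.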

To Reproduce

Build an image from the following Dockerfile

FROM mcr.microsoft.com/dotnet/aspnet:6.0-windowsservercore-ltsc2019 AS base

USER ContainerAdministrator
RUN reg.exe add "HKLM\SYSTEM\CurrentControlSet\Control\Lsa" /v LsaLookupCacheMaxSize /t REG_DWORD /d 0 /f

USER ContainerUser
ENTRYPOINT ["powershell.exe", "1..500 | %{ [void][System.Security.Principal.NTAccount]::new('contoso\\someobj').Translate([System.Security.Principal.SecurityIdentifier]).Value; Start-Sleep -Milliseconds (Get-Random -Minimum 100 -Maximum 1000); }"]

Replace contoso\someobj above with the SAM account name of an actual object in your domain.

Run the container image simultaneously on multiple hosts using the following command. To increase the chances of running into the issue, if there are N domain controllers in the environment, run the container image simultaneously on at least N+1 hosts.

docker run --security-opt "credentialspec=file://gmsa-credspec.json" --hostname <gMSAName>  -it <image>

Replace <gMSAName> with the actual gMSA name, file://gmsa-credspec.json with the actual gMSA Credential Spec file, and <image> with the container image.

Monitor the output of all the containers. Eventually one or more containers will start throwing the following error message. This usually happens within the first few seconds of the container starting, assuming the docker run ... command in the previous step was run simultaneously on different hosts. If it does not happen, repeat the previous step until it does.

Exception calling "Translate" with "1" argument(s): "The trust relationship between this workstation and the primary domain failed."

While a running container is throwing the above error message in its output, exec into it and try performing some domain operation - that will fail as well.

Expected behavior

Using a gMSA on multiple Windows containers has been officially supported since at least Windows Server 2019. Running a gMSA on multiple containers simultaneously should not cause the trust relationship to fail.

Configuration:

Edition: Windows Server 2022
Base Image being used: Windows Server Core
Container engine: docker

Additional context

While the reproducer uses a PowerShell base image to demonstrate the bug, we had originally run into this issue in an ASP.NET Core web application while performing negotiate authentication.
The container image in the reproducer purposefully disables LSA LookUp Cache by setting LsaLookupCacheMaxSize to 0 to simplify the example.
If you were to observe the traffic of a container that has run into this issue, the packet capture will show a lot of DCERPC/RPC_NETLOGON failure messages. You may also observe packets reporting nca_s_fault_sec_pkg_error.
Sometimes the container may "autorecover". It is purely a chance event. Whenever this happens, you can see RPC_NETLOGON packets in the network capture. Typically the container recovers its domain trust relationship only when the NETLOGON happens through a different domain controller than the one the container had communicated with earlier.
It is also possible to re-establish the domain trust relationship of a failing container by running the following command inside it (the runtime user should be ContainerAdministrator or should have administrator privileges):
```
nltest.exe /sc_reset:contoso.com
```
If the above command does not succeed, you may have to run it more than once. When the command succeeds, more often than not, all the affected containers and not just the current container "recover".
As mentioned in the bug description, it is very easy to run into this issue when the count of containers is more than the number of domain controllers in the environment but that is not the only scenario.
docker run ... is not the only way to run into this issue. It can also be reproduced on an orchestration platform like Kubernetes, by setting the replica count of the Deployment to N+1 or by using the scaling feature.
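The "run it more than once" advice for nltest.exe /sc_reset above can be sketched as a small retry wrapper. This is a minimal sketch, assuming nltest.exe returns exit code 0 on success and a non-zero code on failure; the function name, the retry count, and the delay are hypothetical. It must run inside the failing container as a user with administrator privileges.

```python
import subprocess
import time
from typing import Callable, Optional, Sequence


def reset_secure_channel(
    domain: str,
    attempts: int = 3,
    delay_seconds: float = 2.0,
    run: Optional[Callable[[Sequence[str]], int]] = None,
) -> bool:
    """Retry `nltest.exe /sc_reset:<domain>` until it succeeds or attempts run out.

    ASSUMPTION: exit code 0 means the secure channel reset succeeded.
    `run` is injectable so the retry logic can be exercised without Windows.
    """
    if run is None:
        run = lambda cmd: subprocess.run(cmd, check=False).returncode
    for attempt in range(1, attempts + 1):
        if run(["nltest.exe", f"/sc_reset:{domain}"]) == 0:
            return True
        if attempt < attempts:
            time.sleep(delay_seconds)  # brief pause before the next try
    return False
```

For example, reset_secure_channel("contoso.com", attempts=5) would try the reset up to five times. It only automates the manual workaround and does not address the underlying problem.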

Aug 02 '23 18:08 avin3sh

Hi, thanks for bringing this issue to our attention. First, I have to give credit where credit is due. This is so well written up! Thank you for providing a very clear description of the current and expected behavior.

Second, this is a quick question: Is there a reason why all the containers in this cluster all have the same gMSA?

Aug 04 '23 18:08 ntrappe-msft

Is there a reason why all the containers in this cluster all have the same gMSA?

We actually don't use the same gMSA for all the containers in the cluster. Different types of application containers run with different gMSAs.

The problem arises when there are multiple instances (replicas) of the same application, such as an application that needs to be highly available. During my testing I also found that they do not have to be replicas of the same container image/deployment; different containers running as the same gMSA will also run into this issue.

Multiple containers running as same gMSA can't be avoided for these purposes - without them we can't distribute our workload or promise high availability.

Aug 07 '23 11:08 avin3sh

@ntrappe-msft has there been an internal confirmation of this bug and any discussion of a fix? This issue severely limits the ability to scale Windows containers and use AD authentication because of the direct relation between the number of containers and the number of domain controllers.

Sep 06 '23 12:09 avin3sh

Hi, thank you for your patience! We know this is blocking you right now and we're working hard to make sure it's resolved as soon as possible. We've reached out to the gMSA team to get more context on the problem and some troubleshooting suggestions.

Sep 06 '23 16:09 ntrappe-msft

The gMSA team is still doing their investigation but they can confirm that this is unexpected and unusual behavior. We may ask for some logs in the future if it would help them diagnose the root cause.

Sep 20 '23 20:09 ntrappe-msft

Hi, could you give us a few follow-up details?

Are you using process-isolated or hyper-v isolated containers?
Are you using the same container hostname and gMSA name?
What is the host OS version?

Nov 06 '23 18:11 ntrappe-msft

Hi Nicole @ntrappe-msft

Are you using process-isolated or hyper-v isolated containers?

Process Isolation

Are you using the same container hostname and gMSA name?

Correct

What is the host OS version?

Microsoft Windows Server 2022 Standard (Core), with October CU applied

Sharing some more data from our experiments, in case it helps the team troubleshoot the issue:

When all the containers of a gMSA are given a different, unique, value for the hostname, at least the Domain Trust Relationship error goes away - although that may have broken something else, we did not look in that direction. However;
If the value of hostname for each container is >15 characters in length, and the value is unique BUT first 15 characters are not-unique, we again start seeing the issue related to Domain Trust Relationship. This interestingly coincides with 15 character length limit for computer name / NETBIOS limitation.

This means if you have a very long value of hostname and first few characters are not unique, gMSA issues start occurring in multi-container scenario.

If you were to use some container orchestration solution, like Kubernetes, the value of pod name, which is what gets supplied as hostname value to the container runtime, is in all the realistic scenarios >15 characters and the first few characters are common for each pod (deployment name + replicaset ID) -- this would cause problem with gMSAs in that case as well
Just out of curiosity, instead of docker runtime, I directly used containerd and I could reproduce the problem there as well
Not specifying a hostname when launching containers with the same gMSA does not give this error. I believe the container runtime internally assigns some random ID as the hostname in that case (scenario (1) above), which seems to imply that the problem here is multiple containers having the same name?
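The 15-character observation above can be checked against a set of planned hostnames or pod names before deploying. The sketch below groups names by their first 15 characters and reports the groups that would clash. The pod names are made up for illustration, and comparing case-insensitively is an assumption based on NetBIOS names conventionally being upper-cased.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

NETBIOS_NAME_LIMIT = 15  # length limit mentioned in the comment above


def netbios_collisions(hostnames: Iterable[str]) -> Dict[str, List[str]]:
    """Group hostnames by their first 15 characters and keep only clashes.

    ASSUMPTION: names are compared case-insensitively (upper-cased).
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in hostnames:
        groups[name[:NETBIOS_NAME_LIMIT].upper()].append(name)
    return {prefix: names for prefix, names in groups.items() if len(names) > 1}


if __name__ == "__main__":
    # Hypothetical pod names: deployment name + replicaset ID + random suffix.
    pods = [
        "webapp-7d9f8c6b5-abcde",
        "webapp-7d9f8c6b5-fghij",
        "api-5c7d9-xk2lp",
    ]
    for prefix, names in netbios_collisions(pods).items():
        print(prefix, "->", names)
```

With typical Kubernetes pod names, the unique suffix falls beyond the 15th character, so replicas of the same Deployment end up in the same group.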

In context of containers with gMSA, having same name as gMSA name has been the norm for a while. Not specifying hostname isn't always possible, explicitly specifying hostname shouldn't break the status quo, and when using orchestration solutions, like the example I listed above, the user has no direct control on the value of hostname.

This issue has been severely restricting usage of Windows Containers at scale :(

Nov 07 '23 07:11 avin3sh

🔖 ADO 47828389

Nov 23 '23 01:11 ntrappe-msft

While we appreciate that the Containers team is still looking into this issue, I wanted to share some insights into just how seemingly difficult this problem is to work around.

In order to prevent requests landing on "bad" containers, I was trying to write a custom ASP.NET Core health check that could query the status of the container's trust relationship and mark the service as unhealthy when the domain trust fails. What seemed to be a very straightforward temporary fix/compromise for our problems turned out to be a complex anomaly:

Firstly, netapi32 DLL is not available in nanoserver, and won't be until next major release of Windows Server - https://github.com/microsoft/Windows-Containers/issues/72#issuecomment-1569257600
If we have the Server Core image as the base image and have the DLL moved to the nanoserver container, we could work around this but only to run into more problems
Within the gMSA container - the Win32 call will not automatically pick the Netlogon Policy Server
And if you do hardcode a domain controller for this purpose, the netlogon query response would still indicate that the trust relationship exists (NERR_Success as opposed to something like RPC_S_SERVER_UNAVAILABLE) - and this is while the container is actively reporting trust errors while performing AD operations
And even if we had managed to get all of this to work, to "repair" the Secure Channel we would have to run our container as ContainerAdministrator, which introduces a bunch of other security concerns
PowerShell commands such as Test-ComputerSecureChannel simply fail, because the interpretation of "hostname" is different within a gMSA Container vs. outside of it - where the command is typically used
In essence, any of the means to [programmatically] catch gMSA and Domain Trust issues for Containers, like ones documented at https://kubernetes.io/docs/tasks/configure-pod-container/configure-gmsa/#troubleshooting, turned out to be unhelpful

My guess as to why the usual means of troubleshooting gMSA/trust problems are not working for us is that it is probably a consequence of an attempt to fix a VERY SIMILAR problem for Containers in Server 2019:

We changed the behavior in Windows Server 2019 to separate the container identity from the machine name, allowing multiple containers to use the same gMSA simultaneously.

Since we do not understand how this was achieved, we have again reached a dead end and are desperately hoping the Containers team is able to solve our gMSA-Containers-At-Scale problem

Dec 27 '23 16:12 avin3sh

Thanks for the additional details. We've had a number of comments from internal and external teams struggling with the same issue. Our support team is still working to find a workaround that they can publish.

Jan 25 '24 21:01 ntrappe-msft

Support team is still working on this. We'll make sure we also update our "troubleshoot gMSAs" documentation when we can address the problem.

Feb 05 '24 21:02 ntrappe-msft

We're also running into this issue. We're using Windows Server 2019 container images. We do not have multiple container instances running with the same gMSA, yet we still get the same trust error. In our case, logging in with an AD user doesn't work, but the gMSA does work. Should I raise a ticket with support for assistance?

Update:

All of our containers have the same host name even if they run using different gMSAs
Using a different name for the containers does not solve the issue

Feb 28 '24 21:02 israelvaldez

Hello @ntrappe-msft - is the Containers team in touch with the gMSA/CCG group? Our support engineers informed us that we are the only ones who have reported this issue, but based on your confirmation in https://github.com/microsoft/Windows-Containers/issues/405#issuecomment-1911045014, and judging from the reactions on this issue, it is clear there are many users who have run into this exact problem.

Our case is that we try to login with an AD user it doesn't work, but the gMSA does work, should I raise a ticket with support for assistance.

@israelvaldez, see my above comment. I would think it is worth highlighting this problem to Microsoft Support from your end as well, so that it is obvious, without any doubt, that multiple customers face this and it can be appropriately prioritized (if not already).

Mar 06 '24 12:03 avin3sh

Hi @ntrappe-msft, we are also experiencing the same issue, with our gMSA containers intermittently losing trust with our domain and needing to be restarted. Wondering if Microsoft has any update on this issue.

We have multiple container instances running the same app and using the gMSA. Interestingly, even though each of them has its own unique hostname defined, the log shows it's connecting to the DC using the gMSA name as MachineName. Host/domain/dc names replaced with **.

EventID : 5720
MachineName : gmsa_**
Data : {138, 1, 0, 192}
Index : 1309
Category : (0)
CategoryNumber : 0
EntryType : Error
Message : The session setup to the Windows Domain Controller \** for the domain ** failed because the computer gmsa_** does not have a local security database account.
Source : NETLOGON
ReplacementStrings : {\**, **, **}
InstanceId : 5720
TimeGenerated : 13/03/2024 10:23:24 AM
TimeWritten : 13/03/2024 10:23:24 AM
UserName :
Site :
Container :
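The Data field in the event above can be read as a status code. The sketch below assumes the four bytes are a little-endian 32-bit NTSTATUS value, which is how NETLOGON events commonly carry their error code; the function name is hypothetical.

```python
from typing import Iterable


def event_data_to_ntstatus(data: Iterable[int]) -> int:
    """Interpret four event-data bytes as a little-endian 32-bit status code.

    ASSUMPTION: the event's Data field holds a little-endian NTSTATUS value.
    """
    raw = bytes(data)
    if len(raw) != 4:
        raise ValueError("expected exactly four bytes")
    return int.from_bytes(raw, byteorder="little")


# Bytes taken from the Data field of the event shown above.
print(hex(event_data_to_ntstatus([138, 1, 0, 192])))  # 0xc000018a
```

Under that assumption the value is 0xC000018A, which Microsoft documents as STATUS_NO_TRUST_SAM_ACCOUNT (no computer account for this workstation trust relationship). That matches the event message about the missing security database account.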

Mar 18 '24 01:03 WillsonAtJHG

@avin3sh you are definitely not the only one experiencing this Issue. There are a number of internal teams who would like to increase the severity of this Issue and attention towards it. I'm crossing my fingers that we'll have a positive update soon. But it does help us if more people comment on this thread highlighting that they too are encountering this problem.

Mar 20 '24 23:03 ntrappe-msft

This is a huge issue for us at Broadcom, with multiple Fortune 100 customers wanting this feature in one of our products and thousands of workloads being blocked from being migrated off VMs to containers.

Mar 27 '24 15:03 macsux

In my scenario I created a new gMSA other than the one I was using (which was not being used in multiple pods) and I was able to work around this problem, i.e. my pod had gmsa1, I created gmsa2, and suddenly the trust between the pod and the domain was fine.

Apr 03 '24 16:04 israelvaldez

The workaround is appreciated, but we would like to see Microsoft fix this issue directly so that customers do not need to significantly redesign their environments.

Apr 03 '24 18:04 julesroussel3

This issue has been fortunate enough to not get attention of auto-reminder bots so far, but I am afraid they will be here anytime soon. I see this has been finally assigned, does it mean a fix is in the works ?

May 03 '24 14:05 avin3sh

Please do not close this issue until the underlying technical problem has been resolved.

On Jun 3, 2024, at 3:01 PM, microsoft-github-policy-service[bot] wrote: This issue has been open for 30 days with no updates. @riyapatel-ms, please provide an update or close this issue.


Jun 04 '24 12:06 julesroussel3

We have started seeing a new issue with nanoserver images released from April onwards (build 20348.2402+): the HTTP service running inside the container has started throwing 'System.Net.InternalException' was thrown. -1073741428, which, according to someone from the .NET platform, translates to The trust relationship between the primary domain and the trusted domain failed. (see: https://github.com/dotnet/runtime/discussions/105567#discussioncomment-10161657)
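The negative number in the exception above is a 32-bit NTSTATUS code printed as a signed integer. A minimal sketch of the conversion follows; the function and table names are hypothetical, and the table lists only the two trust-related codes relevant to this thread.

```python
# Trust-related NTSTATUS codes as documented by Microsoft (not exhaustive).
TRUST_RELATED_NTSTATUS = {
    0xC000018C: "STATUS_TRUSTED_DOMAIN_FAILURE",
    0xC000018D: "STATUS_TRUSTED_RELATIONSHIP_FAILURE",
}


def describe_status(signed_code: int) -> str:
    """Convert a signed 32-bit status code to hex and look up a known name."""
    unsigned = signed_code & 0xFFFFFFFF  # reinterpret as unsigned 32-bit
    name = TRUST_RELATED_NTSTATUS.get(unsigned, "unknown")
    return f"0x{unsigned:08X} ({name})"


print(describe_status(-1073741428))  # 0xC000018C (STATUS_TRUSTED_DOMAIN_FAILURE)
```

STATUS_TRUSTED_DOMAIN_FAILURE corresponds to the "primary domain and the trusted domain" message quoted here, while STATUS_TRUSTED_RELATIONSHIP_FAILURE corresponds to the "this workstation and the primary domain" message from the original report.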

As a result, all our new containers are failing to serve ANY incoming kerberized requests!! This is no longer intermittent. This is no longer about number of containers running simultaneously with a gMSA. This is straight up fatal error rendering the container pretty much unusable.

Now one would think "downgrading" to an older nanoserver image released prior to April would fix this? Wrong. That would make the problem even worse because of another unresolved Windows-Containers issue - https://github.com/microsoft/Windows-Containers/issues/502 -- downgrading will potentially cause all the container infrastructure to BSOD!!!

To summarize,

the original issue remains unresolved!
April onwards, you can't use latest or even older nanoserver images
apps built off newer images are pretty much incapable of Kerberos
going back to images built off March or prior CUs has potential to cause your container host to go in BSOD

This issue desperately needs a fix. It's almost as if you can't use Windows Containers for any of your gMSA and Active Directory use cases anymore!

Jul 26 '24 16:07 avin3sh

We are also facing a similar issue when using a gMSA with scaled Windows containers. We also pass a hostname at container creation, but because of the gMSA the containers identify themselves by the gMSA name. This leads to mismatches on our backend, which tries to keep track of incoming traffic. It gets heavily confused when all requests come from the same "machine". Of course, as long as only one container is using the gMSA, all is good. The moment I scale, it crashes. (Fun fact: the product that gets confused is also from Microsoft :P)

So also curious what will happen to this :)

Ultimately, this is what kills me (from here )

Can't it put the container/hostname as suffix or so ? :D

Aug 09 '24 11:08 KristofKlein

We appear to be facing a similar issue, "The trust relationship between the primary domain and the trusted domain failed", on our AKS cluster. Is this being worked on?

Sep 02 '24 09:09 NickVanRaaijT

Quick question on the environments in which you folks are seeing this issue: Is NETBIOS enabled in your environment? NETBIOS uses ports 137, 138, and 139, with 139 being Netlogon. I have tested this with a customer (who was kind enough to validate their environment) for whom a deployment with multiple pods worked normally. This customer has NETBIOS disabled, and port 139 between the pods/AKS cluster and the Domain Controllers is blocked.

I'm not saying this is a fix, but wanted to check if others see this error even with NETBIOS disabled or the port blocked.

Sep 10 '24 19:09 vrapolinario

From what I have found (I can do a more thorough test later), NETBIOS is disabled on the container host's primary interface and on the HNS management vNIC (we use Calico in VXLAN mode). However, the vNICs for individual pods show NETBIOS as enabled. We haven't done anything to block traffic on Port 139.

Do you suggest we perform a test after disabling NETBIOS on the Pod vNICs as well, AND blocking Port 139? I am not sure how to configure this within the CNI, but perhaps I can write a script to disable NETBIOS by making a registry change after the container's network has come up, unless you have a script handy that you could share.

BTW just to reiterate the severity from my earlier comment https://github.com/microsoft/Windows-Containers/issues/405#issuecomment-2253067799 - nanoserver images after March 2024 have made this problem worse. Earlier the issue was intermittent and dependent on some environmental factors but March 2024+ nanoserver images are causing 100% failures.

Sep 11 '24 13:09 avin3sh

Thanks @avin3sh for the note. No need for a fancy script or worrying from the cluster/pod side - if you block port 139 at the network/NSG level, this should help validate. Again, I'm asking here as a validation, we haven't been able to narrow it down yet, but we have customers running multiple containers simultaneously with no errors and I noticed they have NETBIOS disabled AND port 139 blocked.

As for the Nano Server issue, can you please clarify: The issue happens even if you launch just one container? You're saying gMSA is not working on Nano Server at all?

Sep 11 '24 16:09 vrapolinario

Thank you so much for clarifying. I will share my observation after blocking traffic on port 139.

As for the Nano Server issue, can you please clarify: The issue happens even if you launch just one container? You're saying gMSA is not working on Nano Server at all?

We have a bunch of ASP.NET services. We use Negotiate/Kerberos authentication middleware. If I use an ASP.NET nanoserver image that is using Windows build from ~March~ April 2024 or later, the Kerberos token exchange is straight up failing and no request is able to get authenticated. You can see SSPI blob exchange functions in the error call stack - see here for the full call stack -> https://github.com/dotnet/runtime/discussions/105567#discussion-6980650

So essentially our web services are not able to authenticate using negotiate when using any image from April or later. This does not happen if I launch just one container, but it happens 100% of the time if there are multiple containers. I think I haven't seen this behavior in the beefier windowserver image, but I can't say for sure as we don't generally use them due to their large size.

I have also seen varying behavior depending on whether the container user is ContainerUser or NT AUTHORITY\NetworkService - the issue exists in both the scenarios but manifests differently.

Sep 11 '24 19:09 avin3sh

@avin3sh a little off topic, but you may want to look at my project that can seamlessly translate tokens from jwt to kerberos and vice versa. It's often used as sidecar and it doesn't require container to be domain joined - it uses kerberos.net library under the covers which is a managed implementation instead of relying on sspi.

https://github.com/NMica/NMica.Security

Sep 11 '24 22:09 macsux

@vrapolinario I tried this with Port 139 blocked like so (for TCP, UDP, Inbound and Outbound):

New-NetFirewallRule -DisplayName "Block Port 139" -Direction Inbound -LocalPort 139 -Protocol TCP -Action Block

But the problem persisted.

Any chance the customer who tried this had a large number of domain controllers in their environment? We have seen that as long as the number of replicas in your deployment is less than or equal to the number of domain controllers in the environment, you typically don't run into this issue.

Sep 12 '24 13:09 avin3sh

We are happy to collaborate with you to test out various scenarios/experimental patches/etc. We already have a Microsoft Support case ongoing (@ntrappe-msft may be familiar) but it hasn't moved in several months. If you want to take a look at our case, we are more than willing to validate any suggestions that you may have for this problem.

Sep 12 '24 13:09 avin3sh
