
Using gMSA on multiple containers simultaneously causes Domain Trust Relationship to fail

Open avin3sh opened this issue 11 months ago • 19 comments

Describe the bug

When running multiple containers simultaneously using the same gMSA, on either the same host or different hosts, one or more containers lose their domain trust relationship, leading to various issues including LsaLookUp and negotiate auth failures. This especially happens when the number of containers is equal to or greater than the number of domain controllers in the environment. However, it is also possible to run into this issue when there are fewer containers than domain controllers, provided two or more containers attempt to talk to the same domain controller.
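To get a rough feel for why the issue becomes likely as the container count approaches the domain controller count, here is a minimal sketch. It assumes, purely for illustration, that every container picks one of the domain controllers uniformly and independently. Real DC selection is site-aware and not uniform, so treat the numbers as an intuition aid only. The function name `collision_probability` is hypothetical.

```python
def collision_probability(containers: int, domain_controllers: int) -> float:
    """Probability that at least two containers pick the same DC.

    ASSUMPTION: each container picks a DC uniformly and independently,
    which is a simplification of how DC discovery really works.
    """
    if containers > domain_controllers:
        return 1.0  # pigeonhole: more containers than DCs forces a shared DC
    p_all_distinct = 1.0
    for i in range(containers):
        # The (i+1)-th container must avoid the i DCs already taken.
        p_all_distinct *= (domain_controllers - i) / domain_controllers
    return 1.0 - p_all_distinct


# Example with 4 domain controllers and 1 to 5 containers.
for k in range(1, 6):
    print(k, round(collision_probability(k, 4), 3))
```

Under this simplified model, two containers already have a one-in-four chance of sharing a DC when there are four DCs, and N+1 containers always share one, which matches the N+1 suggestion in the reproduction steps.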

To Reproduce

  1. Build an image from the following Dockerfile
FROM mcr.microsoft.com/dotnet/aspnet:6.0-windowsservercore-ltsc2019 AS base

USER ContainerAdministrator
RUN reg.exe add "HKLM\SYSTEM\CurrentControlSet\Control\Lsa" /v LsaLookupCacheMaxSize /t REG_DWORD /d 0 /f

USER ContainerUser
ENTRYPOINT ["powershell.exe", "1..500 | %{ [void][System.Security.Principal.NTAccount]::new('contoso\\someobj').Translate([System.Security.Principal.SecurityIdentifier]).Value; Start-Sleep -Milliseconds (Get-Random -Minimum 100 -Maximum 1000); }"]

Replace contoso\someobj above with the SAM account name of an actual object.

  2. Run the container image simultaneously on multiple hosts using the following command. To increase the chances of running into the issue, if there are N domain controllers in the environment, run the container image simultaneously on at least N+1 hosts.
docker run --security-opt "credentialspec=file://gmsa-credspec.json" --hostname <gMSAName>  -it <image>

Replace <gMSAName> with the actual gMSA name, file://gmsa-credspec.json with the actual gMSA Credential Spec file, and <image> with the container image.

  3. Monitor the output of all the containers. Eventually, one or more containers will start throwing the following error message. This usually happens within the first few seconds of the container starting, assuming the docker run ... in (2) above was run simultaneously on different hosts. If it does not happen, repeat (2) until it does.

    Exception calling "Translate" with "1" argument(s): "The trust relationship between this workstation and the primary domain failed.

    While a running container is throwing the above error message in its output, exec into it and try performing some domain operation - that will fail as well.
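Watching several containers by eye gets tedious, so here is a minimal sketch of scanning captured container output for the error shown in step 3. The helper name `first_trust_failure` and the sample lines are hypothetical. In practice the lines could come from something like `docker logs <container>` saved to a file.

```python
# Substring of the error message shown in the reproducer output above.
TRUST_ERROR = (
    "The trust relationship between this workstation "
    "and the primary domain failed"
)


def first_trust_failure(lines):
    """Return the 1-based number of the first line containing the trust
    error (case-insensitive), or None if no line contains it."""
    needle = TRUST_ERROR.lower()
    for number, line in enumerate(lines, start=1):
        if needle in line.lower():
            return number
    return None


# Hypothetical captured output, for illustration only.
sample_output = [
    "container started",
    'Exception calling "Translate" with "1" argument(s): '
    '"The trust relationship between this workstation '
    'and the primary domain failed.',
]
print(first_trust_failure(sample_output))
```

This only detects the symptom in the output. It does not say anything about which domain controller the container was talking to.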

Expected behavior

Using a gMSA on multiple Windows containers has been officially supported since at least Windows Server 2019. Running a gMSA on multiple containers simultaneously should not cause the trust relationship to fail.

Configuration:

  • Edition: Windows Server 2022
  • Base Image being used: Windows Server Core
  • Container engine: docker

Additional context

  • While the reproducer uses a PowerShell base image to demonstrate the bug, we had originally run into this issue in an ASP.NET Core web application while performing negotiate authentication.

  • The container image in the reproducer purposefully disables LSA LookUp Cache by setting LsaLookupCacheMaxSize to 0 to simplify the example.

  • If you were to observe the traffic of a container that has run into this issue, the packet capture will show a lot of DCERPC/RPC_NETLOGON failure messages. You may also observe packets reporting nca_s_fault_sec_pkg_error.

  • Sometimes the container may "autorecover". It is purely a chance event. Whenever this happens, you can see RPC_NETLOGON packets in the network capture. Typically the container recovers its domain trust relationship only when the NETLOGON happens through a different domain controller than the one the container had communicated with earlier.

  • It is also possible to re-establish the domain trust relationship of a failing container by running the following command in it (the runtime user should be ContainerAdministrator or should have administrator privileges):

    nltest.exe /sc_reset:contoso.com
    

    If the above command does not succeed, you may have to run it more than once. When the command succeeds, more often than not, all the affected containers and not just the current container "recover".

  • As mentioned in the bug description, it is very easy to run into this issue when the count of containers is more than the number of domain controllers in the environment but that is not the only scenario.

  • docker run ... is not the only way to run into this issue. It can also be reproduced on an orchestration platform like Kubernetes, by setting the replica count of the Deployment to N+1 or by using the scaling feature.
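Since the nltest.exe /sc_reset command above may need to be run more than once, here is a minimal retry sketch. It assumes that a zero exit code from nltest.exe indicates success, which is not confirmed in this thread. The function name `reset_secure_channel`, the attempt count, and the delay are hypothetical choices. The `run` parameter exists only so the command runner can be swapped out for testing.

```python
import subprocess
import time


def reset_secure_channel(domain, attempts=3, delay_seconds=5.0,
                         run=subprocess.run):
    """Run `nltest.exe /sc_reset:<domain>` until it exits with code 0.

    ASSUMPTION: exit code 0 means the secure channel was reset.
    Returns the number of the attempt that succeeded, or 0 if none did.
    """
    command = ["nltest.exe", f"/sc_reset:{domain}"]
    for attempt in range(1, attempts + 1):
        result = run(command, capture_output=True, text=True)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            time.sleep(delay_seconds)  # wait before trying again
    return 0
```

Inside an affected container, calling `reset_secure_channel("contoso.com")` would try the reset up to three times. The same privilege requirement applies as for running the command by hand.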

avin3sh avatar Aug 02 '23 18:08 avin3sh

Hi, thanks for bringing this issue to our attention. First, I have to give credit where credit is due. This is so well written up! Thank you for providing a very clear description of the current and expected behavior.

Second, this is a quick question: Is there a reason why all the containers in this cluster all have the same gMSA?

ntrappe-msft avatar Aug 04 '23 18:08 ntrappe-msft

Is there a reason why all the containers in this cluster all have the same gMSA?

We actually don't use the same gMSA for all the containers in the cluster. Different types of application containers run with different gMSAs.

The problem arises when there are multiple instances (replicas) of the same application, such as an application that needs to be highly available. During my testing I also found that it does not have to be replicas of the same container image/deployment; different containers running as the same gMSA will also run into this issue.

Multiple containers running as same gMSA can't be avoided for these purposes - without them we can't distribute our workload or promise high availability.

avin3sh avatar Aug 07 '23 11:08 avin3sh

@ntrappe-msft has there been an internal confirmation of this bug and any discussion of a fix? This issue severely limits the ability to scale Windows containers and use AD authentication because of the direct relation between the number of containers and the number of domain controllers.

avin3sh avatar Sep 06 '23 12:09 avin3sh

Hi, thank you for your patience! We know this is blocking you right now and we're working hard to make sure it's resolved as soon as possible. We've reached out to the gMSA team to get more context on the problem and some troubleshooting suggestions.

ntrappe-msft avatar Sep 06 '23 16:09 ntrappe-msft

The gMSA team is still doing their investigation but they can confirm that this is unexpected and unusual behavior. We may ask for some logs in the future if it would help them diagnose the root cause.

ntrappe-msft avatar Sep 20 '23 20:09 ntrappe-msft

Hi, could you give us a few follow-up details?

  • Are you using process-isolated or hyper-v isolated containers?
  • Are you using the same container hostname and gMSA name?
  • What is the host OS version?

ntrappe-msft avatar Nov 06 '23 18:11 ntrappe-msft

Hi Nicole @ntrappe-msft

Are you using process-isolated or hyper-v isolated containers?

Process Isolation

Are you using the same container hostname and gMSA name?

Correct

What is the host OS version?

Microsoft Windows Server 2022 Standard (Core), with October CU applied

Sharing some more data from our experiments, in case it helps the team troubleshoot the issue:

  1. When all the containers of a gMSA are given a different, unique, value for the hostname, at least the Domain Trust Relationship error goes away - although that may have broken something else, we did not look in that direction. However;

  2. If the value of hostname for each container is >15 characters in length, and the value is unique BUT the first 15 characters are not unique, we again start seeing the Domain Trust Relationship issue. This interestingly coincides with the 15-character length limit for computer names (the NetBIOS limitation).

    This means if you have a very long value of hostname and first few characters are not unique, gMSA issues start occurring in multi-container scenario.

    If you were to use some container orchestration solution, like Kubernetes, the value of pod name, which is what gets supplied as hostname value to the container runtime, is in all the realistic scenarios >15 characters and the first few characters are common for each pod (deployment name + replicaset ID) -- this would cause problem with gMSAs in that case as well

  3. Just out of curiosity, instead of docker runtime, I directly used containerd and I could reproduce the problem there as well

  4. Not specifying a hostname when launching containers with the same gMSA does not give this error. I believe the container runtime internally assigns some random ID as the hostname in that case (scenario (1) above), which seems to imply that the problem here is multiple containers having the same name?

    In context of containers with gMSA, having same name as gMSA name has been the norm for a while. Not specifying hostname isn't always possible, explicitly specifying hostname shouldn't break the status quo, and when using orchestration solutions, like the example I listed above, the user has no direct control on the value of hostname.
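To illustrate the 15-character observation in (2), here is a minimal sketch that groups hostnames by their first 15 characters and reports the groups with more than one member. It assumes the effective name is simply the first 15 characters, upper-cased, which is an approximation of the NetBIOS rule. The pod names and the helper names `netbios_name` and `find_collisions` are hypothetical.

```python
from collections import defaultdict

NETBIOS_MAX_LEN = 15  # NetBIOS computer names are limited to 15 characters


def netbios_name(hostname: str) -> str:
    """Approximate the NetBIOS name: first 15 characters, upper-cased."""
    return hostname[:NETBIOS_MAX_LEN].upper()


def find_collisions(hostnames):
    """Group hostnames by truncated name; keep groups with >1 member."""
    groups = defaultdict(list)
    for name in hostnames:
        groups[netbios_name(name)].append(name)
    return {short: names for short, names in groups.items() if len(names) > 1}


# Hypothetical pod names in the usual <deployment>-<replicaset>-<suffix> form.
pods = [
    "webapp-7d9f8c6b5-abcde",
    "webapp-7d9f8c6b5-fghij",
    "worker-1",
]
print(find_collisions(pods))
```

Both `webapp-...` pods truncate to `WEBAPP-7D9F8C6B`, so they would look identical within the first 15 characters even though their full names are unique.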

This issue has been severely restricting usage of Windows Containers at scale :(

avin3sh avatar Nov 07 '23 07:11 avin3sh

🔖 ADO 47828389

ntrappe-msft avatar Nov 23 '23 01:11 ntrappe-msft

While we appreciate that the Containers team is still looking into this issue, I wanted to share some insights into just how seemingly difficult this problem is to work around.

In order to prevent requests landing on "bad" containers, I was trying to write a custom ASP.NET Core health check that could query the status of the container's trust relationship and mark the service as unhealthy when the domain trust fails. What seemed to be a very straightforward temporary fix/compromise for our problems turned out to be a complex anomaly:

  • Firstly, netapi32 DLL is not available in nanoserver, and won't be until next major release of Windows Server - https://github.com/microsoft/Windows-Containers/issues/72#issuecomment-1569257600
  • If we have the Server Core image as the base image and have the DLL moved to the nanoserver container, we could work around this but only to run into more problems
  • Within the gMSA container - the Win32 call will not automatically pick the Netlogon Policy Server
  • And if you do hardcode a domain controller for this purpose, the netlogon query response would still indicate that the trust relationship exists (NERR_Success as opposed to something like RPC_S_SERVER_UNAVAILABLE) - and this is while the container is actively reporting trust errors while performing AD operations
  • And even if we had managed to get all of this to work, to "repair" the Secure Channel we would have to run our container as ContainerAdministrator, which introduces a bunch of other security concerns
  • PowerShell commands such as Test-ComputerSecureChannel simply fail, because the interpretation of "hostname" is different within a gMSA Container vs. outside of it - where the command is typically used
  • In essence, any of the means to [programmatically] catch gMSA and Domain Trust issues for Containers, like ones documented at https://kubernetes.io/docs/tasks/configure-pod-container/configure-gmsa/#troubleshooting, turned out to be unhelpful

My guess is that the usual means of troubleshooting gMSA/trust problems are not working for us because of an attempt to fix a VERY SIMILAR problem for containers in Server 2019:

We changed the behavior in Windows Server 2019 to separate the container identity from the machine name, allowing multiple containers to use the same gMSA simultaneously.

Since we do not understand how this was achieved, we have again reached a dead end and are desperately hoping the Containers team is able to solve our gMSA-Containers-At-Scale problem.

avin3sh avatar Dec 27 '23 16:12 avin3sh

Thanks for the additional details. We've had a number of comments from internal and external teams struggling with the same issue. Our support team is still working to find a workaround that they can publish.

ntrappe-msft avatar Jan 25 '24 21:01 ntrappe-msft

Support team is still working on this. We'll make sure we also update our "troubleshoot gMSAs" documentation when we can address the problem.

ntrappe-msft avatar Feb 05 '24 21:02 ntrappe-msft

We're also running into this issue. We're using Windows Server 2019 container images. There are no multiple container instances running with the same gMSA, but we still get the same error about trust. Our case is that when we try to log in with an AD user it doesn't work, but the gMSA does work. Should I raise a ticket with support for assistance?

Update:

  • All of our containers have the same host name even if they run using different gMSAs
  • Using a different name for the containers does not solve the issue

israelvaldez avatar Feb 28 '24 21:02 israelvaldez

Hello @ntrappe-msft - is the Containers team in touch with the gMSA/CCG group? Our support engineers informed us that we are the only ones who have reported this issue, but based on your confirmation in https://github.com/microsoft/Windows-Containers/issues/405#issuecomment-1911045014, and judging from the reactions on this issue, it is clear there are many users who have run into this exact problem.

Our case is that when we try to log in with an AD user it doesn't work, but the gMSA does work. Should I raise a ticket with support for assistance?

@israelvaldez, see my comment above. I would think it is worth highlighting this problem to Microsoft Support from your end as well, so that it is obvious, without any doubt, that multiple customers face this and it can be appropriately prioritized (if it is not already).

avin3sh avatar Mar 06 '24 12:03 avin3sh

Hi @ntrappe-msft, we are also experiencing the same issue: our gMSA containers intermittently lose trust with our domain and need to be restarted. Wondering if Microsoft has any update on this issue.

We have multiple container instances running the same app and using the same gMSA. Interestingly, even though each of them has its own unique hostname defined, the log shows it connecting to the DC using the gMSA name as MachineName. Host/domain/DC names are replaced with **.

EventID            : 5720
MachineName        : gmsa_**
Data               : {138, 1, 0, 192}
Index              : 1309
Category           : (0)
CategoryNumber     : 0
EntryType          : Error
Message            : The session setup to the Windows Domain Controller \** for the domain ** failed because the computer gmsa_** does not have a local security database account.
Source             : NETLOGON
ReplacementStrings : {\**, **, **}
InstanceId         : 5720
TimeGenerated      : 13/03/2024 10:23:24 AM
TimeWritten        : 13/03/2024 10:23:24 AM
UserName           :
Site               :
Container          :

WillsonAtJHG avatar Mar 18 '24 01:03 WillsonAtJHG

@avin3sh you are definitely not the only one experiencing this Issue. There are a number of internal teams who would like to increase the severity of this Issue and the attention it gets. I'm crossing my fingers that we'll have a positive update soon. But it does help us if more people comment on this thread highlighting that they too are encountering this problem.

ntrappe-msft avatar Mar 20 '24 23:03 ntrappe-msft

This is a huge issue for us at Broadcom, with multiple Fortune 100 customers wanting this feature in one of our products and thousands of workloads being blocked from being migrated off VMs to containers.

macsux avatar Mar 27 '24 15:03 macsux

In my scenario I created a new gMSA other than the one I was using (which was not being used in multiple pods) and I was able to work around this problem, i.e. my pod had gmsa1, I created gmsa2, and suddenly the trust between the pod and the domain was fine.

israelvaldez avatar Apr 03 '24 16:04 israelvaldez

The workaround is appreciated, but we would like to see Microsoft fix this issue directly so that customers do not need to significantly redesign their environments.

julesroussel3 avatar Apr 03 '24 18:04 julesroussel3

This issue has been fortunate enough not to get the attention of auto-reminder bots so far, but I am afraid they will be here anytime soon. I see this has finally been assigned. Does it mean a fix is in the works?

avin3sh avatar May 03 '24 14:05 avin3sh