
`--isolation=process` fails sporadically on machines using SplunkForwarder

Open augusew1 opened this issue 1 year ago • 47 comments

UPDATE

Please see this comment for the latest on this bug

Original Report:

Describe the bug
--isolation=process does not work on Windows Server 2022 with a windows/servercore image whose build matches the host

To Reproduce

  1. Install the latest docker engine using the script from this repository
  2. Do docker pull mcr.microsoft.com/windows/servercore:ltsc2022
  3. Verify that the docker image is the same build number as the container host
  4. Do docker run -it mcr.microsoft.com/windows/servercore:ltsc2022

The container will fail with the following error:

docker: Error response from daemon: container 08a82aa4af2577260662b370a64b7b1584430103cf17c007971f73bdbac58cdc encountered an error during hcs::System::Start: context deadline exceeded. 
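For completeness, the repro as commands. The `--format '{{.OsVersion}}'` check is how I verified step 3; treat this as a sketch of my session rather than exact output:

```powershell
# Host build number (compare against the image's OS version below).
cmd /c ver
docker pull mcr.microsoft.com/windows/servercore:ltsc2022
docker image inspect --format '{{.OsVersion}}' mcr.microsoft.com/windows/servercore:ltsc2022

# The command that then times out with hcs::System::Start:
docker run -it --isolation=process mcr.microsoft.com/windows/servercore:ltsc2022
```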

Expected behavior
The container should run

Configuration:

  • Edition: Windows Server 2022 Data Center (10.0.20348)
  • Base Image being used: Windows Server Core (ltsc2022, sha256: 29e69ab16e91)

Output of docker info:

Client:
 Version:    24.0.2
 Context:    default
 Debug Mode: false

Server:
 Containers: 13
  Running: 0
  Paused: 0
  Stopped: 13
 Images: 21
 Server Version: 24.0.2
 Storage Driver: windowsfilter
  Windows:
 Logging Driver: json-file
 Plugins:
  Volume: local
  Network: ics internal l2bridge l2tunnel nat null overlay private transparent
  Log: awslogs etwlogs fluentd gcplogs gelf json-file local logentries splunk syslog
 Swarm: inactive
 Default Isolation: process
 Kernel Version: 10.0 20348 (20348.1.amd64fre.fe_release.210507-1500)
 Operating System: Microsoft Windows Server Version 21H2 (OS Build 20348.1668)
 OSType: windows
 Architecture: x86_64
 CPUs: 6
 Total Memory: 12GiB
 Name: aplops-win-02
 ID: e324222e-b334-4a15-ba5a-734bec1f5d5a
 Docker Root Dir: C:\ProgramData\Docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
 Product License: Community Engine

Additional context

  • --isolation=hyperv does work
  • This container host is actually a VM, being hosted on VMWare ESXi
  • The base image for the VM was provided by my company, they have installed additional security software to the system, but I have had extensive conversations with my IT department to confirm all relevant exceptions and exemptions are in place for docker to work properly
  • Process isolation was working before the KB5026370 update; however, uninstalling it does not solve the issue. There was an additional Servicing Stack Update (10.0.20348.1720) applied at the same time as KB5026370 that cannot be uninstalled.

Currently, our best guess is some combination of the following is creating an issue:

  • ESXi hypervisor
  • KB5026370
  • --isolation=process

We are working on testing this on a bare metal machine to see if ESXi issues can be ruled out.

augusew1 avatar Jun 02 '23 17:06 augusew1

Hi! Thanks for posting your issue.

When I see errors with hcs::System::Start, they're often caused by resource constraints (e.g. too little memory), networking, or incompatible host/container types. I did a very simple repro: a Windows Server Core LTSC 2022 container with --isolation=process on Windows Server 2022 Data Center (10.0.20348). That was successful.

However, it's interesting that you mentioned everything worked fine before the May 2023 (KB5026370) update. I'm going to create an internal ticket (#44926683) to see whether the update is causing any issues or if we can suggest any mitigation strategies.

We'll keep you posted when we have an update.

ntrappe-msft avatar Jun 05 '23 21:06 ntrappe-msft

Resources on this system should not be an issue; it has 12 GB of RAM and 6 vCPU cores. Disk space should not be an issue either; the disk is ~100 GB. We still have not had a chance to reproduce the issue on bare metal, i.e. without ESXi.

augusew1 avatar Jun 06 '23 11:06 augusew1

I agree that it doesn't seem like a resource issue. I used fewer cores and less memory and it started up for me. If you feel up to the challenge, there are 2 things you could try doing. I'll keep trying to repro your error in the meantime.

Task 1: Get a fresh Windows Server Core 2022 image.

  • Remove the current image: docker rmi mcr.microsoft.com/windows/servercore:ltsc2022
  • Get the most recent: docker pull mcr.microsoft.com/windows/servercore:ltsc2022

Task 2: See what's timing out and blocking the container start up process.

  • Since you're using a process-isolated container, there may be a deadlock causing the start-up process to time out.
  • Open Task Manager.
  • Under "Details", turn on extra columns like "Session ID".
  • Try to find the most recent session ID that would correspond to the container trying to boot up.
  • See if any of the processes under that session ID are deadlocked. (Or screenshot the processes you see.)

Sidenote: I know the second one can be confusing. We don't expect our customers to troubleshoot their own problems, but if you CAN find the processes related to your container session and share them, we may get some insight into what's going on.
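If Task Manager is awkward, the same hunt can be scripted from PowerShell. A sketch; replace the session ID with whatever new ID appears right after `docker run`:

```powershell
# Group processes by session to spot the new session created for the container.
Get-Process | Group-Object SessionId

# Then list the processes in that session (replace 2 with the new session's ID).
Get-Process | Where-Object { $_.SessionId -eq 2 } |
    Select-Object Id, SessionId, ProcessName, StartTime
```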

ntrappe-msft avatar Jun 07 '23 18:06 ntrappe-msft

Task 1 we have attempted (including spinning up a brand new, fresh VM and installing docker from scratch). As for task 2, I am more than willing to troubleshoot, my issue is that I don’t know where to start. This is good information that I did not know and I will attempt it this week. We are also working in parallel to stand up our WS2022 image on some loaner hardware to see if it’s something about the VM that is causing this. Unfortunately, due to internal policy requirements, this will take some time.

One question I did have: hcsshim seems to have some specific troubleshooting guidelines; are any of those relevant here? I know https://github.com/microsoft/hcsshim exists as well.

augusew1 avatar Jun 07 '23 19:06 augusew1

I did get a hit on the Session ID. It looks like a process called smss.exe starts every time I run docker, in a brand new session.

[screenshot: Task Manager details showing smss.exe in the new session]

augusew1 avatar Jun 09 '23 01:06 augusew1

@ntrappe-msft We were able to test this on some loaner hardware, just a laptop. It was not connected to our domain so some security software (in monitor-only mode) was not running, but it should be a 1:1 test otherwise. The test of docker run --isolation=process with this fresh laptop worked.

With the release of KB5027225 and KB5027544, our test VM was updated to include these. Furthermore, a new Windows Server Core ltsc2022 image was released (a0cb8066ecaf). We removed all docker images on the VM and re-downloaded them to match what was on the laptop. Still, process isolation did not work. The docker version was still 24.0.2 on both.

Presently, we have the following situation (assuming ltsc2022@a0cb8066ecaf on both):

  • 10.0.20348 w/ latest patches as of 6/17/23 on a laptop: --isolation=process does work
  • 10.0.20348 w/ latest patches as of 6/17/23 on VMWare ESXi: --isolation=process does not work

So, we currently believe this to be some bug with WS2022 guests on ESXi. While we do have some evidence that this was working before KB5026370, we have not been able to reproduce a working setup by just reverting the KB patch. Any advice for further debugging would be greatly appreciated, as I think we are stuck on our end. The timeout is still happening, with smss.exe being the only process listed in that session.

Another thing to add: I am more than willing to do some deep, technical troubleshooting and provide logs or additional information relevant about our infrastructure. But as of right now, I don't really know what is useful or relevant.

augusew1 avatar Jun 18 '23 19:06 augusew1

Hmm ok this is interesting. I'll do a bit more digging with the new information you've given. Thanks for being so open to help debug, I may take you up on that if I can't repro the error you're getting.

ntrappe-msft avatar Jun 19 '23 17:06 ntrappe-msft

We're currently waiting on a VM with the VMWare hypervisor to repro on our end. Thanks for your patience.

ntrappe-msft avatar Jun 27 '23 16:06 ntrappe-msft

This issue has been open for 30 days with no updates. @ntrappe-msft, please provide an update or close this issue.

Still working on this. I was OOF for 2 weeks so I apologize for the delay.

ntrappe-msft avatar Jul 31 '23 22:07 ntrappe-msft

I appreciate the update. We've also been working on our end to open a ticket with VMWare; it took a lot of back-and-forth to convince our virtualization team that this could be a bug on their end.

augusew1 avatar Aug 03 '23 18:08 augusew1

Hi, I'm currently waiting on an Azure VM running VMware. Once I have that, we'll try to repro your bug!

ntrappe-msft avatar Aug 15 '23 18:08 ntrappe-msft

This issue has been open for 30 days with no updates. @ntrappe-msft, please provide an update or close this issue.

Hi @ntrappe-msft ,

Just wanted to follow up with what we've found on our end. We have reached out to VMWare, but have been told that we must get support via Docker, as VMWare doesn't support individual applications, even with their production support. They said Docker would have to troubleshoot with them directly, after we confirm with their support that this is a VMWare issue. In truth, while evidence points to it, we don't exactly know if VMWare is the real issue. We can't debug any further on that end.

They also pointed us to this documentation from Docker claiming that nested virtualization is "not supported". (I know that page is for Docker Desktop, but IIRC it's still running the same container runtime underneath, perhaps with some proprietary changes?)

The above is problematic for us for a couple reasons:

  • We have no desire to use Docker Desktop just to get support from Docker for an OS that Docker Desktop does not support
  • Our VMWare support license does not allow us to troubleshoot with VMWare directly, so our debugging efforts have reached a dead end

Given the above, we have concluded that Docker on Windows with nested virtualization is too difficult to support for our company at this time, and have instead procured physical hardware. ~~Feel free to leave this issue open or close, we currently have no reason to believe this is a Microsoft issue.~~ See below

augusew1 avatar Sep 19 '23 14:09 augusew1

I apologize for backtracking, but now that we have the physical hardware set up, it seems we were wrong. Process isolation is still not working. Since this has nothing to do with VMWare, we're going back to investigating our security/compliance software.

Unfortunately there's not much I can share about our infrastructure without permission, but please see below:

System Info

Docker info

Client:
 Version:    24.0.6
 Context:    default
 Debug Mode: false

Server:
 Containers: 7
  Running: 0
  Paused: 0
  Stopped: 7
 Images: 31
 Server Version: 24.0.6
 Storage Driver: windowsfilter
  Windows:
 Logging Driver: json-file
 Plugins:
  Volume: local
  Network: ics internal l2bridge l2tunnel nat null overlay private transparent
  Log: awslogs etwlogs fluentd gcplogs gelf json-file local logentries splunk syslog
 Swarm: inactive
 Default Isolation: process
 Kernel Version: 10.0 20348 (20348.1.amd64fre.fe_release.210507-1500)
 Operating System: Microsoft Windows Server Version 21H2 (OS Build 20348.1970)
 OSType: windows
 Architecture: x86_64
 CPUs: 40
 Total Memory: 255.9GiB
 Name: aplops-win-1
 ID: 6bb5c255-fbff-434d-8de7-08b1e8b8e59e
 Docker Root Dir: F:\docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
 Product License: Community Engine

Servercore image

mcr.microsoft.com/windows/servercore                                      ltsc2022   2b76bb46ced7   2 weeks ago     4.11GB

System specs

Processor information

Caption           : Intel64 Family 6 Model 63 Stepping 2
DeviceID          : CPU0
Manufacturer      : GenuineIntel
MaxClockSpeed     : 2600
Name              : Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
SocketDesignation : CPU1

Caption           : Intel64 Family 6 Model 63 Stepping 2
DeviceID          : CPU1
Manufacturer      : GenuineIntel
MaxClockSpeed     : 2600
Name              : Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
SocketDesignation : CPU2

Memory

256 GB

Storage

2 TB HDD

OS Version

Microsoft Windows [Version 10.0.20348.1970]

I would greatly appreciate some more information on HCS, specifically what syscalls or Windows features it's calling into that could be affected by:

  • Group Policy
  • Antivirus software
  • Blocked DLLs (via the above)
  • Missing features
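In case it helps: the one diagnostic channel I do know about is the Hyper-V-Compute event log, which HCS appears to write to. Dumping it right after a failed start might show which operation is stalling (channel name is from my own notes, so treat it as an assumption):

```powershell
# Recent entries from the Hyper-V-Compute channel, where HCS logs operations.
Get-WinEvent -LogName Microsoft-Windows-Hyper-V-Compute-Operational -MaxEvents 50 |
    Format-Table TimeCreated, Id, LevelDisplayName, Message -AutoSize
```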

augusew1 avatar Sep 20 '23 13:09 augusew1

Thank you for doing such a thorough investigation and attempting to resolve this on your end. Our intuition is that this probably isn't an issue with the VMware hypervisor but something else. I'll try to repro this using all the specs you provided and hopefully we can see your error. Otherwise, I'll work on getting you the information we're allowed to share on HCS.

ntrappe-msft avatar Sep 20 '23 20:09 ntrappe-msft

Hi @ntrappe-msft

Curious update that may either a) make everything clearer on Microsoft's side, or b) make everything more confusing on Microsoft's side. Hopefully the former! I have been able to reproduce getting --isolation=process both to work and to fail. Once our machine is rebooted, --isolation=process will fail. However, if we install and then remove Hyper-V as a Windows Feature, then reboot, it will work.

Reproduction steps

  1. Reboot machine
  2. Confirm --isolation=process times out with hcs::System::Start: context deadline exceeded.
  3. Confirm Hyper-V is OFF
  4. Run: Install-WindowsFeature -Name Hyper-V -Restart
  5. Once rebooted, run Remove-WindowsFeature -Name Hyper-V -Restart
  6. Once the machine has rebooted again, try docker, and --isolation=process will work.

I have successfully tested this 3 times in a row. I'm pretty baffled, but it does work. Note that process isolation still feels slow; it takes about 30 seconds to a minute to actually start the container. I'm guessing this is security-software related.

EDIT: Correction: rebooting the machine between installing and removing Hyper-V works consistently; installing and removing within the same boot only sometimes works.

augusew1 avatar Oct 06 '23 13:10 augusew1

Sorry for the delay, I'm working on using a separate VMware hypervisor to try to repro this. I'll update the Issue next week with my results.

ntrappe-msft avatar Nov 23 '23 00:11 ntrappe-msft

Hi @ntrappe-msft,

We have been doing some renewed debugging at our company, looking into exactly when and how docker is failing, and we've made some progress. To start: it's confirmed not a VMWare issue, or at least not only a VMWare issue.

Second, while toggling Hyper-V on/off does seem to work, it's not as consistent as I first implied. In fact, what really seems to matter is when docker run --isolation=process is run. From this, and issue #426, we were able to narrow it down to the SplunkForwarder service affecting docker. It appears to be a race condition: if you run docker before this service starts, it will work, and if you run docker after it starts, it will fail. The workaround in #426 also works, that is:

  1. Stop SplunkForwarder service
  2. Run docker run --isolation=process .... (confirm that it works and the container loads)
  3. Start SplunkForwarder service

Docker now works for the remainder of the boot, i.e. this process needs to be repeated once per boot. For now, we are switching debugging efforts to investigate what these services possibly have in common with one another and where they overlap. As far as we can tell, SplunkForwarder only needs a port or two open, and does not use any virtualization/containerization features. So this is perhaps a firewalling issue, but it is not obvious why this would cause docker to completely fail. Any insight on Microsoft's side on how this could happen is appreciated. If @brwilkinson has any insights on his end that would also be appreciated. Hopefully we can all share notes and figure this out.
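Scripted, the once-per-boot workaround looks like this (assuming the Windows service is literally named SplunkForwarder, as it is on our hosts; run elevated):

```powershell
# Once per boot: get one container started while the forwarder is down.
Stop-Service -Name SplunkForwarder
docker run --rm --isolation=process mcr.microsoft.com/windows/servercore:ltsc2022 cmd /c exit
Start-Service -Name SplunkForwarder
```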

augusew1 avatar Nov 27 '23 13:11 augusew1

Hi, thank you for all the work you're doing debugging! I'm trying to repro the race condition now. Out of curiosity, can I ask what you're using the SplunkForwarder service for?

ntrappe-msft avatar Dec 07 '23 20:12 ntrappe-msft

Just accidentally spotted this. I can confirm having the same issue, now for years, with our Windows container workloads; we bottomed out at the SplunkForwarder workaround a long time ago. We're still using Windows Server 1809 and ltsc1809 container images, got nothing from engaging Splunk support, and seeing this discussion, it seems we're doomed to keep using the stop-start workaround on newer Windows versions as well. @ntrappe-msft: the company I work for uses the Splunk forwarder service to ingest logs from hosts into Splunk.

ArtyomBaranovskiy avatar Dec 26 '23 02:12 ArtyomBaranovskiy

Thanks for confirming the same problem and that there's a definite issue with the SplunkForwarder service and running containers. Did you happen to see the same behavior in that a container fails to start up if docker is run after the SplunkForwarder service starts up?

ntrappe-msft avatar Dec 27 '23 21:12 ntrappe-msft

We have significant evidence that Splunk prevents docker from starting containers. Here are some mitigation tips while we try to investigate what's going wrong:

Tip 1:

  • Stop the Splunk Service
  • Create at least 1 container successfully (can be a dummy container)
  • Start the Splunk Service
  • Create the containers you actually need

Tip 2:

  • Exclude any and all Windows application logs from the Splunk inputs.conf file
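As a sketch of Tip 2, assuming the stanza name from Splunk's Windows event log input documentation (adjust the path to wherever your deployment keeps inputs.conf):

```ini
# %SPLUNK_HOME%\etc\system\local\inputs.conf
[WinEventLog://Application]
disabled = 1
```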

ntrappe-msft avatar Dec 27 '23 21:12 ntrappe-msft

> Thanks for confirming the same problem and that there's a definite issue with the SplunkForwarder service and running containers. Did you happen to see the same behavior in that a container fails to start up if docker is run after the SplunkForwarder service starts up?

@ntrappe-msft yes - the containers always fail to start if the Splunk forwarder is running; however, I am only speaking from Windows Server 2019 experience. The reliable way to start any container in my case is to stop the Splunk forwarder, start the containers, then start the Splunk forwarder again. What you're suggesting in "Tip 1" above certainly doesn't work.

In Tip 2, do you actually mean Windows event logs? (We do ingest all event logs into Splunk too; there are cases where it's useful, like searching for a certain application crash error across the fleet of hosts.)

ArtyomBaranovskiy avatar Dec 29 '23 03:12 ArtyomBaranovskiy

Can we please update the title of this issue to map to the current findings?

brwilkinson avatar Jan 03 '24 19:01 brwilkinson

@ArtyomBaranovskiy I'm sorry to hear that even Tip #1 doesn't work for you (some people have had success with it). And yes, for #2, the Windows event logs.

ntrappe-msft avatar Jan 25 '24 21:01 ntrappe-msft

Thanks for everyone's patience. We need to understand more about how the SplunkForwarder could be interfering with Docker so we've reached out to our contacts at Splunk. I'll update this when I learn more.

ntrappe-msft avatar Jan 25 '24 21:01 ntrappe-msft

I didn't want to update this issue before actually testing it myself, but we've found another clue, and since @ntrappe-msft is reaching out to Splunk, I think it's relevant to bring up now: Splunk may be interfering with named pipes, see this page.

Once we have the chance, I'm going to try running docker containers over a TCP socket and see if that works without any workarounds. I encourage anyone who is able to try this to do so. Our main machine is under heavy use during the week, but I was going to attempt it this weekend.
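Concretely, the plan is something like the following (default unencrypted port and daemon.json path as in Docker's Windows docs; a sketch, not a production setup):

```powershell
# Engine side: add a TCP host in C:\ProgramData\docker\config\daemon.json,
#   { "hosts": ["tcp://127.0.0.1:2375", "npipe://"] }
# then restart the service. Client side:
Restart-Service docker
$env:DOCKER_HOST = 'tcp://127.0.0.1:2375'
docker run --rm --isolation=process mcr.microsoft.com/windows/servercore:ltsc2022 cmd /c exit
```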

augusew1 avatar Jan 26 '24 13:01 augusew1

Listening on a TCP socket did not work. I still got the "context deadline exceeded" error. So, Splunk is somehow interfering with docker at a lower level than the host socket.

augusew1 avatar Jan 27 '24 13:01 augusew1

If this is an issue with the Splunk forwarder, it would be more appropriate to raise it with them through their support channels.

fady-azmy-msft avatar Feb 01 '24 18:02 fady-azmy-msft