Container networking on Kubernetes broken after Server 2022 July 2024 / KB5040437 (OS Build 20348.2582) update

Open avin3sh opened this issue 1 year ago • 64 comments

Describe the bug

Pod networking breaks after installing the July CU on Windows Server 2022. For example, ping microsoft.com from within the container returns "General failure". The pod is also unreachable from other pods and through a Service.

Uninstalling KB5040437 fixes the issue.

To Reproduce

  • Setup a Windows worker with Calico VXLAN CNI provider (https://docs.tigera.io/calico/latest/getting-started/kubernetes/windows-calico/manual-install/standard)
  • Install the July cumulative update on the Server 2022 worker
  • Exec into a running container on the worker
  • ping or curl any address
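The last two steps can be scripted. The following is a minimal sketch, not part of the original report: it assumes `kubectl` is on the PATH and already pointed at the affected cluster, and the helper names `ping_from_pod` and `has_general_failure` are hypothetical. It only looks for the "General failure" text described above.

```python
import subprocess


def has_general_failure(ping_output: str) -> bool:
    """Return True if the ping output contains the 'General failure' symptom."""
    return "general failure" in ping_output.lower()


def ping_from_pod(pod: str, namespace: str = "default",
                  target: str = "microsoft.com") -> str:
    """Run one Windows ping inside the pod via kubectl and return its output.

    Assumption: the container image ships ping.exe (Server Core does).
    """
    cmd = ["kubectl", "exec", pod, "-n", namespace, "--",
           "ping", "-n", "1", target]
    # check=False: a failing ping is the expected symptom, not an exception.
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout + result.stderr
```

For example, `has_general_failure(ping_from_pod("my-windows-pod"))` would return True on an affected node, where `my-windows-pod` is a placeholder pod name.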

Expected behavior

The pod should be able to reach the external network and should be reachable from other pods.

Configuration:

  • Edition: Windows Server
  • Base Image being used: Windows Server Core
  • Container engine: containerd
  • Container Engine version: 1.6.31

/label Windows on Kubernetes

avin3sh avatar Jul 17 '24 19:07 avin3sh

+@jsturtevant who saw this in https://github.com/kubernetes/test-infra/pull/33042

avin3sh avatar Jul 20 '24 04:07 avin3sh

@grcusanz / @kestratt have either of you seen this Issue popping up for you?

ntrappe-msft avatar Jul 22 '24 23:07 ntrappe-msft

I'm having issues with my AKS testing with the latest images. It looks to be related.

zylxjtu avatar Jul 23 '24 16:07 zylxjtu

Folks over at Calico pointed back to this issue, suggesting the problem isn't on Calico's end (https://github.com/projectcalico/calico/issues/9019#issuecomment-2248938484).

Currently our workers can't have up to date Security patches because of this. I noticed the ADO label so I am hoping we will have some update soon 🤞

avin3sh avatar Jul 25 '24 12:07 avin3sh

We are having this exact same issue with our windows deployments which are using mcr.microsoft.com/dotnet/framework/aspnet:4.8.1-windowsservercore-ltsc2022.

Any update or suggestion on a fix would be greatly appreciated.

kysu1313 avatar Jul 25 '24 18:07 kysu1313

@kysu1313 Do you have the Windows patch KB5040437 installed too?

ntrappe-msft avatar Jul 31 '24 18:07 ntrappe-msft

I have the same issue. After uninstalling KB5040437, network connectivity is restored.

orest-gulman avatar Jul 31 '24 21:07 orest-gulman

@ntrappe-msft Can you confirm whether we can expect a fix in the August CUs? We have been holding off upgrading to the July CU, but leaving our cluster unpatched for two or more consecutive months raises security concerns.

avin3sh avatar Aug 01 '24 06:08 avin3sh

@avin3sh We're getting this assigned to an engineer right now. Once we do that, they can inform everyone of what the timeline looks like.

ntrappe-msft avatar Aug 01 '24 18:08 ntrappe-msft

3 weeks have passed, and not only do we not have a fix for a bug that makes Windows containers unusable, we don't even have a timeline. That looks very strange.

Nova-Logic avatar Aug 09 '24 08:08 Nova-Logic

@Nova-Logic Sorry for the delay, we know this is a big blocker. We've switched it to a new engineer and should have an update to provide next week.

ntrappe-msft avatar Aug 09 '24 18:08 ntrappe-msft

@avin3sh how are you uninstalling KB5040437? I received this error when attempting to uninstall via wusa /uninstall /kb:5040437 /norestart: "Security Update for Microsoft Windows (KB5040437) is required by your computer and cannot be uninstalled."

I also have the General failure ping errors on a fully updated Windows Server 2019 as well.. Calico seems completely broken for Windows in general right now..

vemsec avatar Aug 12 '24 15:08 vemsec

No mention of this issue in today's patches. I am guessing this was not addressed ?

avin3sh avatar Aug 13 '24 20:08 avin3sh

How is this still not fixed? We can't update any of our windows nodes as the patch can't even be uninstalled..

davidgiga1993 avatar Aug 14 '24 05:08 davidgiga1993

I just tried and can confirm the August patch / KB5041160 / does not fix the issue. The patch contains Important CVEs which leaves our cluster potentially vulnerable if not patched. @ntrappe-msft I appreciate an engineer is already assigned this issue but is it possible for us to get some update on the fix ?

avin3sh avatar Aug 14 '24 09:08 avin3sh

We are coming to the end of another week, can we please have the update we were promised:

"We've switched it to a new engineer and should have an update to provide next week."

avin3sh avatar Aug 16 '24 11:08 avin3sh

Unfortunately, I don't have news to share yet of a fix. We're waiting on a response from the engineer assigned. We'll bump this Issue up in priority.

ntrappe-msft avatar Aug 16 '24 21:08 ntrappe-msft

Any update? At least a rough estimate / schedule? Currently k8s Windows container networking is simply broken and not usable. We will soon be forced to terminate all our Windows nodes, as we can't patch them anymore due to this issue.

davidgiga1993 avatar Aug 21 '24 12:08 davidgiga1993

We are a large customer of Windows Containers and are deeply concerned that this issue remains unresolved.

Neither the July nor August security updates even acknowledge this issue under the "Known issues in this update" section.

We are curious what criteria a Containers issue must meet to warrant expedited support and official mention in monthly updates. Does "everything about container networking is broken after July" not meet these criteria?

The support on this problem so far has raised several internal questions about stability of Windows Containers as a platform. The way Microsoft handles this problem will dictate how seriously we would be able to take Windows Containers for any initiatives going forward.

beedle2017 avatar Aug 22 '24 09:08 beedle2017

It's really sad, but I believe we should admit this:

  1. Since a fix is still not available, it seems Microsoft doesn't have sufficient resources to support Windows containers and to continue their development.
  2. Windows containers are not, and will not be, a production-grade solution. The release of a CU that broke container networking is clear evidence that Microsoft did not test that CU with Windows containers (or did not test it properly, relying only on the fact that if a container started, all is OK).
  3. Those who relied on it should migrate to PowerShell DSC, Terraform, or both, because of point 2.

It's hard to ruin a product's reputation more than Microsoft did: release a CU that broke container networking and then just ghost the customers for more than a month. MS didn't even bother to mention the issue under known issues (or it's possible that MS still isn't fully aware of the problem).

We (I mean the community) can try to check whether Microsoft cares about this product by spreading this insane story across dev/devops/tech bloggers and watching MS's reaction.

Nova-Logic avatar Aug 22 '24 11:08 Nova-Logic

As we head into another week, do we have any new update ? As we inch closer to next month's patches, the growing uncertainty about the fix means we will have to force the hosts to update anyway and look at some alternative for hosting the workloads - can't leave the Windows workers unpatched for three months in a row.

All of this tedious, extra work can be avoided or at least planned better if there is some transparency on how Windows Containers team is planning to tackle this issue.

If this issue is affecting even the official sig-windows Kubernetes e2e tests, not prioritizing this problem paints a very bad picture of Windows Containers as a product, for both existing and future potential customers.

I tried some experimentation with Docker Swarm with overlay networking but couldn't reproduce this specific scenario, which seems to suggest the issue might be specific to encapsulation mode or ACLs on HNS Endpoints -- but again my guess is as good as anyone else's, and without some insight into the issue from the product team, it is difficult to even think of a workaround.

avin3sh avatar Aug 25 '24 10:08 avin3sh

27 August, still no fix

Nova-Logic avatar Aug 27 '24 12:08 Nova-Logic

I apologize for my ignorance, but I'd really appreciate if someone here in the community can clarify the nature and scope of this issue for me.

My understanding from the thread above is that Microsoft's July update for Windows Server 2022 has somehow borked networking for Windows pods/containers deployed to Kubernetes nodes running that version of Windows Server. However, do we know the extent to which the various local/cloud flavours of Kubernetes environment(s) might be affected? For example, has anyone observed this same behaviour when using the latest versions of the Amazon "Kubernetes optimized AMIs" in EKS, or similar counterparts in AKS?

As for what might be causing the issue, I wonder if there is a potential for some underlying dependency issue with the [versions of the] tools used to build the Windows container images themselves? For example, the version/patching of the Windows base image that the container is built from?

Regardless, the apparent lack of any cogent response from Microsoft is definitely... disquieting.

jwilsonCX avatar Aug 27 '24 16:08 jwilsonCX

@jwilsonCX yes we're using the aws optimized eks images, same issue. Although we're not using the Amazon CNI but rather calico which uses the windows HNS features

davidgiga1993 avatar Aug 27 '24 16:08 davidgiga1993

@jwilsonCX I have bare Kubernetes deployed in Hyper-V VMs with nested virtualization, using Calico + VXLAN. The cluster contains 3 master, 3 Linux worker, and 2 Windows worker nodes. On one of the nodes (seemingly at random), containers have no network (Ping transmit: general failure). It seems it should somehow be related to HNS. I also tried using both old and post-patch build servers, and older/newer images, but that did not help.

Nova-Logic avatar Aug 27 '24 17:08 Nova-Logic

Thanks for those replies, @davidgiga1993 and @Nova-Logic. We're running Windows containers in EKS, but are using the Amazon CNI. I've been holding off making any changes/updates since this ticket was opened because I'm afraid of downing our working (quasi-production) cluster. Was really hoping for more clarity from MS as to what the heck is going on before submitting ourselves as guinea pigs.

jwilsonCX avatar Aug 27 '24 18:08 jwilsonCX

Hi All, we are aware of this issue and are actively working to track down the root cause.  I'll report back on this thread before the end of this week, or sooner if I get actionable information to share.

grcusanz avatar Aug 28 '24 18:08 grcusanz

Hi @grcusanz, are you in a position to better describe the exact nature and scope of the problem as you understand it at this time? For example, is it limited to HNS implementations as some have posited above, or is CNI impacted too?

jwilsonCX avatar Aug 28 '24 18:08 jwilsonCX

Hi everyone, please follow these steps and comment to let me know if it resolves the issue with the July or August update installed.

  1. Open regedit (Registry Editor).
  2. Go to: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\hns\State
  3. Add or update the following value to the State key:

     Name  : FwPerfImprovementChange
     Type  : DWORD
     Value : 0

  4. [required] Reboot or restart the HNS (Host Networking Service) service.
  5. Test

CAUTION! Network connectivity will be lost to all containers on the node during an HNS restart! Container networking should automatically recover. Please report back if you have a different experience.
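For anyone applying this to several nodes, the registry steps above can be sketched in Python. This is an unofficial illustration, not something provided by Microsoft: the function names `render_reg_file` and `apply_workaround` are hypothetical, `apply_workaround` uses the standard-library `winreg` module and must run elevated on the Windows node, and neither function restarts HNS, so the reboot or HNS restart (and the caution above) still applies.

```python
import sys

# Key path and value name are taken from the steps above.
HNS_STATE_KEY = r"SYSTEM\CurrentControlSet\Services\hns\State"
VALUE_NAME = "FwPerfImprovementChange"


def render_reg_file(value: int = 0) -> str:
    """Return .reg file text that sets the DWORD under the HNS State key."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("DWORD value out of range")
    return (
        "Windows Registry Editor Version 5.00\n\n"
        f"[HKEY_LOCAL_MACHINE\\{HNS_STATE_KEY}]\n"
        f'"{VALUE_NAME}"=dword:{value:08x}\n'
    )


def apply_workaround(value: int = 0) -> None:
    """Write the value directly to the registry (Windows only, needs admin)."""
    if sys.platform != "win32":
        raise RuntimeError("apply_workaround must run on the Windows node")
    import winreg  # imported lazily: the module only exists on Windows

    # CreateKeyEx opens the key, creating it if it is missing.
    with winreg.CreateKeyEx(winreg.HKEY_LOCAL_MACHINE, HNS_STATE_KEY, 0,
                            winreg.KEY_SET_VALUE) as key:
        winreg.SetValueEx(key, VALUE_NAME, 0, winreg.REG_DWORD, value)
```

`render_reg_file()` can be saved to a file and imported with regedit, while `apply_workaround()` writes the value in place. Either way, step 4 is still required before testing.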

JamesKehr avatar Aug 30 '24 15:08 JamesKehr

@JamesKehr At the moment it looks like it helped. I will continue testing this weekend and post a follow-up on Monday.

Nova-Logic avatar Aug 30 '24 16:08 Nova-Logic