Document vSphere caveats

Open andrewrynhard opened this issue 4 years ago • 7 comments

  • vSphere 7.x or later is required.
  • VMXNET driver is broken.
  • open-vm-tools or https://github.com/mologie/talos-vmtoolsd is required.

andrewrynhard avatar Feb 12 '21 14:02 andrewrynhard

@alex1989hu @mologie Am I missing anything?

andrewrynhard avatar Feb 12 '21 15:02 andrewrynhard

Oh, your mail reminds me I wanted to file this one. Sorry about that, and thanks for the heads-up!

  • Not sure if vSphere 7.X is really a requirement, I recall @alex1989hu mentioned he'd previously run Talos under 6.X
  • vSphere 7.X contains the vmxnet bug
  • The vmxnet bug affects only VXLAN-based CNIs, including the default Flannel. Calico in IPIP mode works fine.
  • The workaround is to use E1000 instead of vmxnet3.
  • open-vm-tools / talos-vmtoolsd is required only for vSphere CPI+CSI support and clean shutdown; Talos itself works out of the box

mologie avatar Feb 12 '21 15:02 mologie

Thanks @mologie !

andrewrynhard avatar Feb 12 '21 15:02 andrewrynhard

Several conditions must be met, not just VMXNET3 and vSphere 7.x. I will try to summarize those conditions later.

alex1989hu avatar Feb 12 '21 15:02 alex1989hu

Having the status/caveats at, e.g., https://www.talos.dev/docs/v0.12/virtualized-platforms/vmware/ would be great.

Also showing how to integrate with https://github.com/kubernetes-sigs/vsphere-csi-driver would be pretty awesome.

rgl avatar Aug 26 '21 07:08 rgl

The lack of network connectivity from pods affected my setup on vSphere 7.0u2 with ESXi 7.0u1 hosts. Everything worked fine until release v0.2 of talos-vmtoolsd (https://github.com/mologie/talos-vmtoolsd) was deployed on a Talos v0.13.0 cluster. Switching the network adapters to E1000 fixed it.

luqelinux avatar Nov 05 '21 18:11 luqelinux

As someone who ran into this problem recently, I have to admit I agree with the sentiment here. I didn't even see this thread until after finding my own workaround, because these issues aren't clear in the documentation. For those who are curious, you can make VXLAN work in vSphere without moving away from VMXNET3 interfaces. The workaround below might lose VXLAN offloading support, though; I'm not sure how to verify that.

My experience, on VMware ESXi 7.0.3 (build 20328353), was that VXLAN packets going between hosts were not being routed at all. Communication within a single node was fine, but anything attempting to cross nodes would just time out. All of my ESXi host network layers and Kubernetes hosts are in the same subnet and VLAN, so I could immediately rule out those types of issues. That left me a little stumped: I could ping between hosts, but any TCP traffic would just drop.

Once I realized that ESXi was trying to manage VXLAN traffic offloading, I took a shot in the dark that worked out as a good solution. I just changed the flannel configuration to move VXLAN traffic onto a different port. All my problems with VXLAN routing disappeared and things seem to be working fine now.

WORKAROUND:

kubectl edit configmap/kube-flannel-cfg -n kube-system
# In data -> net-conf.json -> Backend, change Port to a non-standard port,
# e.g. "Port": 4799 (the default is 4789).
kubectl rollout restart daemonset/kube-flannel -n kube-system

The only caveat here is that running talosctl upgrade-k8s will revert this configuration. I have yet to find a way to customize the bootstrap manifest for flannel in this regard.
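For anyone who would rather script the edit than do it by hand, here is a minimal Python sketch of the same change to net-conf.json. The function name `set_vxlan_port` and the `SAMPLE` document are my own illustrations, not anything Talos or flannel ships; the sample assumes a net-conf.json with a `Backend` object of type `vxlan` on port 4789, so check it against what your ConfigMap actually contains. It only transforms the JSON text; writing the result back to the ConfigMap and restarting the daemonset still has to be done as in the workaround above.

```python
import json


def set_vxlan_port(net_conf: str, port: int) -> str:
    """Return net-conf.json text with Backend.Port set to `port`.

    Hypothetical helper: it only rewrites the JSON text and does not
    talk to the cluster.
    """
    if not 1 <= port <= 65535:
        raise ValueError(f"invalid UDP port: {port}")
    conf = json.loads(net_conf)
    # Create the Backend object if it is missing, then override the port.
    backend = conf.setdefault("Backend", {})
    backend["Port"] = port
    return json.dumps(conf, indent=2)


# Assumed sample content; the real net-conf.json in your cluster may differ.
SAMPLE = '{"Network": "10.244.0.0/16", "Backend": {"Type": "vxlan", "Port": 4789}}'

if __name__ == "__main__":
    print(set_vxlan_port(SAMPLE, 4799))
```

Since talosctl upgrade-k8s reverts the ConfigMap, something like this would have to be re-applied after every upgrade.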

LONG TERM SOLUTION: Maybe the Talos devs could add cluster config options for customizing net-conf.json? That would also allow better support for flannel backend options: flannel supports backends such as host-gw and wireguard as alternatives to VXLAN. https://github.com/flannel-io/flannel/blob/master/Documentation/backends.md
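To illustrate what such an option would amount to, here is a rough Python sketch that swaps the backend type in a net-conf.json document. Everything in it is hypothetical: `set_backend_type`, `KNOWN_BACKENDS`, and `SAMPLE_CONF` are names I made up, the backend list is just the ones mentioned here, and the sample document is an assumption about what the ConfigMap holds. It is not a Talos feature, only a picture of the resulting config.

```python
import json

# Backend types named in this thread; flannel's backends.md lists more.
KNOWN_BACKENDS = {"vxlan", "host-gw", "wireguard"}


def set_backend_type(net_conf: str, backend_type: str) -> str:
    """Return net-conf.json text using `backend_type` as the flannel backend.

    Hypothetical helper: the Backend object is replaced wholesale, so
    VXLAN-only options such as Port are dropped when switching away.
    """
    if backend_type not in KNOWN_BACKENDS:
        raise ValueError(f"unknown backend type: {backend_type}")
    conf = json.loads(net_conf)
    conf["Backend"] = {"Type": backend_type}
    return json.dumps(conf, indent=2)


# Assumed sample content; the real net-conf.json in your cluster may differ.
SAMPLE_CONF = '{"Network": "10.244.0.0/16", "Backend": {"Type": "vxlan", "Port": 4789}}'

if __name__ == "__main__":
    print(set_backend_type(SAMPLE_CONF, "host-gw"))
```

Each backend has its own requirements and options, so the flannel backends document linked above is the place to check before switching.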

I do realize that one option is to disable Talos management of flannel, and implement your own custom CNI. However, the Talos implementation is already fairly well configured, and just exposing a few additional options could provide some needed flexibility.

CompPhy avatar Mar 14 '24 13:03 CompPhy

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Sep 11 '24 01:09 github-actions[bot]

This issue was closed because it has been stalled for 7 days with no activity.

github-actions[bot] avatar Sep 16 '24 02:09 github-actions[bot]