calico
ChecksumOffloadBroken autodetection doesn't necessarily detect all cases
Expected Behavior
Pod-pod and pod-service communication across nodes should work.
Current Behavior
All traffic between pods across nodes is dropped (with the exception of ICMP).
Possible Solution
VMware recommends either:
- Changing the VXLAN port to 8472 (when NSX is not used) or 4789 (when NSX is used)
- Disabling the VXLAN hardware offload feature on the VMXNET3 NIC (which recent Linux driver versions enable by default)
Since a port change is not feasible for Calico for Windows (which requires 4789), disabling the hardware offload feature is the only feasible solution. Earlier Linux versions did not even support this feature for that NIC, so disabling it has no performance impact relative to them.
Given that NIC firmware configuration is not something most users are used to managing, I suggest implementing a transparent solution in Calico that disables the offload feature when Calico configures VXLAN on host interfaces backed by a VMXNET3 device. To that end, it looks like Calico already configures NIC driver settings: https://github.com/projectcalico/felix/blob/master/ethtool/ethtool.go
Steps to Reproduce (for bugs)
- Provision VMs on vSphere version 6.7u2 or later using one of the following operating systems: CentOS/RHEL/Oracle 8.3, SLES 15 SP2/SP3
- Install Kubernetes cluster on the nodes
- Install Calico with VXLAN overlay following official docs, e.g.:
- https://docs.projectcalico.org/networking/vxlan-ipip
- https://docs.projectcalico.org/getting-started/windows-calico/kubernetes/standard
Context
VXLAN packets are dropped in the Linux network stack due to incorrect checksums of inner packets. These incorrect checksums occur when VXLAN hardware offload is enabled on the VMXNET3 interface (which recent Linux versions do by default) and a VXLAN overlay network is created in the guest OS on a port other than 8472 (when NSX is not used) or 4789 (when NSX is used).
References:
- https://github.com/rancher/rancher/issues/33399
- https://bugzilla.redhat.com/show_bug.cgi?id=1935539
- https://bugzilla.redhat.com/show_bug.cgi?id=1941714 (not public)
Your Environment
- Calico version 3.19.1
- Orchestrator version: Kubernetes 1.19.12 (RKE)
- Operating System and version: CentOS/RHEL 8.3, SLES 15 SP2
VXLAN offload works with many 10G NICs, so disabling it by default will hurt performance for those. Each card can also have a different offload toggle: for the qede driver + IPIP, for example, you need to disable all offloads, not just tx-udp_tnl-csum-segmentation.
Good point, but the issue at hand is completely limited to vSphere infrastructure, so the fix should also only apply to the specific type of NIC used there (VMXNET3). The goal is not to solve all known issues in relation to Calico IPIP or VXLAN but to restore compatibility with what is undoubtedly a very mainstream and widespread infrastructure.
Thanks @janeczku. So IIUC there is a workaround to disable hardware offloading on those specific NICs that can be done prior to installing Calico for Windows. Perhaps another way is to document this issue and workaround for Calico vSphere users on https://docs.projectcalico.org
cc @song-jiang
Is there a good way to detect these NICs? If so, we could arrange for ChecksumOffloadBroken to be set in that case: https://github.com/projectcalico/felix/blob/master/iptables/feature_detect.go#L116
Note: Calico feature detection can be overridden with config by setting an override in the FelixConfiguration resource:
featureDetectOverride: "ChecksumOffloadBroken=true"
It should either be documented or the workaround should be applied automatically in Felix using the approach described by @fasaxc above.
Yes, they can be detected by determining the NIC model and hardware revision via ethtool syscalls.
The bug is actually in the new Linux driver for vmxnet3. So instead of detecting the specific hardware revision (which I am not sure is exposed over ethtool), it would probably be enough to detect that the interface uses the buggy driver version.
Sometimes the bug is with the driver + firmware combination; it's endless. The best thing would be to have Calico send packets using raw sockets, receive them on another node, and check whether the checksums are correct, i.e. really test that it's working.
@fasaxc, et al.,
I have an issue where pods can't communicate with one another across nodes. I've concluded that it's related to this issue.
I was able to verify that on a brand new k3s cluster install adding featureDetectOverride: "ChecksumOffloadBroken=true" to the FelixConfiguration fixes the issue, but I'm unable to get an existing install fixed by applying the change. What needs to be done for the change to take effect?
I have calico installed via the tigera operator v1.23.1 (calico v3.21.0) on k3s v1.21.5+k3s2. OS is Ubuntu 20.04.
-robodude666
I'm hitting this issue on Azure (requires VXLAN) Linux version 5.15.0-1014-azure, using Helm to install Calico in VXLAN mode via operator. Unfortunately, the autodetect doesn't work because my kernel version is > 5.7 (even though Ubuntu 20.04 doesn't appear to have the fix).
However, Calico does not allow configuring Felix directly when using the operator: https://projectcalico.docs.tigera.io/reference/felix/configuration
It would be great if we could either:
- Improve the ChecksumOffloadBroken detection so it does not rely on a simple kernel version check (since not all distributions have the fix backported) - this would be my preferred solution
- Allow configuring Felix via operator / Helm chart values
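To make the limitation in the first bullet concrete, here is a sketch of a version-only heuristic. It assumes the check treats kernels at or above 5.7 as fixed, based on the report above; `parseKernelRelease` and `assumedFixedByVersion` are hypothetical names and this is not Felix's actual code.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseKernelRelease extracts major and minor from a `uname -r` style string
// such as "5.15.0-1014-azure".
func parseKernelRelease(release string) (int, int, error) {
	parts := strings.SplitN(release, ".", 3)
	if len(parts) < 2 {
		return 0, 0, fmt.Errorf("unexpected kernel release %q", release)
	}
	major, err := strconv.Atoi(parts[0])
	if err != nil {
		return 0, 0, err
	}
	// Drop any non-numeric suffix on the minor component (e.g. "15-rc1").
	minorStr := parts[1]
	if i := strings.IndexFunc(minorStr, func(r rune) bool { return r < '0' || r > '9' }); i >= 0 {
		minorStr = minorStr[:i]
	}
	minor, err := strconv.Atoi(minorStr)
	if err != nil {
		return 0, 0, err
	}
	return major, minor, nil
}

// assumedFixedByVersion applies the version-only heuristic: any kernel at or
// above 5.7 is assumed to have working checksum offload. The threshold is an
// assumption taken from this thread.
func assumedFixedByVersion(release string) (bool, error) {
	major, minor, err := parseKernelRelease(release)
	if err != nil {
		return false, err
	}
	return major > 5 || (major == 5 && minor >= 7), nil
}

func main() {
	for _, r := range []string{"4.18.0-240.el8.x86_64", "5.15.0-1014-azure"} {
		fixed, err := assumedFixedByVersion(r)
		fmt.Printf("%s: assumed fixed=%v err=%v\n", r, fixed, err)
	}
}
```

The heuristic only sees the version number, so it cannot tell whether a distribution kernel actually carries the fix, which is the failure described for the Azure kernel above.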
Hm, that's a bummer that the auto-detection isn't working on newer kernels.
If you have installed Calico using the operator, you cannot modify the environment provided to felix directly. To configure felix, see the FelixConfiguration resource instead.
If you're using the operator, you should look at https://projectcalico.docs.tigera.io/reference/resources/felixconfig to use REST API-based configuration instead of environment variables.
You should be able to modify the default FelixConfiguration resource to set:
spec.featureDetectOverride: "ChecksumOffloadBroken=true"
You should be able to modify the default FelixConfiguration resource
@caseydavenport that's what I'm doing for now and it seems to make the tests happy: https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/1a1fa22e8947ba7805e029a279c85af325c2e32b/templates/addons/calico/felix-override.yaml
Do you know if there is a way to do this directly via the Helm chart though? It'd be easier if I could set the featureDetectOverride in values.yaml instead of having to modify the default FelixConfiguration resource via kubectl apply after the helm install. Maybe I'm missing something?
After doing some research across many GitHub issues on this kernel bug I found https://github.com/rancher/rke2-charts/blob/main-source/packages/rke2-calico/generated-changes/overlay/templates/felixconfig.yaml. It seems the Rancher folks use an overlay to extend the upstream Calico template so that Felix can be configured in values.yaml. Would it be valuable to add something like it directly in the official Calico Helm chart?
Thanks so much for the answer and for all your work on the project btw, I've gone through a lot of Calico issues the past few days and your comments were very helpful!
Thanks for the pointer to that overlay file! I didn't realize that.
However, this line . . . Looks like https://github.com/projectcalico/calico/issues/6412 strikes again!
Would it be valuable to add something like it directly in the official Calico Helm chart?
It definitely would, and were it not for the problems discussed in the above issue I'd probably just do that right now. To be honest I'm tempted to do it anyway since the default FelixConfiguration is a singleton and this would be a nice UX improvement and would actually be abstracted behind helm's values.yaml "API" anyway... I will mull on that :)
Thanks so much...
You're very welcome! and I really appreciate the kind words :smile_cat:
Hey @caseydavenport have you given this any more thought? Looks like others are running into this as well from issue mentions
@fasaxc has a PR which will always disable the offload here: https://github.com/projectcalico/calico/pull/6842
That's probably the best way for now.
Is there a good way to detect these NICs? If so, we could arrange for ChecksumOffloadBroken to be set in that case: https://github.com/projectcalico/felix/blob/master/iptables/feature_detect.go#L116
Note: Calico feature detection can be overridden with config by setting an override in the FelixConfiguration resource:
featureDetectOverride: "ChecksumOffloadBroken=true"
This only works for VXLAN, not for IPIP.
@fredkan see above, we decided to disable it by default in more recent versions.