ChecksumOffloadBroken autodetection doesn't necessarily detect all cases

Open janeczku opened this issue 4 years ago • 15 comments

Expected Behavior

Pod-pod and pod-service communication across nodes should work.

Current Behavior

All traffic between pods across nodes is dropped (with the exception of ICMP).

Possible Solution

VMware recommends either:

  • Change the VXLAN port to 8472 (when NSX is not used) or 4789 (when NSX is used)
  • Disable the VXLAN hardware offload feature on the VMXNET3 NIC (which recent Linux driver versions enable by default)

Since a port change is not feasible for Calico for Windows (which requires 4789), disabling the hardware offload feature is the only feasible solution. Since earlier Linux versions did not even support this feature for that particular NIC, disabling it has no performance impact.

Given that NIC firmware configuration is not something most users are used to managing, I suggest implementing a transparent solution in Calico that disables the offload feature when Calico configures VXLAN on host interfaces backed by a VMXNET3 device. To that effect, it looks like Calico already configures NIC driver settings: https://github.com/projectcalico/felix/blob/master/ethtool/ethtool.go
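As a rough illustration of the proposal above, the following Python sketch checks which kernel driver backs an interface and, for vmxnet3 only, builds the `ethtool -K` command that turns off the VXLAN offload features. It is not Felix's implementation. The function names, the sysfs lookup, and the exact feature list are assumptions made for this example; confirm the feature names with `ethtool -k <iface>` on an affected node.

```python
import os

# Offload features commonly cited for this workaround. The exact list is an
# assumption for this sketch; verify it with `ethtool -k <iface>`.
VXLAN_OFFLOAD_FEATURES = (
    "tx-udp_tnl-segmentation",
    "tx-udp_tnl-csum-segmentation",
)


def driver_name(iface, sysfs_root="/sys/class/net"):
    """Return the kernel driver bound to iface, or None if there is none."""
    link = os.path.join(sysfs_root, iface, "device", "driver")
    try:
        # The symlink target ends with the driver name, e.g. ".../vmxnet3".
        return os.path.basename(os.readlink(link))
    except OSError:
        # Virtual interfaces (lo, vxlan devices) have no backing driver link.
        return None


def offload_disable_command(iface, sysfs_root="/sys/class/net"):
    """Build an `ethtool -K` argv for vmxnet3 interfaces, else return None."""
    if driver_name(iface, sysfs_root) != "vmxnet3":
        return None
    cmd = ["ethtool", "-K", iface]
    for feature in VXLAN_OFFLOAD_FEATURES:
        cmd += [feature, "off"]
    return cmd
```

The returned list can be passed to `subprocess.run` as root. Limiting the change to vmxnet3 leaves offload enabled on NICs where it works.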

Steps to Reproduce (for bugs)

  1. Provision VMs on vSphere version 6.7u2 or later using one of the following operating systems: CentOS/RHEL/Oracle 8.3, SLES 15 SP2/SP3
  2. Install Kubernetes cluster on the nodes
  3. Install Calico with VXLAN overlay following official docs, e.g.:
  • https://docs.projectcalico.org/networking/vxlan-ipip
  • https://docs.projectcalico.org/getting-started/windows-calico/kubernetes/standard

Context

VXLAN packets are dropped in the Linux network stack due to incorrect checksums of inner packets. These incorrect checksums occur when VXLAN hardware offload is enabled on the VMXNET3 interface (which recent Linux versions do by default) and a VXLAN overlay network is created in the guest OS on a port other than 8472 (when NSX is not used) or 4789 (when NSX is used).

References:

  • https://github.com/rancher/rancher/issues/33399
  • https://bugzilla.redhat.com/show_bug.cgi?id=1935539
  • https://bugzilla.redhat.com/show_bug.cgi?id=1941714 (not public)

Your Environment

  • Calico version 3.19.1
  • Orchestrator version: Kubernetes 1.19.12 (RKE)
  • Operating System and version: CentOS/RHEL 8.3, SLES 15 SP2

janeczku avatar Jul 07 '21 12:07 janeczku

VXLAN offload works with many 10G NICs, so disabling it by default will hurt performance for those. Each card can also have different offload toggles: for the qede driver + IPIP, for example, you need to disable all offloads, not just tx-udp_tnl-csum-segmentation.

champtar avatar Jul 07 '21 14:07 champtar

Good point, but the issue at hand is completely limited to vSphere infrastructure, so the fix would/should also only apply to the specific type of NIC used there (VMXNET3). The goal is not to solve all known issues in relation to Calico IPIP or VXLAN but to restore compatibility with what is undoubtedly a very mainstream and widespread infrastructure.

janeczku avatar Jul 07 '21 14:07 janeczku

Thanks @janeczku. So IIUC there is a workaround to disable hardware offloading on those specific NICs that can be done prior to installing Calico for Windows. Perhaps another way is to document this issue and workaround for Calico vSphere users on https://docs.projectcalico.org

cc @song-jiang

lmm avatar Aug 10 '21 16:08 lmm

Is there a good way to detect these NICs? If so, we could arrange for ChecksumOffloadBroken to be set in that case: https://github.com/projectcalico/felix/blob/master/iptables/feature_detect.go#L116

Note: Calico feature detection can be overridden with config by setting an override in the FelixConfiguration resource:

featureDetectOverride: "ChecksumOffloadBroken=true"
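
For anyone generating this value from tooling, here is a small Python sketch that formats the override string. It assumes the field takes a comma-separated list of `Name=true|false` pairs with no spaces, which is my reading of the Felix configuration reference; the helper name is made up for this example.

```python
def format_feature_overrides(overrides):
    """Render {"FeatureName": bool} as a featureDetectOverride string.

    Assumes the comma-separated, no-spaces format; check it against the
    Felix configuration reference for your Calico version.
    """
    return ",".join(
        f"{name}={'true' if enabled else 'false'}"
        for name, enabled in overrides.items()
    )


# The single override discussed in this thread.
value = format_feature_overrides({"ChecksumOffloadBroken": True})
```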

fasaxc avatar Aug 20 '21 12:08 fasaxc

It should either be documented or the workaround should be applied automatically in Felix using the approach described by @fasaxc above.

janeczku avatar Aug 20 '21 12:08 janeczku

Yes, they can be detected by determining NIC model and hw revision via ethtool syscalls

janeczku avatar Aug 20 '21 12:08 janeczku

The bug is actually in the new Linux driver for vmxnet3. So instead of detecting the specific hardware revision (which I am not sure is exposed over ethtool), it would probably be enough to detect that the buggy driver version is in use.

janeczku avatar Aug 20 '21 12:08 janeczku

Sometimes the bug is with the driver + firmware combination; it's endless. The best thing would be to have Calico send packets using raw sockets, receive them on another node, and check whether the checksums are correct, i.e. really test that it's working.
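
The receiving side of such a probe would need to verify checksums itself. As a minimal Python sketch (not anything Calico ships), this is the standard Internet checksum from RFC 1071, applied to an IPv4 header. A real probe for this bug would check the inner TCP/UDP checksum, which also covers a pseudo-header that this sketch leaves out. The sample header bytes are illustrative.

```python
def internet_checksum(data: bytes) -> int:
    """Return the 16-bit one's-complement checksum of data (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # big-endian 16-bit words
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def checksum_is_valid(data: bytes) -> bool:
    """A block that includes its own correct checksum sums to zero."""
    return internet_checksum(data) == 0


# Illustrative 20-byte IPv4 header (UDP, 192.168.0.1 -> 192.168.0.199)
# with its checksum field (bytes 10-11) set to 0xb861.
SAMPLE_HEADER = bytes.fromhex("45000073000040004011b861c0a80001c0a800c7")
```

Calling `checksum_is_valid(SAMPLE_HEADER)` returns `True`; flipping any byte makes it return `False`, which is the signal a cross-node probe would look for.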

champtar avatar Aug 20 '21 13:08 champtar

@fasaxc, et al.,

I have an issue where pods can't communicate with one another across nodes. I've concluded that it's related to this issue.

I was able to verify that on a brand-new k3s cluster install, adding featureDetectOverride: "ChecksumOffloadBroken=true" to the FelixConfiguration fixes the issue, but I'm unable to get an existing install fixed by applying the change. What needs to be done for the change to take effect?

I have calico installed via the tigera operator v1.23.1 (calico v3.21.0) on k3s v1.21.5+k3s2. OS is Ubuntu 20.04.

-robodude666

mistresseve666 avatar Dec 31 '21 20:12 mistresseve666

I'm hitting this issue on Azure (requires VXLAN) Linux version 5.15.0-1014-azure, using Helm to install Calico in VXLAN mode via operator. Unfortunately, the autodetect doesn't work because my kernel version is > 5.7 (even though Ubuntu 20.04 doesn't appear to have the fix).

However, Calico does not allow configuring Felix directly when using the operator: https://projectcalico.docs.tigera.io/reference/felix/configuration

It would be great if we could either:

  1. Improve the ChecksumOffloadBroken detection so that it does not rely on a simple kernel version check (since not all distributions have the fix backported) - this would be my preferred solution
  2. Allow configuring Felix via operator / Helm chart values
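
To make the first point concrete, here is a simplified Python sketch of a version-only check. It is not Felix's actual code; the 5.7 threshold is taken from the comment above, and the function names are hypothetical. It shows why a kernel such as 5.15.0-1014-azure is assumed to be fixed regardless of what the distribution actually ships.

```python
import re

# Kernel version from which the fix is assumed to be present. Taken from
# the discussion in this thread, not from Felix's source.
ASSUMED_FIXED_IN = (5, 7)


def kernel_major_minor(release):
    """Parse "5.15.0-1014-azure" (as printed by `uname -r`) into (5, 15)."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def naive_checksum_offload_broken(release):
    """Version-only heuristic: older than the threshold means broken.

    It cannot see distribution backports or driver-specific bugs, so it
    can be wrong in either direction.
    """
    return kernel_major_minor(release) < ASSUMED_FIXED_IN
```

With this heuristic, `naive_checksum_offload_broken("5.15.0-1014-azure")` is `False`, so no workaround is applied even on a node that still drops packets.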

CecileRobertMichon avatar Aug 25 '22 23:08 CecileRobertMichon

Hm, that's a bummer that the auto-detection isn't working on newer kernels.

If you have installed Calico using the operator, you cannot modify the environment provided to felix directly. To configure felix, see the FelixConfiguration resource instead.

If you're using the operator, you should look at https://projectcalico.docs.tigera.io/reference/resources/felixconfig to use REST API-based configuration instead of environment variables.

You should be able to modify the default FelixConfiguration resource to set:

spec.featureDetectOverride: "ChecksumOffloadBroken=true"
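
One way to apply that to an existing cluster is a merge patch. The Python sketch below only builds the patch payload and the `kubectl` argv. It assumes the resource is named `default` and is reachable as `felixconfiguration` through kubectl, which depends on how the Calico CRDs or API server are installed. The helper names are made up for this example.

```python
import json


def felix_override_patch(override="ChecksumOffloadBroken=true"):
    """Return a JSON merge patch that sets spec.featureDetectOverride."""
    return json.dumps(
        {"spec": {"featureDetectOverride": override}},
        separators=(",", ":"),
    )


def kubectl_patch_command(patch, name="default"):
    """Build the kubectl argv; the resource name is an assumption."""
    return [
        "kubectl", "patch", "felixconfiguration", name,
        "--type", "merge", "--patch", patch,
    ]


command = kubectl_patch_command(felix_override_patch())
```

Passing `command` to `subprocess.run(command, check=True)` would apply it. A merge patch leaves the other fields of the existing FelixConfiguration untouched.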

caseydavenport avatar Aug 26 '22 23:08 caseydavenport

> You should be able to modify the default FelixConfiguration resource

@caseydavenport that's what I'm doing for now and it seems to make the tests happy: https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/1a1fa22e8947ba7805e029a279c85af325c2e32b/templates/addons/calico/felix-override.yaml

Do you know if there is a way to do this directly via the Helm chart though? It'd be easier if I could set the featureDetectOverride in values.yaml instead of having to modify the default FelixConfigurations resource via kubectl apply after the helm install. Maybe I'm missing something?

After doing some research across many GitHub issues on this kernel bug, I found https://github.com/rancher/rke2-charts/blob/main-source/packages/rke2-calico/generated-changes/overlay/templates/felixconfig.yaml. It seems the Rancher folks are doing some sort of overlay to extend the upstream Calico template to allow configuring Felix in values.yaml. Would it be valuable to add something like it directly in the official Calico Helm chart?

Thanks so much for the answer and for all your work on the project btw, I've gone through a lot of Calico issues the past few days and your comments were very helpful!

CecileRobertMichon avatar Aug 26 '22 23:08 CecileRobertMichon

Thanks for the pointer to that overlay file! I didn't realize that.

However, this line . . . Looks like https://github.com/projectcalico/calico/issues/6412 strikes again!

> Would it be valuable to add something like it directly in the official Calico Helm chart?

It definitely would, and were it not for the problems discussed in the above issue I'd probably just do that right now. To be honest I'm tempted to do it anyway since the default FelixConfiguration is a singleton and this would be a nice UX improvement and would actually be abstracted behind helm's values.yaml "API" anyway... I will mull on that :)

> Thanks so much...

You're very welcome! and I really appreciate the kind words :smile_cat:

caseydavenport avatar Aug 27 '22 00:08 caseydavenport

Hey @caseydavenport, have you given this any more thought? Judging from the issue mentions, it looks like others are running into this as well.

CecileRobertMichon avatar Oct 12 '22 01:10 CecileRobertMichon

@fasaxc has a PR which will always disable the offload here: https://github.com/projectcalico/calico/pull/6842

That's probably the best way for now.

caseydavenport avatar Oct 17 '22 17:10 caseydavenport

> Is there a good way to detect these NICs? If so, we could arrange for ChecksumOffloadBroken to be set in that case: https://github.com/projectcalico/felix/blob/master/iptables/feature_detect.go#L116
>
> Note: Calico feature detection can be overridden with config by setting an override in the FelixConfiguration resource:
>
> featureDetectOverride: "ChecksumOffloadBroken=true"

This only works for VXLAN, not for IPIP.

fredkan avatar Sep 21 '23 13:09 fredkan

@fredkan see above, we decided to disable it by default in more recent versions.

fasaxc avatar Sep 25 '23 08:09 fasaxc