Support IP fragmentation in eBPF

Open nick-oconnor opened this issue 9 months ago • 10 comments

Expected Behavior

UDP packet fragments destined for a pod's IP which are not denied by policy arrive on the pod's interface.

Current Behavior

The eBPF data plane appears to be dropping UDP packet fragments by policy. The initial fragment is correctly forwarded from the node interface to the pod interface, but subsequent fragments do not appear on the pod's interface. When a UDP packet fragment is dropped, calico's dropped by policy counter for the interface is incremented. The pod interface eventually responds with "fragment reassembly time exceeded".

The only policies I have defined are k8s network policies. This problem does not occur when using the IPTables data plane.

Possible Solution

No idea. There may be a bug in calico's eBPF policy code.

Steps to Reproduce (for bugs)

  1. Enable the eBPF data plane (kube-proxy not running, with or without DSR)
    • BGP w/ no encapsulation + dual stack (I'm unsure if this is relevant, packet captures were all IPv4)
  2. Deploy a pod
  3. Start a packet capture on the node running the pod
  4. Send a fragmented UDP packet to the pod IP (I'm unsure how to replicate this outside of SNMP)
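For step 4, one possible way to produce fragments without SNMP is to send a UDP datagram larger than the path MTU and let the sending host's kernel fragment it. The sketch below is an assumption and was not part of the original report: the function names, the 4000-byte payload, and the 1500-byte MTU are illustrative, and it relies on the sender's default behaviour of fragmenting oversized UDP datagrams locally (as Linux normally does).

```python
import math
import socket

IPV4_HEADER_LEN = 20  # bytes, assuming no IP options
UDP_HEADER_LEN = 8    # bytes


def expected_fragments(payload_len: int, mtu: int = 1500) -> int:
    """Estimate how many IPv4 fragments a UDP payload of payload_len bytes needs."""
    # Every non-final fragment carries a multiple of 8 bytes of IP payload.
    per_fragment = (mtu - IPV4_HEADER_LEN) // 8 * 8
    return math.ceil((payload_len + UDP_HEADER_LEN) / per_fragment)


def send_oversized_datagram(host: str, port: int, payload_len: int = 4000) -> int:
    """Send one UDP datagram; the kernel fragments it if it exceeds the path MTU."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(b"x" * payload_len, (host, port))


# Example (hypothetical pod IP and port):
#   send_oversized_datagram("10.0.0.5", 9999)
```

With a 1500-byte MTU, a 4000-byte payload should leave the sender as three fragments, so the capture from step 3 should show whether the second and third fragments reach the pod interface.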

Context

I experienced this behavior after migrating from the IPTables data plane to the eBPF data plane. All SNMP responses exceeding the network's MTU caused my SNMP collector to time out. I used captures from various points to determine where the packets were being dropped.

Your Environment

  • Calico version: v3.28.0
  • Orchestrator version (e.g. kubernetes, mesos, rkt): kubernetes 1.29.4
  • Operating System and version: Ubuntu 24.04 (6.8.0-31 kernel)
  • Relevant Calico config: BGP w/ no encapsulation, dual stack (added in v3.28.0)

nick-oconnor avatar May 14 '24 21:05 nick-oconnor

That is a correct observation. Unfortunately, the eBPF dataplane does not support IP fragmentation, as only the first fragment contains the UDP ports. Subsequent fragments cannot be reliably matched to the ongoing flow. We cannot easily reassemble the fragments in eBPF (that is a limitation of the technology). That said, we might consider some improvements/workarounds in a future release.
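To illustrate why only the first fragment can be matched on ports, here is a minimal sketch that reads the fragmentation fields of an IPv4 header. It is not Calico's code; the helper names `build_ipv4_header` and `fragment_info` are hypothetical, and the packets are hand-built for the example.

```python
import struct

UDP_PROTO = 17


def build_ipv4_header(ident: int, flags_frag: int, payload_len: int) -> bytes:
    """Build a 20-byte IPv4 header (checksum left at 0; illustration only)."""
    return struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0, 20 + payload_len,   # version/IHL, TOS, total length
        ident, flags_frag,           # identification, flags + fragment offset
        64, UDP_PROTO, 0,            # TTL, protocol, checksum
        bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2]),
    )


def fragment_info(packet: bytes) -> dict:
    """Extract fragment fields, plus UDP ports when they are present."""
    ihl = (packet[0] & 0x0F) * 4
    ident, flags_frag = struct.unpack_from("!HH", packet, 4)
    offset = (flags_frag & 0x1FFF) * 8  # offset is stored in 8-byte units
    info = {
        "id": ident,
        "more_fragments": bool(flags_frag & 0x2000),
        "offset": offset,
        "ports": None,
    }
    # The UDP header sits at the start of the IP payload, so only the
    # fragment with offset 0 carries the source and destination ports.
    if packet[9] == UDP_PROTO and offset == 0 and len(packet) >= ihl + 4:
        info["ports"] = struct.unpack_from("!HH", packet, ihl)
    return info


udp_header = struct.pack("!HHHH", 161, 40000, 8 + 2000, 0)
first = build_ipv4_header(0x1234, 0x2000, 1480) + udp_header + b"x" * 1472
second = build_ipv4_header(0x1234, 1480 // 8, 528) + b"x" * 528
```

In this sketch `fragment_info(first)` yields the ports, while `fragment_info(second)` has only the shared IP identification and an offset. Relating the second fragment to the flow therefore requires remembering state from the first one.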

tomastigera avatar May 14 '24 21:05 tomastigera

@tomastigera Wow, thanks for the quick reply! Very interesting. Looks like I have some homework regarding eBPF APIs. Adding this to the eBPF docs for Calico would probably save folks some time.

nick-oconnor avatar May 14 '24 21:05 nick-oconnor

Related: https://github.com/cilium/cilium/issues/25709#issuecomment-2105977944

nick-oconnor avatar May 14 '24 22:05 nick-oconnor

Thanks for the pointer. The problem with kfuncs is that they are only in "newer" kernels and are not necessarily a stable API. But we could perhaps add it for kernels that have that feature! :+1:

tomastigera avatar May 14 '24 22:05 tomastigera

Seems like the patch :arrow_up: is not present in any released kernel :(

tomastigera avatar May 22 '24 20:05 tomastigera

I'm also facing this issue. In my case, I noticed that the error only happens when the target is a service IP; if I test from a pod to a pod IP, it works. Would that make sense?

diogenxs avatar Sep 11 '24 22:09 diogenxs

@diogenxs do you have a different MTU on the pod-pod path than on the "default" route? That route is probably what decides the (larger) MTU for the service path. Do you use overlay (vxlan)? What is the MTU on your devices?

tomastigera avatar Sep 12 '24 15:09 tomastigera

do you have a different MTU on the pod-pod path than on the "default" route

No, both routes, pod and service, go over the same path.

Do you use overlay (vxlan)?

No, encapsulation: None is set on the tigera-operator Installation.

What is the MTU on your devices?

1472. I tried to force Calico to use a lower one but didn't have any luck.

I'm advertising the service IPs through BGP, same as pods. One thing I noticed is that some services have a "static" entry at the node level. What is the desired state here? Does every service IP need to be a static entry as well?

# ip r  | grep 10.226.2
10.226.2.10 via 169.254.1.1 dev bpfin.cali 
10.226.2.20 via 169.254.1.1 dev bpfin.cali 

diogenxs avatar Oct 21 '24 17:10 diogenxs

No, just UDP services when BPFConnectTimeLoadBalancing is set to TCP, and all services if it is set to Disabled (iirc). We had a fix for MTU issues with bpfin.cali in 3.28.1, so it might be worth upgrading if you have not done so yet. Are only the services with the static entries affected? https://github.com/projectcalico/calico/pull/8922

tomastigera avatar Oct 21 '24 21:10 tomastigera

just UDP services when BPFConnectTimeLoadBalancing is set to TCP

This is true: all static entries are services running over UDP, basically DNS servers.

Are only the services with the static entries affected?

Yes, only UDP, hence the static entries.

I'll upgrade and test again. I'm currently running 3.27.4 :face_with_peeking_eye:

diogenxs avatar Oct 21 '24 21:10 diogenxs