calico
VM access was blocked when eBPF dataplane used
When I enabled the Calico eBPF dataplane for a K8s cluster, the VMs (whose NICs were bridged onto the physical NIC of the server) on the node configured with the eBPF dataplane could no longer be reached over normal SSH. When kube-proxy was restored and the eBPF dataplane disabled, SSH access to the VMs was also restored.
Expected Behavior
Current Behavior
Possible Solution
Steps to Reproduce (for bugs)
The following script was used to enable the eBPF dataplane:

#!/bin/bash
set -x

WORKDIR=$(pwd)
TMP_DIR=$(mktemp -d)
MARCH=$(uname -m)
CALICO_VERSION=${1:-3.23.2}

if [ $MARCH == "aarch64" ]; then ARCH=arm64; elif [ $MARCH == "x86_64" ]; then ARCH=amd64; else ARCH="unknown"; fi
echo ARCH=$ARCH

k8s_ep=$(kubectl get endpoints kubernetes -o wide | grep kubernetes | cut -d " " -f 4)
k8s_host=$(echo $k8s_ep | cut -d ":" -f 1)
k8s_port=$(echo $k8s_ep | cut -d ":" -f 2)

cat <<EOF > ${WORKDIR}/k8s_service.yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: kubernetes-services-endpoint
  namespace: kube-system
data:
  KUBERNETES_SERVICE_HOST: "KUBERNETES_SERVICE_HOST"
  KUBERNETES_SERVICE_PORT: "KUBERNETES_SERVICE_PORT"
EOF
sed -i "s/KUBERNETES_SERVICE_HOST/${k8s_host}/" ${WORKDIR}/k8s_service.yaml
sed -i "s/KUBERNETES_SERVICE_PORT/${k8s_port}/" ${WORKDIR}/k8s_service.yaml
kubectl apply -f ${WORKDIR}/k8s_service.yaml

echo "Disable kube-proxy:"
kubectl patch ds -n kube-system kube-proxy -p '{"spec":{"template":{"spec":{"nodeSelector":{"non-calico": "true"}}}}}'

if [ ! -f /usr/local/bin/calicoctl ]; then
  echo "No calicoctl, install now:"
  curl -L https://github.com/projectcalico/calico/releases/download/v${CALICO_VERSION}/calicoctl-linux-${ARCH} -o ${WORKDIR}/calicoctl
  chmod +x ${WORKDIR}/calicoctl
  sudo cp ${WORKDIR}/calicoctl /usr/local/bin
  rm ${WORKDIR}/calicoctl
fi

echo "Enable eBPF:"
calicoctl patch felixconfiguration default --patch='{"spec": {"bpfEnabled": true}}' --allow-version-mismatch

echo "Enable Direct Server Return (DSR) mode: optional"
#calicoctl patch felixconfiguration default --patch='{"spec": {"bpfExternalServiceMode": "DSR"}}'
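After running the script, one way to sanity-check that the switch took effect (a sketch; exact output varies by Calico version and install method):

```shell
# Confirm bpfEnabled was set in the Felix configuration
calicoctl get felixconfiguration default -o yaml --allow-version-mismatch | grep bpfEnabled

# kube-proxy pods should be gone, since the "non-calico" nodeSelector matches no nodes
kubectl get pods -n kube-system -l k8s-app=kube-proxy
```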
Context
I tried to access the VM (10.169.210.139), located on a server with Calico eBPF enabled, from another server (10.169.242.130); only the first ping packet was received, and subsequent ping packets were lost.
The conntrack dump on the Calico node showed the SSH connection (from 10.169.242.130) to the VM (10.169.210.139):
# calico-node -bpf conntrack dump |grep "10.169.210.139"
2022-07-15 08:21:37.276 [INFO][13703] confd/maps.go 433: Loaded map file descriptor. fd=0x7 name="/sys/fs/bpf/tc/globals/cali_v4_ct2"
ConntrackKey{proto=6 10.169.242.130:61701 <-> 10.169.210.139:22} -> Entry{Type:0, Created:17278773931441431, LastSeen:17278777015499210, Flags:
Your Environment
- Calico version: v3.23.2
- Orchestrator version (e.g. kubernetes, mesos, rkt): K8s 1.22.1
- Operating System and version: Ubuntu 20.04 focal Linux kernel 5.10.0
- Link to your project (optional):
CC @tomastigera
I first hit this issue on an arm64 platform, but it does not seem to occur on some other platforms or systems, e.g. some x86 systems. I enabled the eBPF debug output by setting bpfLogLevel to Debug and compared the logs for the two kinds of systems carefully; the output shows the difference between the two cases.
- For the arm64 platform:
<idle>-0 [088] d.s. 1810775.267212: bpf_trace_printk: enp9s0---I: New packet at ifindex=2; mark=0
<idle>-0 [088] d.s. 1810775.267213: bpf_trace_printk: enp9s0---I: No metadata is shared by XDP
<idle>-0 [088] d.s. 1810775.267215: bpf_trace_printk: enp9s0---I: IP id=13695 s=aa9d0e5 d=aa9d287
<idle>-0 [088] d.s. 1810775.267217: bpf_trace_printk: enp9s0---I: ICMP; type=8 code=0
<idle>-0 [088] d.s. 1810775.267218: bpf_trace_printk: enp9s0---I: CT-1 lookup from aa9d0e5:0
<idle>-0 [088] d.s. 1810775.267219: bpf_trace_printk: enp9s0---I: CT-1 lookup to aa9d287:0
<idle>-0 [088] d.s. 1810775.267221: bpf_trace_printk: enp9s0---I: CT-1 Hit! NORMAL entry.
<idle>-0 [088] d.s. 1810775.267222: bpf_trace_printk: enp9s0---I: CT-1 result: 0x2003
<idle>-0 [088] d.s. 1810775.267223: bpf_trace_printk: enp9s0---I: conntrack entry flags 0x100
<idle>-0 [088] d.s. 1810775.267223: bpf_trace_printk: enp9s0---I: CT Hit
<idle>-0 [088] d.s. 1810775.267224: bpf_trace_printk: enp9s0---I: Entering calico_tc_skb_accepted_entrypoint
<idle>-0 [088] d.s. 1810775.267226: bpf_trace_printk: enp9s0---I: IP id=13695 s=aa9d0e5 d=aa9d287
<idle>-0 [088] d.s. 1810775.267226: bpf_trace_printk: enp9s0---I: Entering calico_tc_skb_accepted
<idle>-0 [088] d.s. 1810775.267227: bpf_trace_printk: enp9s0---I: src=aa9d0e5 dst=aa9d287
<idle>-0 [088] d.s. 1810775.267228: bpf_trace_printk: enp9s0---I: post_nat=0:0
<idle>-0 [088] d.s. 1810775.267228: bpf_trace_printk: enp9s0---I: tun_ip=0
<idle>-0 [088] d.s. 1810775.267229: bpf_trace_printk: enp9s0---I: pol_rc=1
<idle>-0 [088] d.s. 1810775.267230: bpf_trace_printk: enp9s0---I: sport=0
<idle>-0 [088] d.s. 1810775.267230: bpf_trace_printk: enp9s0---I: flags=20
<idle>-0 [088] d.s. 1810775.267231: bpf_trace_printk: enp9s0---I: ct_rc=3
<idle>-0 [088] d.s. 1810775.267231: bpf_trace_printk: enp9s0---I: ct_related=0
<idle>-0 [088] d.s. 1810775.267232: bpf_trace_printk: enp9s0---I: mark=0x1000000
<idle>-0 [088] d.s. 1810775.267233: bpf_trace_printk: enp9s0---I: ip->ttl 64
<idle>-0 [088] d.s. 1810775.267234: bpf_trace_printk: enp9s0---I: marking enp9_SKB_MARK_BYPASS
<idle>-0 [088] d.s. 1810775.267235: bpf_trace_printk: enp9s0---I: IP id=13695 s=aa9d0e5 d=aa9d287
<idle>-0 [088] d.s. 1810775.267235: bpf_trace_printk: enp9s0---I: FIB family=2
<idle>-0 [088] d.s. 1810775.267236: bpf_trace_printk: enp9s0---I: FIB tot_len=0
<idle>-0 [088] d.s. 1810775.267237: bpf_trace_printk: enp9s0---I: FIB ifindex=2
<idle>-0 [088] d.s. 1810775.267237: bpf_trace_printk: enp9s0---I: FIB l4_protocol=1
<idle>-0 [088] d.s. 1810775.267238: bpf_trace_printk: enp9s0---I: FIB sport=0
<idle>-0 [088] d.s. 1810775.267238: bpf_trace_printk: enp9s0---I: FIB dport=0
<idle>-0 [088] d.s. 1810775.267239: bpf_trace_printk: enp9s0---I: FIB ipv4_src=aa9d0e5
<idle>-0 [088] d.s. 1810775.267240: bpf_trace_printk: enp9s0---I: FIB ipv4_dst=aa9d287
<idle>-0 [088] d.s. 1810775.267240: bpf_trace_printk: enp9s0---I: Traffic is towards the host namespace, doing Linux FIB lookup
<idle>-0 [088] d.s. 1810775.267243: bpf_trace_printk: enp9s0---I: FIB lookup succeeded - with neigh
<idle>-0 [088] d.s. 1810775.267244: bpf_trace_printk: enp9s0---I: Got Linux FIB hit, redirecting to iface 2.
<idle>-0 [088] d.s. 1810775.267245: bpf_trace_printk: enp9s0---I: Traffic is towards host namespace, marking with 0x3000000.
<idle>-0 [088] d.s. 1810775.267247: bpf_trace_printk: enp9s0---I: Final result=ALLOW (0). Program execution time: 31307ns
<idle>-0 [088] d.s. 1810775.267249: bpf_trace_printk: enp9s0---E: New packet at ifindex=2; mark=3000000
<idle>-0 [088] d.s. 1810775.267250: bpf_trace_printk: enp9s0---E: Final result=ALLOW (3). Bypass mark bit set.
- For other systems (x86 currently), the log showed:
<idle>-0 [014] ..s. 17619198.981271: 0: eno1np0--I: New packet at ifindex=2; mark=0
<idle>-0 [014] ..s. 17619198.981271: 0: eno1np0--I: No metadata is shared by XDP
<idle>-0 [014] ..s. 17619198.981272: 0: eno1np0--I: IP id=53367 s=aa9d0e5 d=aa9d27f
<idle>-0 [014] ..s. 17619198.981273: 0: eno1np0--I: ICMP; type=8 code=0
<idle>-0 [014] ..s. 17619198.981273: 0: eno1np0--I: CT-1 lookup from aa9d0e5:0
<idle>-0 [014] ..s. 17619198.981274: 0: eno1np0--I: CT-1 lookup to aa9d27f:0
<idle>-0 [014] ..s. 17619198.981275: 0: eno1np0--I: CT-1 Hit! NORMAL entry.
<idle>-0 [014] ..s. 17619198.981275: 0: eno1np0--I: CT-1 result: 0x2
<idle>-0 [014] ..s. 17619198.981276: 0: eno1np0--I: conntrack entry flags 0x100
<idle>-0 [014] ..s. 17619198.981276: 0: eno1np0--I: CT Hit
<idle>-0 [014] ..s. 17619198.981277: 0: eno1np0--I: Entering calico_tc_skb_accepted_entrypoint
<idle>-0 [014] ..s. 17619198.981277: 0: eno1np0--I: IP id=53367 s=aa9d0e5 d=aa9d27f
<idle>-0 [014] ..s. 17619198.981278: 0: eno1np0--I: Entering calico_tc_skb_accepted
<idle>-0 [014] ..s. 17619198.981278: 0: eno1np0--I: src=aa9d0e5 dst=aa9d27f
<idle>-0 [014] ..s. 17619198.981279: 0: eno1np0--I: post_nat=0:0
<idle>-0 [014] ..s. 17619198.981279: 0: eno1np0--I: tun_ip=0
<idle>-0 [014] ..s. 17619198.981279: 0: eno1np0--I: pol_rc=1
<idle>-0 [014] ..s. 17619198.981280: 0: eno1np0--I: sport=0
<idle>-0 [014] ..s. 17619198.981280: 0: eno1np0--I: flags=20
<idle>-0 [014] ..s. 17619198.981280: 0: eno1np0--I: ct_rc=2
<idle>-0 [014] ..s. 17619198.981281: 0: eno1np0--I: ct_related=0
<idle>-0 [014] ..s. 17619198.981281: 0: eno1np0--I: mark=0x1000000
<idle>-0 [014] ..s. 17619198.981281: 0: eno1np0--I: ip->ttl 64
<idle>-0 [014] ..s. 17619198.981282: 0: eno1np0--I: IP id=53367 s=aa9d0e5 d=aa9d27f
<idle>-0 [014] ..s. 17619198.981283: 0: eno1np0--I: FIB family=2
<idle>-0 [014] ..s. 17619198.981283: 0: eno1np0--I: FIB tot_len=0
<idle>-0 [014] ..s. 17619198.981283: 0: eno1np0--I: FIB ifindex=2
<idle>-0 [014] ..s. 17619198.981283: 0: eno1np0--I: FIB l4_protocol=1
<idle>-0 [014] ..s. 17619198.981284: 0: eno1np0--I: FIB sport=0
<idle>-0 [014] ..s. 17619198.981284: 0: eno1np0--I: FIB dport=0
<idle>-0 [014] ..s. 17619198.981284: 0: eno1np0--I: FIB ipv4_src=aa9d0e5
<idle>-0 [014] ..s. 17619198.981284: 0: eno1np0--I: FIB ipv4_dst=aa9d27f
<idle>-0 [014] ..s. 17619198.981285: 0: eno1np0--I: Traffic is towards the host namespace, doing Linux FIB lookup
<idle>-0 [014] ..s. 17619198.981287: 0: eno1np0--I: FIB lookup failed (FIB problem): 7.
<idle>-0 [014] ..s. 17619198.981287: 0: eno1np0--I: Traffic is towards host namespace, marking with 0x1000000.
<idle>-0 [014] ..s. 17619198.981288: 0: eno1np0--I: Final result=ALLOW (0). Program execution time: 16040ns
vhost-3084463-3084499 [008] .... 17619198.981418: 0: eno1np0--E: New packet at ifindex=2; mark=0
vhost-3084463-3084499 [008] .... 17619198.981419: 0: eno1np0--E: IP id=42046 s=aa9d27f d=aa9d0e5
The test process is the same on both systems: we simply ping a VM on a host with the Calico eBPF dataplane enabled from another host. On the arm64 platform, the ping packets cannot reach the VM because they are falsely forwarded by the eBPF program (the forward_or_drop function). The difference lies in the result of the FIB lookup: on the x86 platform, the FIB lookup fails with code 7 and the packet is marked with 0x1000000; on the arm64 platform, the FIB lookup succeeds with a neighbour entry, the packet is marked with 0x3000000, and it re-appears on the egress direction of the same interface.
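As an aside for anyone reading the traces: the debug log prints IPv4 addresses as bare hex (e.g. s=aa9d0e5, d=aa9d287). A small helper of my own (not part of any Calico tooling) to decode them into dotted-quad form:

```shell
# Decode a hex IPv4 address from the Calico eBPF debug log into dotted-quad.
hexip() {
  local n=$((0x$1))
  printf '%d.%d.%d.%d\n' \
    $(( (n >> 24) & 255 )) $(( (n >> 16) & 255 )) \
    $(( (n >> 8) & 255 ))  $((  n        & 255 ))
}

hexip aa9d0e5   # -> 10.169.208.229
hexip aa9d287   # -> 10.169.210.135
```

This makes it easy to match the trace entries against the real hosts and VMs involved.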
I think a packet destined for a VM rather than the host itself should first be checked against the eBPF route map to see whether it is actually for this host. If the route lookup result is unknown, the packet should be treated as NOT destined for this host, and TC_ACT_OK should be returned to skip the subsequent eBPF processing here.
I saw similar handling of irrelevant traffic in the Cilium eBPF implementation:
ep = lookup_ip4_endpoint(ip4); (https://github.com/cilium/cilium/blob/master/bpf/bpf_host.c#L571)
and
if (!from_host) return CTX_ACT_OK; (https://github.com/cilium/cilium/blob/master/bpf/bpf_host.c#L586)
Here Cilium's endpoint lookup plays a role similar to Calico's eBPF route map.
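For debugging, the eBPF route map that such a check would consult can be dumped on the node. The commands below are a sketch on my side: the subcommand is assumed by analogy with the `calico-node -bpf conntrack dump` shown earlier in this issue, and the pod name and namespace are placeholders for your install:

```shell
# Assumed debugging session; adjust the pod name/namespace to your cluster.
# Dump Calico's eBPF route map from inside a calico-node pod and check
# whether the VM address has any entry at all:
kubectl exec -n kube-system calico-node-xxxxx -- \
  calico-node -bpf routes dump | grep 10.169.210.139 \
  || echo "no route entry: treat as not destined for this host"
```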
I will put up a PR to address this issue and thanks for your review.
Calico versions used: v3.23.2, v3.24.1, and v3.25.0-0.dev.
@tomastigera @mazdakn could you guys please take a look?
@TrevorTaoARM sorry for not responding sooner, totally missed this, :eyes: now! And thanks for a great analysis! :pray:
@TrevorTaoARM I commented at your patch :arrow_up:
> The difference lies in the result of the FIB lookup: on the x86 platform, the FIB lookup fails with code 7 and the packet is marked with 0x1000000; on the arm64 platform, the FIB lookup succeeds with a neighbour entry, the packet is marked with 0x3000000, and it re-appears on the egress direction of the same interface.

It seems like the packets ultimately ended up on the egress of the same device regardless of whether the FIB failed or not. But I am not quite sure what the packet looks like in the ARM case, as that is missing from the logs when the BYPASS mark is set. Perhaps the host mangled that packet?
@tomastigera Yes, the difference in FIB lookup results between the two platforms really confused me. But it looks like the packet flow for a given VM is only blocked when eBPF is enabled. I don't know what the subsequent data path for the packet is once the BYPASS mark is set. The only trace I saw was:
<idle>-0 [088] d.s. 1810775.267249: bpf_trace_printk: enp9s0---E: New packet at ifindex=2; mark=3000000
which showed the packet had been transferred to the egress direction, while on x86 the packet was still in the ingress direction, as in the x86 log above.
@tomastigera Fixed but not complete in version v3.25.0-0.dev-490-g3b818a2f1494.
Setup: eth0 (without IP) ---- bond0 (10.208.201.15/24) ---- app (port 2200)
-0 [005] dNs3. 220.318140: bpf_trace_printk: eth0-----I: New packet at ifindex=2; mark=0
-0 [005] dNs3. 220.318151: bpf_trace_printk: eth0-----I: No metadata is shared by XDP
-0 [005] dNs3. 220.318152: bpf_trace_printk: eth0-----I: IP id=0 s=a97d428 d=ad0c90f
-0 [005] dNs3. 220.318153: bpf_trace_printk: eth0-----I: TCP; ports: s=50634 d=2200
-0 [005] dNs3. 220.318153: bpf_trace_printk: eth0-----I: CT-6 lookup from a97d428:50634
-0 [005] dNs3. 220.318154: bpf_trace_printk: eth0-----I: CT-6 lookup to ad0c90f:2200
-0 [005] dNs3. 220.318155: bpf_trace_printk: eth0-----I: CT-6 Miss for TCP SYN, NEW flow.
-0 [005] dNs3. 220.318156: bpf_trace_printk: eth0-----I: CT-6 result: NEW.
-0 [005] dNs3. 220.318156: bpf_trace_printk: eth0-----I: conntrack entry flags 0x0
-0 [005] dNs3. 220.318157: bpf_trace_printk: eth0-----I: NAT: 1st level lookup addr=ad0c90f port=2200 protocol=6.
-0 [005] dNs3. 220.318158: bpf_trace_printk: eth0-----I: NAT: Miss.
-0 [005] dNs3. 220.318160: bpf_trace_printk: eth0-----I: Host RPF check src=a97d428 skb iface=2 strict if 3
-0 [005] dNs3. 220.318161: bpf_trace_printk: eth0-----I: Host RPF check src=a97d428 skb iface=2 fib rc 0
-0 [005] dNs3. 220.318161: bpf_trace_printk: eth0-----I: Host RPF check src=a97d428 skb iface=2 result 0
-0 [005] dNs3. 220.318162: bpf_trace_printk: eth0-----I: Final result=DENY (0). Program execution time: 10037ns
The connection above was dropped by the RPF check; with bpfEnforceRPF=Disabled it is allowed:
-0 [005] d.s3. 6710.121268: bpf_trace_printk: eth0-----I: TCP; ports: s=52905 d=2200
-0 [005] d.s3. 6710.121269: bpf_trace_printk: eth0-----I: CT-6 lookup from a97d428:52905
-0 [005] d.s3. 6710.121270: bpf_trace_printk: eth0-----I: CT-6 lookup to ad0c90f:2200
-0 [005] d.s3. 6710.121271: bpf_trace_printk: eth0-----I: CT-6 Miss for TCP SYN, NEW flow.
-0 [005] d.s3. 6710.121274: bpf_trace_printk: eth0-----I: CT-6 result: NEW.
-0 [005] d.s3. 6710.121275: bpf_trace_printk: eth0-----I: conntrack entry flags 0x0
-0 [005] d.s3. 6710.121277: bpf_trace_printk: eth0-----I: NAT: 1st level lookup addr=ad0c90f port=2200 protocol=6.
-0 [005] d.s3. 6710.121280: bpf_trace_printk: eth0-----I: NAT: Miss.
-0 [005] d.s3. 6710.121282: bpf_trace_printk: eth0-----I: Host RPF check disabled
-0 [005] d.s3. 6710.121284: bpf_trace_printk: eth0-----I: Post-NAT dest IP is local host.
-0 [005] d.s3. 6710.121285: bpf_trace_printk: eth0-----I: About to jump to policy program.
-0 [005] d.s3. 6710.121285: bpf_trace_printk: eth0-----I: HEP with no policy, allow.
-0 [005] d.s3. 6710.121287: bpf_trace_printk: eth0-----I: Entering calico_tc_skb_accepted_entrypoint
-0 [005] d.s3. 6710.121288: bpf_trace_printk: eth0-----I: Entering calico_tc_skb_accepted
-0 [005] d.s3. 6710.121289: bpf_trace_printk: eth0-----I: src=a97d428 dst=ad0c90f
-0 [005] d.s3. 6710.121290: bpf_trace_printk: eth0-----I: post_nat=ad0c90f:2200
-0 [005] d.s3. 6710.121291: bpf_trace_printk: eth0-----I: tun_ip=0
-0 [005] d.s3. 6710.121297: bpf_trace_printk: eth0-----I: pol_rc=1
-0 [005] d.s3. 6710.121298: bpf_trace_printk: eth0-----I: sport=52905
-0 [005] d.s3. 6710.121299: bpf_trace_printk: eth0-----I: flags=24
-0 [005] d.s3. 6710.121300: bpf_trace_printk: eth0-----I: ct_rc=0
-0 [005] d.s3. 6710.121301: bpf_trace_printk: eth0-----I: ct_related=0
-0 [005] d.s3. 6710.121302: bpf_trace_printk: eth0-----I: mark=0x1000000
-0 [005] d.s3. 6710.121304: bpf_trace_printk: eth0-----I: ip->ttl 57
-0 [005] d.s3. 6710.121307: bpf_trace_printk: eth0-----I: Allowed by policy: ACCEPT
@Dimonyga Not sure whether this is related to the original issue. However, if you apply BPF programs to eth0 in this setup, then surely you cannot pass a strict RPF check, because routing says that the return path is via bond0 and not eth0. So bpfDataIfacePattern must not include eth0 and must include bond0; note that this is also much more logically correct. However, there is an issue: if you change the pattern, the programs on eth0 are not cleared. You can either remove them manually or reboot the nodes. That issue is addressed by https://github.com/projectcalico/calico/pull/7008
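Concretely, changing the pattern and clearing the stale programs might look like this (a sketch; Calico attaches its tc programs via a clsact qdisc, and deleting that qdisc on eth0 removes them, but verify on a test node first):

```shell
# Restrict the eBPF dataplane to the bond interfaces plus Calico's tunnel devices
calicoctl patch felixconfiguration default \
  --patch='{"spec": {"bpfDataIfacePattern": "^(bond.*|tunl0$|wireguard.cali$|vxlan.calico$)"}}'

# Until the fix in PR #7008 is available, manually clear the stale
# programs left attached to eth0 (or reboot the node instead)
sudo tc qdisc del dev eth0 clsact
```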
Sorry, my mistake; the setup is a little different:
eth0 (no IP) ---- bond0 (SUBNET1) ---- bond0.208@bond0 (SUBNET2) ---- application (port 2200)
When we start calico-node with bpfDataIfacePattern: ^(bond.*|tunl0$|wireguard.cali$|vxlan.calico$), access to SUBNET2 is denied. When I pass the bpfEnforceRPF: Disabled parameter, access is restored. In this case, the debug output shows that all packets that should go to bond0.208 are dropped at the bond0 level. Suggestion: skip packets with vlanid != 0.
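For reference, the workaround described above can be applied cluster-wide as follows (a sketch; bpfEnforceRPF accepts Disabled, Strict, or Loose in recent Calico versions, and note that disabling it relaxes source-address spoofing protection):

```shell
# Disable the eBPF host RPF check that drops the VLAN-tagged packets
calicoctl patch felixconfiguration default \
  --patch='{"spec": {"bpfEnforceRPF": "Disabled"}}'
```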