Felix clears XDP programs it doesn't own - XDP used in iptables mode
I'm using XDP to modify packet payloads directly on the datapath. Calico also uses XDP, for failsafe ports and the CIDR map.
On every Calico XDP resync period, my XDP program is wiped out and our system stops working properly.
Context
Felix wipes out the attached XDP program on each XDP resync period. Can I shut down the XDP functionality in Calico?
Your Environment
I'm using Kubernetes deployed with Kubespray. The Calico version is v3.15.2.
Problematic Code
Here is a code snippet from Felix: https://github.com/projectcalico/felix/blob/bbbd53863935730011636e6fa9652def9dc48586/dataplane/linux/int_dataplane.go#L449-L476
When xdpState is initialized, the XDP program is wiped out.
https://github.com/projectcalico/felix/blob/bbbd53863935730011636e6fa9652def9dc48586/dataplane/linux/int_dataplane.go#L1569-L1595
Again, on resync.
It should be possible to disable XDP in Calico, either by setting XDPEnabled=false in the felixconfig default object or by setting the FELIX_XDPENABLED=false in the env vars passed to Felix (or calico-node). As you've already discovered, this setting defaults to true.
I'm afraid I can't tell you how to do that in kubespray (they may or may not have plumbed through that configuration setting), but it's certainly possible in Calico.
https://docs.projectcalico.org/reference/felix/configuration#general-configuration has more information on Felix config options.
I'm afraid it doesn't work for me. Felix's XDPEnabled option only controls XDP acceleration in Felix. Regardless of whether XDPEnabled is true or false, the XDP refresh function is still active.
@fasaxc as the likely writer of that code, does that ^ match your expectations?
That sounds like a bug; we should only clean up our own XDP program. You can disable the periodic resync by setting XDPRefreshInterval to 0, but I think it will still do a resync at start of day.
Your best bet may be to upgrade to v3.17.1 and switch to BPF dataplane mode. In that mode, XDP is disabled because it conflicts with our other BPF programs.
I think v3.16.5 fixed this problem. I'll check exactly where the bug was fixed soon.
@fasaxc Is there any solution or workaround for this? And can you confirm that switching to eBPF mode resolves it?
@ptualek did you try upgrading to latest Calico? @sjh5205 said that that worked for them
@fasaxc we got an error on 3.18.1 and had to reboot the node to recover from the issue.
Hi, here is an update on this issue.
2021-06-30 22:30:21.839 [INFO][97] felix/xdp_state.go 561: Finished XDP resync. family=4 resyncDuration=36.561192ms
2021-06-30 22:33:17.228 [INFO][1185] felix/xdp_state.go 561: Finished XDP resync. family=4 resyncDuration=146.595408ms
2021-06-30 22:33:20.220 [INFO][1185] felix/xdp_state.go 192: Applying BPF actions did not succeed. Queueing XDP resyn. error=failed to remove XDP program from enp137s0: [remove /sys/fs/bpf/calico/xdp/prefilter_v1_enp137s0: no such file or directory remove /sys/fs/bpf/calico/xdp/prefilter_v1_enp137s0: no such file or directory remove /sys/fs/bpf/calico/xdp/prefilter_v1_enp137s0: no such file or directory]
2021-06-30 22:33:24.528 [INFO][1338] felix/xdp_state.go 561: Finished XDP resync. family=4 resyncDuration=139.259255ms
2021-06-30 22:33:28.534 [INFO][1467] felix/xdp_state.go 561: Finished XDP resync. family=4 resyncDuration=145.982141ms
2021-06-30 22:33:33.036 [INFO][1590] felix/xdp_state.go 561: Finished XDP resync. family=4 resyncDuration=60.157574ms
2021-06-30 22:33:37.137 [INFO][1693] felix/xdp_state.go 561: Finished XDP resync. family=4 resyncDuration=57.801501ms
2021-06-30 22:33:41.932 [INFO][1833] felix/xdp_state.go 561: Finished XDP resync. family=4 resyncDuration=119.836926ms
2021-06-30 22:33:46.037 [INFO][1936] felix/xdp_state.go 561: Finished XDP resync. family=4 resyncDuration=50.914551ms
2021-06-30 22:33:52.834 [INFO][2085] felix/xdp_state.go 561: Finished XDP resync. family=4 resyncDuration=145.410772ms
2021-06-30 22:33:55.760 [INFO][2085] felix/xdp_state.go 192: Applying BPF actions did not succeed. Queueing XDP resyn. error=failed to remove XDP program from enp137s0: [remove /sys/fs/bpf/calico/xdp/prefilter_v1_enp137s0: no such file or directory remove /sys/fs/bpf/calico/xdp/prefilter_v1_enp137s0: no such file or directory remove /sys/fs/bpf/calico/xdp/prefilter_v1_enp137s0: no such file or directory]
2021-06-30 22:34:00.426 [INFO][2238] felix/xdp_state.go 561: Finished XDP resync. family=4 resyncDuration=103.457177ms
2021-06-30 22:34:04.628 [INFO][2348] felix/xdp_state.go 561: Finished XDP resync. family=4 resyncDuration=134.138888ms
2021-06-30 22:34:09.117 [INFO][2477] felix/xdp_state.go 561: Finished XDP resync. family=4 resyncDuration=146.967208ms
2021-06-30 22:34:13.632 [INFO][2601] felix/xdp_state.go 561: Finished XDP resync. family=4 resyncDuration=130.155798ms
2021-06-30 22:34:18.237 [INFO][2707] felix/xdp_state.go 561: Finished XDP resync. family=4 resyncDuration=71.065205ms
2021-06-30 22:34:22.721 [INFO][2853] felix/xdp_state.go 561: Finished XDP resync. family=4 resyncDuration=127.665097ms
2021-06-30 22:34:27.218 [INFO][2959] felix/xdp_state.go 561: Finished XDP resync. family=4 resyncDuration=237.74996ms
2021-06-30 22:34:32.121 [INFO][3102] felix/xdp_state.go 561: Finished XDP resync. family=4 resyncDuration=136.839842ms
2021-06-30 22:34:36.332 [INFO][3208] felix/xdp_state.go 561: Finished XDP resync. family=4 resyncDuration=91.865865ms
2021-06-30 22:34:40.931 [INFO][3351] felix/xdp_state.go 561: Finished XDP resync. family=4 resyncDuration=88.466481ms
As the log above shows, Felix still runs the XDP cleanup and apply actions even though both the XDPEnabled and BPFEnabled options are 'false'. So I added a conditional check before XDPState is initialized.
Unfortunately projectcalico/felix#2882 isn't right because we need Felix to clean up its own XDP state when XDPEnabled is changed from true to false, and the way that works is (like with most Felix config changes):
- initially XDPEnabled is true
- it's changed to false, either in a new pod environment, or in the datastore
- either way, Felix restarts, and XDPEnabled is now false
- now Felix needs to clean up the state that it established when XDPEnabled was true.
I think we need a more surgical change that only cleans up the maps and programs that Felix installed, and not anyone else's XDP programs.
@neiljerram I think the name of the XDP program could be helpful. Here's an example from my own system.
85: sched_cls tag 95c21ac20566da99 gpl
    loaded_at 2021-07-02T14:29:48+0900 uid 0
    xlated 288B jited 186B memlock 4096B
98: socket_filter name _iptables_tstmp tag 2c571d1de4c2e959 gpl
    loaded_at 2021-07-02T14:38:27+0900 uid 0
    xlated 96B jited 75B memlock 4096B
109: xdp name xdp_print_times tag 095a832833cf21f6 gpl
    loaded_at 2021-07-05T20:37:07+0900 uid 0
    xlated 208B jited 142B memlock 4096B
If we use a specific prefix, for example '__calico', we will be able to identify which XDP programs can be removed and which cannot.
@sjh5205 Agreed, you can see the name that we are currently using here:
func getProgName(ifName string) string {
	return fmt.Sprintf("prefilter_%s_%s", xdpProgVersion, ifName)
}
The WipeXDP function calls down into RemoveXDP, in bpf/bpf.go, which currently does not check the name of the program before removing it.
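A minimal sketch of what such a check might look like, not the actual Felix code: the wrapper removeIfOwned, the error errNotOwned, and the remove callback are hypothetical, and xdpProgVersion is assumed to be "v1" based on the pin path in the log above. It also assumes the caller can obtain an identifier for the attached program that is comparable to the output of getProgName (for example the pin file name), which the thread does not confirm.

```go
package main

import (
	"errors"
	"fmt"
)

// xdpProgVersion is assumed to be "v1", inferred from the pin path
// prefilter_v1_enp137s0 in the log; the real constant lives in Felix.
const xdpProgVersion = "v1"

// getProgName is copied from the Felix snippet quoted above.
func getProgName(ifName string) string {
	return fmt.Sprintf("prefilter_%s_%s", xdpProgVersion, ifName)
}

// errNotOwned (hypothetical) signals that removal was skipped.
var errNotOwned = errors.New("attached XDP program is not Calico's")

// removeIfOwned (hypothetical) calls remove only when the attached
// program's identifier matches the name Felix would have used.
func removeIfOwned(ifName, attachedName string, remove func(ifName string) error) error {
	if attachedName != getProgName(ifName) {
		return errNotOwned
	}
	return remove(ifName)
}

func main() {
	remove := func(ifName string) error {
		fmt.Println("removing XDP program from", ifName)
		return nil
	}
	// A foreign program is left in place and an error is returned.
	fmt.Println(removeIfOwned("enp137s0", "xdp_print_times", remove))
	// A program with Felix's expected name is removed.
	fmt.Println(removeIfOwned("enp137s0", "prefilter_v1_enp137s0", remove))
}
```

Returning a distinct error rather than silently skipping lets the caller decide whether a foreign program is worth logging. Note that the kernel truncates BPF object names to 15 characters, so a comparison against names reported by the kernel would need to account for that.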
@sjh5205 I have the same issue. Are you still working on the change? The v2 PR seems stuck.
Any updates on this? As it seems like the previous PR attempt by @sjh5205 has become stale, I would love to take this over and create a new PR directly in Calico (the Felix fork that was used is deprecated as Felix was moved to Calico itself).