Cilium dropping IPIP packets w/ unknown drop reason of 119
Is there an existing issue for this?
- [X] I have searched the existing issues
What happened?
Cilium is dropping packets with an unknown drop reason. Expected behavior: not error code 119, but something meaningful (e.g. an indication that it is a misconfiguration).
Cilium Version
Client: 1.15.1 a368c8f0 2024-02-14T22:16:57+00:00 go version go1.21.6 linux/amd64 Daemon: 1.15.1 a368c8f0 2024-02-14T22:16:57+00:00 go version go1.21.6 linux/amd64
Kernel Version
Linux dfw5a-rg19-9b 5.15.0-73-generic #80-Ubuntu SMP Mon May 15 15:18:26 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Kubernetes Version
Client Version: v1.28.5 Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3 Server Version: v1.28.5
Regression
No response
Sysdump
No response
Relevant log output
xx drop (119, 0) flow 0x94b1cf61 to endpoint 2125, ifindex 34, file bpf_lxc.c:251, , identity world->10294: 10.80.84.41:28757 -> 10.220.23.10:3991 tcp SYN
xx drop (119, 0) flow 0x8a358f62 to endpoint 1349, ifindex 33, file bpf_lxc.c:251, , identity world->29312: 10.80.84.41:26331 -> 10.220.23.10:3991 tcp SYN
xx drop (119, 0) flow 0xdcd19bbf to endpoint 2125, ifindex 34, file bpf_lxc.c:251, , identity world->10294: 10.80.82.54:16255 -> 10.220.23.10:3991 tcp SYN
xx drop (119, 0) flow 0xc255dbbc to endpoint 1349, ifindex 33, file bpf_lxc.c:251, , identity world->29312: 10.80.82.54:16167 -> 10.220.23.10:3991 tcp SYN
xx drop (119, 0) flow 0xff1a3516 to endpoint 3503, ifindex 32, file bpf_lxc.c:251, , identity world->32410: 10.80.107.38:16053 -> 10.220.23.9:3991 tcp SYN
Anything else?
Environment where it is happening:
An LB (not controlled by Cilium) is sending IPIP packets to the pod/k8s cluster where we have Cilium installed. Cilium is running with the default configuration. The flows in the logs above (e.g. 10.80.107.38:xxx -> 10.220.23.9:3991) are from the payload of the IPIP packets (i.e. the inner packets).
It feels like the drop happens somewhere around here: https://github.com/cilium/cilium/blob/v1.15.1/bpf/bpf_lxc.c#L283 https://github.com/cilium/cilium/blob/v1.15.1/bpf/lib/conntrack.h#L884 https://github.com/cilium/cilium/blob/v1.15.1/bpf/lib/conntrack.h#L715
ct_extract_ports4 does not have a case for IPIP, and 119 is 256 - DROP_CT_UNKNOWN_PROTO (137), but so far I have failed to find how/where this could be miscalculated.
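For illustration, a standalone sketch (not Cilium code, just my assumption of the wraparound): Cilium's DROP_* codes are negative, so if DROP_CT_UNKNOWN_PROTO (-137) gets truncated into an unsigned 8-bit field it wraps to 256 - 137 = 119, which matches the "drop (119, 0)" lines above.
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int ret = -137;                /* e.g. a negative DROP_CT_UNKNOWN_PROTO */
    uint8_t reason = (uint8_t)ret; /* conversion to unsigned is modulo 256 */

    printf("%d -> %u\n", ret, reason); /* prints: -137 -> 119 */
    return 0;
}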
Also, in general it is unclear why the logs show a line for the inner flow while ct_lookup is seemingly being done against the outer IPIP packet (a theory; unfortunately, even with debug-verbose datapath there are zero log lines related to this).
Does Cilium even support passing IPIP from an external load balancer (e.g. IPVS)?
Cilium Users Document
- [ ] Are you a user of Cilium? Please add yourself to the Users doc
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
@tehnerd interesting! Would you be able to capture a pwru trace of an affected packet?
Yes. How to do this? I actually have a repro in dev environment so can take any debug info required
Oh nvm. Missed that this is a link (on mobile). Will try to do today
I haven't run pwru yet, but I've confirmed that the drop is indeed in https://github.com/cilium/cilium/blob/v1.15.1/bpf/lib/conntrack.h#L761
I've added a printk there:
default:
printk("drop ct unknown proto\n");
/* Can't handle extension headers yet */
return DROP_CT_UNKNOWN_PROTO;
and in the bpf trace log:
gping-202700 [001] b.s1. 7402.652876: bpf_trace_printk: in ct extract ports
gping-202700 [001] b.s1. 7402.652894: bpf_trace_printk: drop ct unknown proto
gping-202700 [001] b.s1. 7402.652896: bpf_trace_printk: sending drop notification
(tests are running against the latest commit on GitHub)
@squeed pwru output:
tehnerd:~/gh/cilium$ sudo ../pwru/pwru 'proto 4'
2024/05/14 17:13:29 Attaching kprobes (via kprobe-multi)...
1554 / 1554 [-----------------------------------------------------------------------------------] 100.00% ? p/s
2024/05/14 17:13:29 Attached (ignored 0)
2024/05/14 17:13:29 Listening for events..
SKB CPU PROCESS FUNC
0xffff9f7db6178200 4 [gping:223270] packet_parse_headers
0xffff9f7db6178200 4 [gping:223270] packet_xmit
0xffff9f7db6178200 4 [gping:223270] __dev_queue_xmit
0xffff9f7db6178200 4 [gping:223270] qdisc_pkt_len_init
0xffff9f7db6178200 4 [gping:223270] netdev_core_pick_tx
0xffff9f7db6178200 4 [gping:223270] validate_xmit_skb
0xffff9f7db6178200 4 [gping:223270] netif_skb_features
0xffff9f7db6178200 4 [gping:223270] passthru_features_check
0xffff9f7db6178200 4 [gping:223270] skb_network_protocol
0xffff9f7db6178200 4 [gping:223270] validate_xmit_xfrm
0xffff9f7db6178200 4 [gping:223270] dev_hard_start_xmit
0xffff9f7db6178200 4 [gping:223270] dev_queue_xmit_nit
0xffff9f7db6178200 4 [gping:223270] skb_pull
0xffff9f7db6178200 4 [gping:223270] nf_hook_slow
0xffff9f7db6178200 4 [gping:223270] skb_push
0xffff9f7db6178200 4 [gping:223270] __dev_queue_xmit
0xffff9f7db6178200 4 [gping:223270] qdisc_pkt_len_init
0xffff9f7db6178200 4 [gping:223270] netdev_core_pick_tx
0xffff9f7db6178200 4 [gping:223270] validate_xmit_skb
0xffff9f7db6178200 4 [gping:223270] netif_skb_features
0xffff9f7db6178200 4 [gping:223270] passthru_features_check
0xffff9f7db6178200 4 [gping:223270] skb_network_protocol
0xffff9f7db6178200 4 [gping:223270] validate_xmit_xfrm
0xffff9f7db6178200 4 [gping:223270] dev_hard_start_xmit
0xffff9f7db6178200 4 [gping:223270] skb_clone_tx_timestamp
0xffff9f7db6178200 4 [gping:223270] __dev_forward_skb
0xffff9f7db6178200 4 [gping:223270] __dev_forward_skb2
0xffff9f7db6178200 4 [gping:223270] skb_scrub_packet
0xffff9f7db6178200 4 [gping:223270] eth_type_trans
0xffff9f7db6178200 4 [gping:223270] __netif_rx
0xffff9f7db6178200 4 [gping:223270] netif_rx_internal
0xffff9f7db6178200 4 [gping:223270] enqueue_to_backlog
0xffff9f7db6178200 4 [gping:223270] __netif_receive_skb
0xffff9f7db6178200 4 [gping:223270] __netif_receive_skb_one_core
0xffff9f7db6178200 4 [gping:223270] tcf_classify
0xffff9f7db6178200 4 [gping:223270] skb_ensure_writable
0xffff9f7db6178200 4 [gping:223270] ip_rcv
0xffff9f7db6178200 4 [gping:223270] ip_rcv_core
0xffff9f7db6178200 4 [gping:223270] sock_wfree
0xffff9f7db6178200 4 [gping:223270] nf_hook_slow
0xffff9f7db6178200 4 [gping:223270] ip_route_input_noref
0xffff9f7db6178200 4 [gping:223270] ip_route_input_slow
0xffff9f7db6178200 4 [gping:223270] __mkroute_input
0xffff9f7db6178200 4 [gping:223270] fib_validate_source
0xffff9f7db6178200 4 [gping:223270] __fib_validate_source
0xffff9f7db6178200 4 [gping:223270] ip_forward
0xffff9f7db6178200 4 [gping:223270] nf_hook_slow
0xffff9f7db6178200 4 [gping:223270] ip_forward_finish
0xffff9f7db6178200 4 [gping:223270] ip_output
0xffff9f7db6178200 4 [gping:223270] nf_hook_slow
0xffff9f7db6178200 4 [gping:223270] apparmor_ip_postroute
0xffff9f7db6178200 4 [gping:223270] ip_finish_output
0xffff9f7db6178200 4 [gping:223270] __ip_finish_output
0xffff9f7db6178200 4 [gping:223270] ip_finish_output2
0xffff9f7db6178200 4 [gping:223270] __dev_queue_xmit
0xffff9f7db6178200 4 [gping:223270] qdisc_pkt_len_init
0xffff9f7db6178200 4 [gping:223270] tcf_classify
0xffff9f7db6178200 4 [gping:223270] skb_ensure_writable
0xffff9f7db6178200 4 [gping:223270] skb_ensure_writable
0xffff9f7db6178200 4 [gping:223270] skb_ensure_writable
0xffff9f7db6178200 4 [gping:223270] skb_ensure_writable
0xffff9f7db6178200 4 [gping:223270] skb_ensure_writable
0xffff9f7db6178200 4 [gping:223270] kfree_skb_reason(SKB_DROP_REASON_TC_EGRESS)
0xffff9f7db6178200 4 [gping:223270] skb_release_head_state
0xffff9f7db6178200 4 [gping:223270] skb_release_data
0xffff9f7db6178200 4 [gping:223270] skb_free_head
0xffff9f7db6178200 4 [gping:223270] kfree_skbmem
^C2024/05/14 17:13:40 Received signal, exiting program..
2024/05/14 17:13:40 Detaching kprobes...
5 / 5 [----------------------------------------------------------------------------------------] 100.00% 20 p/s
~/gh/cilium$
and sending an IPIP (IPv4-in-IPv4) packet from the dev server to the k8s pod, which is running with kind on the same dev server (Cilium is installed on that cluster with the default config as generated during make kind from the Cilium dev docs).
pwru.txt: pwru run with more flags:
sudo ../pwru/pwru 'proto 4' --output-tuple --output-stack --output-skb --output-meta --output-file /tmp/pwru.txt
generated packet was:
outer destination of ipip: 10.244.1.205
inner destination of ipip: 10.244.1.205
inner source of ipip: 192.168.14.14
outer source of ipip: 10.11.12.13
sport 31337
dport 80
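For reference, a minimal standalone sketch of one way such a packet could be crafted with a raw socket (this is an assumption/illustration, not the gping tool used in the traces above; the TCP checksum is left at zero, which is still enough to exercise the BPF conntrack path being debugged):
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/ip.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

/* RFC 1071 style checksum, used here only for the inner IPv4 header. */
static unsigned short csum(const void *buf, int len)
{
    const unsigned short *p = buf;
    unsigned long sum = 0;

    for (; len > 1; len -= 2)
        sum += *p++;
    if (len)
        sum += *(const unsigned char *)p;
    sum = (sum >> 16) + (sum & 0xffff);
    sum += sum >> 16;
    return (unsigned short)~sum;
}

int main(void)
{
    unsigned char pkt[2 * sizeof(struct iphdr) + sizeof(struct tcphdr)] = {0};
    struct iphdr *outer = (struct iphdr *)pkt;
    struct iphdr *inner = (struct iphdr *)(pkt + sizeof(*outer));
    struct tcphdr *tcp = (struct tcphdr *)(pkt + 2 * sizeof(*outer));
    struct sockaddr_in dst = { .sin_family = AF_INET };
    int fd = socket(AF_INET, SOCK_RAW, IPPROTO_RAW); /* needs CAP_NET_RAW */

    if (fd < 0) {
        perror("socket");
        return 1;
    }

    /* Inner IPv4 header: 192.168.14.14 -> 10.244.1.205, proto TCP */
    inner->version = 4;
    inner->ihl = 5;
    inner->ttl = 64;
    inner->protocol = IPPROTO_TCP;
    inner->tot_len = htons(sizeof(*inner) + sizeof(*tcp));
    inner->saddr = inet_addr("192.168.14.14");
    inner->daddr = inet_addr("10.244.1.205");
    inner->check = csum(inner, sizeof(*inner));

    /* Inner TCP SYN 31337 -> 80; checksum intentionally left zero here. */
    tcp->source = htons(31337);
    tcp->dest = htons(80);
    tcp->doff = 5;
    tcp->syn = 1;
    tcp->window = htons(1024);

    /* Outer IPv4 header: 10.11.12.13 -> 10.244.1.205, proto IPIP (4).
     * With IPPROTO_RAW the kernel fills in the outer checksum/ID/length. */
    outer->version = 4;
    outer->ihl = 5;
    outer->ttl = 64;
    outer->protocol = IPPROTO_IPIP;
    outer->tot_len = htons(sizeof(pkt));
    outer->saddr = inet_addr("10.11.12.13");
    outer->daddr = inet_addr("10.244.1.205");

    dst.sin_addr.s_addr = outer->daddr;
    if (sendto(fd, pkt, sizeof(pkt), 0,
               (struct sockaddr *)&dst, sizeof(dst)) < 0)
        perror("sendto");
    close(fd);
    return 0;
}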
So with a patch like
if (ct_buffer.ret < 0) \
- return drop_for_direction(ctx, DIR, ct_buffer.ret, ext_err); \
+ return drop_for_direction(ctx, DIR, -ct_buffer.ret, ext_err); \
if (map_update_elem(&CT_TAIL_CALL_BUFFER4, &zero, &ct_buffer, 0) < 0) \
(it seems the issue is that ct_buffer.ret is an int, but drop_for_direction expects an unsigned value, so we have a problem translating two's complement to unsigned)
I've got:
xx drop (CT: Unknown L4 protocol) flow 0x0 to endpoint 3294, ifindex 3, file bpf_lxc.c:248, , identity world-ipv4->50250: 192.168.14.14:31337 -> 10.244.1.205:80 tcp SYN
xx drop (CT: Unknown L4 protocol) flow 0x0 to endpoint 3294, ifindex 3, file bpf_lxc.c:248, , identity world-ipv4->50250: 192.168.14.14:31337 -> 10.244.1.205:80 tcp SYN
xx drop (CT: Unknown L4 protocol) flow 0x0 to endpoint 3294, ifindex 3, file bpf_lxc.c:248, , identity world-ipv4->50250: 192.168.14.14:31337 -> 10.244.1.205:80 tcp SYN
as expected. But the question is: is there any config option for Cilium to pass IPIP (i.e. so that conntrack checks against the inner packet, not the outer IPIP header)? I thought it was supported.
So I made this work by recalculating offsets so that the lookup uses the inner IPv4 header and transport ports. But I have no idea what this could possibly break, so I wonder who could give us more info on how IPIP is supposed to be processed on the ingress side.
Changes which made this work (for IPv4; this is just to continue the discussion on what to do with IPIP. Maybe there is a config option which allows the same, i.e. allowing ingress IPIP into a pod which is running Cilium):
__u32 zero = 0; \
- void *map; \
- \
+ void *map; \
+ int off; \
ct_state = (struct ct_state *)&ct_buffer.ct_state; \
tuple = (struct ipv4_ct_tuple *)&ct_buffer.tuple; \
\
if (!revalidate_data(ctx, &data, &data_end, &ip4)) \
return drop_for_direction(ctx, DIR, DROP_INVALID, ext_err); \
\
+ off = ETH_HLEN; \
tuple->nexthdr = ip4->protocol; \
+ if (tuple->nexthdr == IPPROTO_IPIP) { \
+ printk("IPIP\n"); \
+ off = off + 20; \
+ if (!revalidate_data_l3_off(ctx, &data, &data_end, &ip4, off)) { \
+ printk("drop ipip with invalid size\n"); \
+ return drop_for_direction(ctx, DIR, DROP_INVALID, ext_err); \
+ } \
+ tuple->nexthdr = ip4->protocol; \
+ } \
tuple->daddr = ip4->daddr; \
tuple->saddr = ip4->saddr; \
- ct_buffer.l4_off = ETH_HLEN + ipv4_hdrlen(ip4); \
+ ct_buffer.l4_off = off + ipv4_hdrlen(ip4); \
I think @borkmann already has an implementation for this, but we have a tunnel iface on each node which we pass as --device along with eth0 😉
Hi @tehnerd great to see you here! :) Do you expect the inbound LB traffic to be terminated in hostns of the Cilium nodes? Some time ago I added https://github.com/cilium/cilium/pull/31213 which just sets up an ipip device to do the former.
Hey, Daniel! In our setup each container (aka pod) has tunl interface in its namespace. So we terminate ipip there
Hey, Daniel! In our setup each container (aka pod) has tunl interface in its namespace. So we terminate ipip there
Ok, so that is currently not supported and needs to be extended in Cilium. I had some old code in https://github.com/cilium/cilium/pull/30547/commits for extracting the inner tuple for the service lookup; maybe it can be of help, or the diff above could be properly cooked into a patch.
Yeah, I think recalculating the offset as I proposed above seems easier, and in our internal setup it actually works as expected (at least all FW features seem to work as expected on inner packets). OK, I think I will put something together in a few weeks; I just need to run more internal tests to make sure nothing else is required.
I think I am facing the same or a related issue:
As part of organization policy, we use https://github.com/facebookincubator/katran as part of our edge fabric.
I've managed to get traffic from Katran to an ingress pod with hostNetwork: true and the following setup on the host:
#!/bin/bash
DEVICE=`ip route get 10.1.1.1 | grep dev | awk '{ print $5 }'`
GWIP=`ip route | grep default | awk '{ print $3 }'`
GWIP6=`ip -6 route | grep default | awk '{ print $3 }'`
MSS=`ip link show $DEVICE | grep mtu | awk '{ print $5-100 }'`
ip route change default via $GWIP advmss $MSS
ip -6 route change default via $GWIP6 advmss $MSS
ip addr add 10.15.12.33 dev lo
route add -host 10.15.12.33 dev lo
ip link add name ipip0 type ipip external
ip link set up ipip0
ip link set up tunl0
ip link add name ipip60 type ip6tnl external
ip link set up dev ipip60
ip link set up dev ip6tnl0
So, when I use the node IP as the real (upstream in Katran terms) and 10.15.12.33 as the VIP, traffic correctly routes to the ingress pod.
Using hostNetwork is undesirable for scaling reasons, so my goal is to make it work with the Cilium LB. For this setup I've dropped the VIP from lo and created the following Service:
apiVersion: v1
kind: Service
metadata:
  name: katran
  namespace: istio-system
  labels:
    ...
  annotations:
    "io.cilium/lb-ipam-ips": "10.15.12.33,someIPv6Address"
spec:
  selector:
    app: istio-ingressgateway
    istio: ingressgateway
  ports:
    - name: status-port
      protocol: TCP
      port: 15021
      targetPort: 15021
    - name: http2
      protocol: TCP
      port: 80
      targetPort: 8080
    - name: https
      protocol: TCP
      port: 443
      targetPort: 8443
  type: LoadBalancer
  externalTrafficPolicy: Local
  ipFamilies:
    - IPv4
    - IPv6
  ipFamilyPolicy: PreferDualStack
  allocateLoadBalancerNodePorts: false
  internalTrafficPolicy: Cluster
Both LB IPs are announced via BGP policy, but the VIP (10.15.12.33) is denied by the router, so it does not conflict with Katran's advertisements. Now, when I change Katran's real to someIPv6Address, I can observe incoming traffic with tcpdump and cilium-dbg monitor.
19:04:03.194239 IP6 (flowlabel 0x35156, hlim 5, next-header IPIP (4) payload length: 60) SrcIpv6::2 > LbIPb6: IP (tos 0x0, ttl 64, id 6251, offset 0, flags [DF], proto TCP (6), length 60)
10.254.0.253.56383 > 10.15.12.33.http: Flags [S], cksum 0xd893 (correct), seq 4135035203, win 42340, options [mss 1460,sackOK,TS val 3985720831 ecr 0,nop,wscale 8], length 0
cilium-dbg monitor | grep 10.15.12.33
Listening for events on 96 CPUs with 64x4096 of shared memory
Press Ctrl-C to quit
time="2024-06-17T17:18:39Z" level=info msg="Initializing dissection cache..." subsys=monitor
-> network flow 0x34520aa0 , identity unknown->unknown state unknown ifindex ens13f0np0 orig-ip 0.0.0.0: 10.254.0.253:38485 -> 10.15.12.33:80 tcp SYN
-> network flow 0x34520aa0 , identity unknown->unknown state unknown ifindex ens13f0np0 orig-ip 0.0.0.0: 10.254.0.253:38485 -> 10.15.12.33:80 tcp SYN
-> network flow 0x34520aa0 , identity unknown->unknown state unknown ifindex ens13f0np0 orig-ip 0.0.0.0: 10.254.0.253:38485 -> 10.15.12.33:80 tcp SYN
-> network flow 0x34520aa0 , identity unknown->unknown state unknown ifindex ens13f0np0 orig-ip 0.0.0.0: 10.254.0.253:38485 -> 10.15.12.33:80 tcp SYN
-> network flow 0x34520aa0 , identity unknown->unknown state unknown ifindex ens13f0np0 orig-ip 0.0.0.0: 10.254.0.253:38485 -> 10.15.12.33:80 tcp SYN
cilium-dbg service list | grep 10.15.12.33
507 10.15.12.33:15021 LoadBalancer 1 => 172.16.17.215:15021 (active)
508 10.15.12.33:80 LoadBalancer 1 => 172.16.17.215:8080 (active)
509 10.15.12.33:443 LoadBalancer 1 => 172.16.17.215:8443 (active)
519 10.15.12.33:15021/i LoadBalancer 1 => 172.16.17.215:15021 (active)
520 10.15.12.33:80/i LoadBalancer 1 => 172.16.17.215:8080 (active)
521 10.15.12.33:443/i LoadBalancer 1 => 172.16.17.215:8443 (active)
cilium-dbg endpoint list | grep 172.16.17.215
1319 Disabled Disabled 95584 k8s:app=istio-ingressgateway PodIpv6 172.16.17.215 ready
There are no drops in cilium-dbg monitor output, but no traffic reaches the pod.
@borkmann, may I ask you why https://github.com/cilium/cilium/pull/30547 did not get merged? As far as I understand, 'external Cilium in L4LB mode' works exactly like Katran does in our case. IPIP termination in the worker node NS is also exactly the use case we are planning to use as part of our flow.
So, the complete flow that we are expecting is:
- ClientIPv4 -> Katran VIPv4
- KatranNodeIPv6 -> LoadBalancer IPv6 [inner ClientIPv4 -> VIPv4, where VIPv4 can also be defined in the service spec if necessary]
- k8s node terminates IPIP/IP6IP6/IP6IP and gets ClientIPv4 -> VIPv4
- traffic flows like a regular LB IP to the Pod and then gets sent back to the client as VIPv4 -> ClientIPv4 (DSR)
Thanks.
FYI, I extracted that drop_for_direction fix into https://github.com/cilium/cilium/pull/33551. Thanks for tracing it down @tehnerd!
Thanks. We are mostly done with testing and I'm planning to upstream our internal patches in the next few weeks. One less thing to do :)
@julianwiedmann @tehnerd Is this issue resolved, should we close it?
The 119 issue is resolved. IPIP is still not supported upstream, but we have some internal patches, we are testing them with v4, and things look promising right now. Hopefully I will upstream them at some point. We can close the issue.
Hey, Daniel! In our setup each container (aka pod) has tunl interface in its namespace. So we terminate ipip there
@tehnerd We have a similar setup, by the way - each pod has a tunnel interface, and the termination is done there. We are considering switching from Calico to Cilium, and in the scope of a PoC the same issue was raised: Cilium drops IPIP packets from the external load balancer (IPVS).
xx drop (CT: Unknown L4 protocol) flow 0x0 to endpoint 143, ifindex 7, file bpf_lxc.c:263, , identity remote-node->26173: *.*.*.*:61172 -> *.*.*.*:8080 tcp SYN
Are there any updates on your ongoing work on resolving this issue?
We have a custom patch and it works just fine for us. I just need to find time to do proper upstreaming... Honestly, with the amount of other work, it is most likely not going to happen soon.
Gotcha. Thanks for your response @tehnerd. Would you at least be able to share the patch so we can try it out?
Yeah, I will post it a bit later. We made it work for IPv4, but it is trivial to do the same for v6 if required. The overall idea is that if we receive IPIP, we apply lookups and FW rules to the inner packet.
https://gist.github.com/tehnerd/f217e4ebf75d08f8d6dc3ffee4392ae4 (on top of version 1.15)
Thank you @tehnerd. I will give it a try.
@tehnerd I verified the patch you shared, and it works as expected. With it, all the lookups and firewall rules are done for the inner IPIP packet; the IPIP packet is terminated in Pods' NS.
The only change that caused connections between pods to time out was the caching you've added, so I reverted it and everything went well:
@@ -514,6 +514,7 @@ resolve_srcid_ipv4(struct __ctx_buff *ctx, struct iphdr *ip4,
const bool from_host)
{
__u32 src_id = WORLD_IPV4_ID, srcid_from_ipcache = srcid_from_proxy;
+ bool cache_entry_found = false;
struct remote_endpoint_info *info = NULL;
/* Packets from the proxy will already have a real identity. */
@@ -531,16 +532,20 @@ resolve_srcid_ipv4(struct __ctx_buff *ctx, struct iphdr *ip4,
* the host. So we can ignore the ipcache if it
* reports the source as HOST_ID.
*/
- if (*sec_identity != HOST_ID)
+ if (*sec_identity != HOST_ID) {
+ cache_entry_found = true;
srcid_from_ipcache = *sec_identity;
+ }
}
}
cilium_dbg(ctx, info ? DBG_IP_ID_MAP_SUCCEED4 : DBG_IP_ID_MAP_FAILED4,
ip4->saddr, srcid_from_ipcache);
+
}
-
- if (from_host)
+
+ if (from_host || cache_entry_found) {
src_id = srcid_from_ipcache;
+ }