
Cilium dropping IPIP packets w/ unknown drop reason of 119

Open tehnerd opened this issue 1 year ago • 16 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

What happened?

Cilium is dropping packets with an unknown drop reason. Expected behavior: not error code 119, but a meaningful drop reason instead (e.g. indicating a misconfiguration).

Cilium Version

Client: 1.15.1 a368c8f0 2024-02-14T22:16:57+00:00 go version go1.21.6 linux/amd64 Daemon: 1.15.1 a368c8f0 2024-02-14T22:16:57+00:00 go version go1.21.6 linux/amd64

Kernel Version

Linux dfw5a-rg19-9b 5.15.0-73-generic #80-Ubuntu SMP Mon May 15 15:18:26 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Kubernetes Version

Client Version: v1.28.5 Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3 Server Version: v1.28.5

Regression

No response

Sysdump

No response

Relevant log output

xx drop (119, 0) flow 0x94b1cf61 to endpoint 2125, ifindex 34, file bpf_lxc.c:251, , identity world->10294: 10.80.84.41:28757 -> 10.220.23.10:3991 tcp SYN
xx drop (119, 0) flow 0x8a358f62 to endpoint 1349, ifindex 33, file bpf_lxc.c:251, , identity world->29312: 10.80.84.41:26331 -> 10.220.23.10:3991 tcp SYN
xx drop (119, 0) flow 0xdcd19bbf to endpoint 2125, ifindex 34, file bpf_lxc.c:251, , identity world->10294: 10.80.82.54:16255 -> 10.220.23.10:3991 tcp SYN
xx drop (119, 0) flow 0xc255dbbc to endpoint 1349, ifindex 33, file bpf_lxc.c:251, , identity world->29312: 10.80.82.54:16167 -> 10.220.23.10:3991 tcp SYN
xx drop (119, 0) flow 0xff1a3516 to endpoint 3503, ifindex 32, file bpf_lxc.c:251, , identity world->32410: 10.80.107.38:16053 -> 10.220.23.9:3991 tcp SYN

Anything else?

environment where it is happening:

An LB (not controlled by Cilium) is sending IPIP packets to a pod in the k8s cluster where Cilium is installed. Cilium runs with the default configuration. The flows in the logs above (e.g. 10.80.107.38:xxx -> 10.220.23.9:3991) are taken from the payload of the IPIP packets (i.e. the inner packets).

It feels like the drop happens somewhere around here: https://github.com/cilium/cilium/blob/v1.15.1/bpf/bpf_lxc.c#L283 https://github.com/cilium/cilium/blob/v1.15.1/bpf/lib/conntrack.h#L884 https://github.com/cilium/cilium/blob/v1.15.1/bpf/lib/conntrack.h#L715

since ct_extract_ports4 does not have a case for IPIP, and 119 is 256 - DROP_CT_UNKNOWN_PROTO (256 - 137 = 119), but so far I have failed to find how/where this value gets miscalculated.

Also, in general it is unclear why the logs show a line for the inner flow while ct_lookup is (in theory) being done against the IPIP packet; unfortunately, even with debug-verbose datapath enabled there are zero log lines related to this.

Does Cilium even support passing IPIP from an external load balancer (e.g. IPVS)?

Cilium Users Document

  • [ ] Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

tehnerd avatar May 10 '24 18:05 tehnerd

@tehnerd interesting! Would you be able to capture a pwru trace of an affected packet?

squeed avatar May 14 '24 10:05 squeed

Yes. How do I do this? I actually have a repro in a dev environment, so I can collect any debug info required.

tehnerd avatar May 14 '24 14:05 tehnerd

Oh, never mind. I missed that this is a link (I'm on mobile). Will try to do it today.

tehnerd avatar May 14 '24 14:05 tehnerd

I haven't run pwru yet, but I've confirmed that the drop is indeed in https://github.com/cilium/cilium/blob/v1.15.1/bpf/lib/conntrack.h#L761

I've added a printk there:

default:
		printk("drop ct unknown proto\n");
		/* Can't handle extension headers yet */
		return DROP_CT_UNKNOWN_PROTO;

and in the bpf trace log (e.g. /sys/kernel/debug/tracing/trace_pipe):

           gping-202700  [001] b.s1.  7402.652876: bpf_trace_printk: in ct extract ports

           gping-202700  [001] b.s1.  7402.652894: bpf_trace_printk: drop ct unknown proto

           gping-202700  [001] b.s1.  7402.652896: bpf_trace_printk: sending drop notification

(tests are running against the latest commit on GitHub)

tehnerd avatar May 14 '24 17:05 tehnerd

@squeed pwru output:

tehnerd:~/gh/cilium$ sudo ../pwru/pwru 'proto 4'
2024/05/14 17:13:29 Attaching kprobes (via kprobe-multi)...
1554 / 1554 [-----------------------------------------------------------------------------------] 100.00% ? p/s
2024/05/14 17:13:29 Attached (ignored 0)
2024/05/14 17:13:29 Listening for events..
               SKB    CPU          PROCESS                     FUNC
0xffff9f7db6178200      4   [gping:223270]     packet_parse_headers
0xffff9f7db6178200      4   [gping:223270]              packet_xmit
0xffff9f7db6178200      4   [gping:223270]         __dev_queue_xmit
0xffff9f7db6178200      4   [gping:223270]       qdisc_pkt_len_init
0xffff9f7db6178200      4   [gping:223270]      netdev_core_pick_tx
0xffff9f7db6178200      4   [gping:223270]        validate_xmit_skb
0xffff9f7db6178200      4   [gping:223270]       netif_skb_features
0xffff9f7db6178200      4   [gping:223270]  passthru_features_check
0xffff9f7db6178200      4   [gping:223270]     skb_network_protocol
0xffff9f7db6178200      4   [gping:223270]       validate_xmit_xfrm
0xffff9f7db6178200      4   [gping:223270]      dev_hard_start_xmit
0xffff9f7db6178200      4   [gping:223270]       dev_queue_xmit_nit
0xffff9f7db6178200      4   [gping:223270]                 skb_pull
0xffff9f7db6178200      4   [gping:223270]             nf_hook_slow
0xffff9f7db6178200      4   [gping:223270]                 skb_push
0xffff9f7db6178200      4   [gping:223270]         __dev_queue_xmit
0xffff9f7db6178200      4   [gping:223270]       qdisc_pkt_len_init
0xffff9f7db6178200      4   [gping:223270]      netdev_core_pick_tx
0xffff9f7db6178200      4   [gping:223270]        validate_xmit_skb
0xffff9f7db6178200      4   [gping:223270]       netif_skb_features
0xffff9f7db6178200      4   [gping:223270]  passthru_features_check
0xffff9f7db6178200      4   [gping:223270]     skb_network_protocol
0xffff9f7db6178200      4   [gping:223270]       validate_xmit_xfrm
0xffff9f7db6178200      4   [gping:223270]      dev_hard_start_xmit
0xffff9f7db6178200      4   [gping:223270]   skb_clone_tx_timestamp
0xffff9f7db6178200      4   [gping:223270]        __dev_forward_skb
0xffff9f7db6178200      4   [gping:223270]       __dev_forward_skb2
0xffff9f7db6178200      4   [gping:223270]         skb_scrub_packet
0xffff9f7db6178200      4   [gping:223270]           eth_type_trans
0xffff9f7db6178200      4   [gping:223270]               __netif_rx
0xffff9f7db6178200      4   [gping:223270]        netif_rx_internal
0xffff9f7db6178200      4   [gping:223270]       enqueue_to_backlog
0xffff9f7db6178200      4   [gping:223270]      __netif_receive_skb
0xffff9f7db6178200      4   [gping:223270] __netif_receive_skb_one_core
0xffff9f7db6178200      4   [gping:223270]             tcf_classify
0xffff9f7db6178200      4   [gping:223270]      skb_ensure_writable
0xffff9f7db6178200      4   [gping:223270]                   ip_rcv
0xffff9f7db6178200      4   [gping:223270]              ip_rcv_core
0xffff9f7db6178200      4   [gping:223270]               sock_wfree
0xffff9f7db6178200      4   [gping:223270]             nf_hook_slow
0xffff9f7db6178200      4   [gping:223270]     ip_route_input_noref
0xffff9f7db6178200      4   [gping:223270]      ip_route_input_slow
0xffff9f7db6178200      4   [gping:223270]          __mkroute_input
0xffff9f7db6178200      4   [gping:223270]      fib_validate_source
0xffff9f7db6178200      4   [gping:223270]    __fib_validate_source
0xffff9f7db6178200      4   [gping:223270]               ip_forward
0xffff9f7db6178200      4   [gping:223270]             nf_hook_slow
0xffff9f7db6178200      4   [gping:223270]        ip_forward_finish
0xffff9f7db6178200      4   [gping:223270]                ip_output
0xffff9f7db6178200      4   [gping:223270]             nf_hook_slow
0xffff9f7db6178200      4   [gping:223270]    apparmor_ip_postroute
0xffff9f7db6178200      4   [gping:223270]         ip_finish_output
0xffff9f7db6178200      4   [gping:223270]       __ip_finish_output
0xffff9f7db6178200      4   [gping:223270]        ip_finish_output2
0xffff9f7db6178200      4   [gping:223270]         __dev_queue_xmit
0xffff9f7db6178200      4   [gping:223270]       qdisc_pkt_len_init
0xffff9f7db6178200      4   [gping:223270]             tcf_classify
0xffff9f7db6178200      4   [gping:223270]      skb_ensure_writable
0xffff9f7db6178200      4   [gping:223270]      skb_ensure_writable
0xffff9f7db6178200      4   [gping:223270]      skb_ensure_writable
0xffff9f7db6178200      4   [gping:223270]      skb_ensure_writable
0xffff9f7db6178200      4   [gping:223270]      skb_ensure_writable
0xffff9f7db6178200      4   [gping:223270] kfree_skb_reason(SKB_DROP_REASON_TC_EGRESS)
0xffff9f7db6178200      4   [gping:223270]   skb_release_head_state
0xffff9f7db6178200      4   [gping:223270]         skb_release_data
0xffff9f7db6178200      4   [gping:223270]            skb_free_head
0xffff9f7db6178200      4   [gping:223270]             kfree_skbmem
^C2024/05/14 17:13:40 Received signal, exiting program..
2024/05/14 17:13:40 Detaching kprobes...
5 / 5 [----------------------------------------------------------------------------------------] 100.00% 20 p/s
~/gh/cilium$ 

tehnerd avatar May 14 '24 17:05 tehnerd

and sending an IPIP (IPv4-in-IPv4) packet from the dev server to the k8s pod which is running with kind on the same dev server (Cilium is installed on that cluster, with the default config as generated during make kind from the Cilium dev docs)

tehnerd avatar May 14 '24 17:05 tehnerd

pwru.txt (pwru with more flags):

sudo ../pwru/pwru 'proto 4' --output-tuple --output-stack --output-skb --output-meta  --output-file /tmp/pwru.txt

tehnerd avatar May 14 '24 17:05 tehnerd

The generated packet was:

outer destination of ipip: 10.244.1.205 
inner destination of ipip: 10.244.1.205
inner source of ipip: 192.168.14.14 
outer source of ipip: 10.11.12.13
sport 31337 
dport 80

tehnerd avatar May 14 '24 17:05 tehnerd

So with a patch like:

        if (ct_buffer.ret < 0)                                                  \
-               return drop_for_direction(ctx, DIR, ct_buffer.ret, ext_err);    \
+               return drop_for_direction(ctx, DIR, -ct_buffer.ret, ext_err);   \
        if (map_update_elem(&CT_TAIL_CALL_BUFFER4, &zero, &ct_buffer, 0) < 0)   \

the issue, it seems, is that ct_buffer.ret is an int while drop_for_direction expects an unsigned value, so the negative (two's-complement) return code gets mangled when it is converted to unsigned.
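
As a quick editorial illustration of that conversion (standalone C, not Cilium code; it assumes the drop reason ultimately lands in an 8-bit field, which is what would turn -137 into 119):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	int ct_ret = -137;                 /* DROP_CT_UNKNOWN_PROTO as a negative return code, value from the discussion above */
	uint8_t reason = (uint8_t)ct_ret;  /* hypothetical 8-bit drop-reason field (assumption, not taken from Cilium sources) */

	/* two's-complement wrap-around: -137 mod 256 == 119 */
	printf("%d -> %u\n", ct_ret, (unsigned int)reason);
	return 0;
}

This matches the earlier observation that 119 = 256 - 137.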

With that patch, I've got:

xx drop (CT: Unknown L4 protocol) flow 0x0 to endpoint 3294, ifindex 3, file bpf_lxc.c:248, , identity world-ipv4->50250: 192.168.14.14:31337 -> 10.244.1.205:80 tcp SYN
xx drop (CT: Unknown L4 protocol) flow 0x0 to endpoint 3294, ifindex 3, file bpf_lxc.c:248, , identity world-ipv4->50250: 192.168.14.14:31337 -> 10.244.1.205:80 tcp SYN
xx drop (CT: Unknown L4 protocol) flow 0x0 to endpoint 3294, ifindex 3, file bpf_lxc.c:248, , identity world-ipv4->50250: 192.168.14.14:31337 -> 10.244.1.205:80 tcp SYN

as expected. But the question is: is there any config option for Cilium to pass IPIP (i.e. conntrack should check against the inner packet, not the outer IPIP header)? I thought it was supported.

tehnerd avatar May 14 '24 19:05 tehnerd

So I made this work by recalculating the offsets so that the lookup uses the inner IPv4 header and transport ports. But I have no idea what this could possibly break, so I wonder who could give us more info on how IPIP is supposed to be processed on the ingress side.

tehnerd avatar May 15 '24 01:05 tehnerd

Changes which made this work (for IPv4; this is just to continue the discussion on what to do with IPIP - maybe there is a config option which allows the same, i.e. allowing ingress IPIP into a pod managed by Cilium?). A standalone sketch of the same idea follows the diff.

        __u32 zero = 0;                                                         \
-       void *map;                                                              \
-                                                                               \
+       void *map;                                                                      \
+       int off;                                                        \
        ct_state = (struct ct_state *)&ct_buffer.ct_state;                      \
        tuple = (struct ipv4_ct_tuple *)&ct_buffer.tuple;                       \
                                                                                \
        if (!revalidate_data(ctx, &data, &data_end, &ip4))                      \
                return drop_for_direction(ctx, DIR, DROP_INVALID, ext_err);     \
                                                                                \
+       off = ETH_HLEN;                                 \
        tuple->nexthdr = ip4->protocol;                                         \
+       if (tuple->nexthdr == IPPROTO_IPIP) { \
+               printk("IPIP\n"); \
+               off  = off + 20; \
+               if (!revalidate_data_l3_off(ctx, &data, &data_end, &ip4, off))  {               \
+                       printk("drop ipip with invalid size\n"); \
+                       return drop_for_direction(ctx, DIR, DROP_INVALID, ext_err);     \
+               } \
+               tuple->nexthdr = ip4->protocol;                                         \
+       } \
        tuple->daddr = ip4->daddr;                                              \
        tuple->saddr = ip4->saddr;                                              \
-       ct_buffer.l4_off = ETH_HLEN + ipv4_hdrlen(ip4);                         \
+       ct_buffer.l4_off = off + ipv4_hdrlen(ip4);                              \
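
The macro diff above is dense, so here is a minimal standalone sketch of the same idea in plain C (an editorial illustration under the same no-outer-options assumption as the diff; ct_l4_offset and the parsing below are illustrative only, not Cilium helpers):

#include <linux/if_ether.h>   /* ETH_HLEN */
#include <netinet/ip.h>       /* struct iphdr */
#include <netinet/in.h>       /* IPPROTO_IPIP, IPPROTO_TCP */
#include <stddef.h>
#include <stdio.h>

/* Return the offset of the L4 header to use for the conntrack lookup,
 * descending one level into an IPIP tunnel if present; -1 if truncated. */
static int ct_l4_offset(const unsigned char *pkt, size_t len)
{
	size_t off = ETH_HLEN;
	const struct iphdr *ip4;

	if (len < off + sizeof(*ip4))
		return -1;
	ip4 = (const struct iphdr *)(pkt + off);

	if (ip4->protocol == IPPROTO_IPIP) {
		off += ip4->ihl * 4;                      /* skip the outer IPv4 header */
		if (len < off + sizeof(*ip4))
			return -1;
		ip4 = (const struct iphdr *)(pkt + off);  /* inner IPv4 header */
	}

	/* the CT tuple (nexthdr/saddr/daddr) would now be taken from this ip4 */
	return (int)(off + ip4->ihl * 4);
}

int main(void)
{
	unsigned char pkt[ETH_HLEN + 2 * sizeof(struct iphdr) + 20] = {0};
	struct iphdr *outer = (struct iphdr *)(pkt + ETH_HLEN);
	struct iphdr *inner = outer + 1;

	outer->ihl = 5; outer->protocol = IPPROTO_IPIP;
	inner->ihl = 5; inner->protocol = IPPROTO_TCP;

	/* expect 14 + 20 + 20 = 54 */
	printf("l4 offset = %d\n", ct_l4_offset(pkt, sizeof(pkt)));
	return 0;
}

The diff above does the equivalent in the datapath with revalidate_data_l3_off() and a fixed 20-byte outer header.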
                                                    

tehnerd avatar May 15 '24 04:05 tehnerd

I think @borkmann already has an implementation for this, but we have a tunnel iface on each node which we pass as --device along with eth0 😉

oblazek avatar May 21 '24 17:05 oblazek

Hi @tehnerd, great to see you here! :) Do you expect the inbound LB traffic to be terminated in the hostns of the Cilium nodes? Some time ago I added https://github.com/cilium/cilium/pull/31213, which just sets up an ipip device to do exactly that.

borkmann avatar May 21 '24 20:05 borkmann

Hey, Daniel! In our setup each container (aka pod) has a tunl interface in its namespace, so we terminate IPIP there.

tehnerd avatar May 21 '24 20:05 tehnerd

Hey, Daniel! In our setup each container (aka pod) has a tunl interface in its namespace, so we terminate IPIP there.

Ok, so that is currently not supported and Cilium needs to be extended. I had some old code in https://github.com/cilium/cilium/pull/30547/commits for extracting the inner tuple for the service lookup; maybe it can be of help, or a diff like the one above could be properly cooked into a patch.

borkmann avatar May 22 '24 06:05 borkmann

Yeah. I think recalculating the offset as I proposed above seems easier. And in our internal setup it actually works as expected (at least all FW features seem to work as expected on inner packets). OK, I think I will put something together in a few weeks; I just need to run more internal tests to make sure nothing else is required.

tehnerd avatar May 22 '24 06:05 tehnerd

I think I am facing the same or a related issue:

As part of organization policy, we use https://github.com/facebookincubator/katran as part of our edge fabric. I've managed to get traffic from Katran to an ingress pod with hostNetwork: true and the following setup on the host:

#!/bin/bash
# Pick the default uplink device and gateways, and lower the advertised TCP MSS
# to MTU-100 (presumably to leave headroom for the tunnel encapsulation).
DEVICE=`ip route get 10.1.1.1 | grep dev | awk '{ print $5 }'`
GWIP=`ip route | grep default | awk '{ print $3 }'`
GWIP6=`ip -6 route | grep default | awk '{ print $3 }'`
MSS=`ip link show $DEVICE | grep mtu | awk '{ print $5-100 }'`

ip route change default via $GWIP advmss $MSS
ip -6 route change default via $GWIP6 advmss $MSS

# Bind the Katran VIP locally so decapsulated traffic addressed to it is accepted.
ip addr add 10.15.12.33 dev lo
route add -host 10.15.12.33 dev lo

# Tunnel devices that terminate the IPIP / IPv6 tunnel traffic from Katran.
ip link add name ipip0 type ipip external
ip link set up ipip0
ip link set up tunl0
ip link add name ipip60 type ip6tnl external
ip link set up dev ipip60
ip link set up dev ip6tnl0

So, when I use the node IP as the real (upstream, in Katran terms) and 10.15.12.33 as the VIP, traffic correctly routes to the ingress pod.

Using hostNetwork is undesirable for scaling reasons, so my goal is to make this work with the Cilium LB. For this setup I've dropped the VIP from lo and created the following Service:

apiVersion: v1
kind: Service
metadata:
  name: katran
  namespace: istio-system
  labels:
    ...
  annotations:
    "io.cilium/lb-ipam-ips": "10.15.12.33,someIPv6Address"
spec:
  selector:
    app: istio-ingressgateway
    istio: ingressgateway
  ports:
    - name: status-port
      protocol: TCP
      port: 15021
      targetPort: 15021
    - name: http2
      protocol: TCP
      port: 80
      targetPort: 8080
    - name: https
      protocol: TCP
      port: 443
      targetPort: 8443
  type: LoadBalancer
  externalTrafficPolicy: Local
  ipFamilies:
    - IPv4
    - IPv6
  ipFamilyPolicy: PreferDualStack
  allocateLoadBalancerNodePorts: false
  internalTrafficPolicy: Cluster

Both LB IPs are announced by a BGP policy, but the VIP (10.15.12.33) is denied by the router, thus not conflicting with Katran's advertisements. Now when I change Katran's real to someIPv6Address, I can observe incoming traffic in tcpdump and cilium-dbg monitor.

19:04:03.194239 IP6 (flowlabel 0x35156, hlim 5, next-header IPIP (4) payload length: 60) SrcIpv6::2 > LbIPb6: IP (tos 0x0, ttl 64, id 6251, offset 0, flags [DF], proto TCP (6), length 60)
    10.254.0.253.56383 > 10.15.12.33.http: Flags [S], cksum 0xd893 (correct), seq 4135035203, win 42340, options [mss 1460,sackOK,TS val 3985720831 ecr 0,nop,wscale 8], length 0
cilium-dbg monitor | grep 10.15.12.33
Listening for events on 96 CPUs with 64x4096 of shared memory
Press Ctrl-C to quit
time="2024-06-17T17:18:39Z" level=info msg="Initializing dissection cache..." subsys=monitor
-> network flow 0x34520aa0 , identity unknown->unknown state unknown ifindex ens13f0np0 orig-ip 0.0.0.0: 10.254.0.253:38485 -> 10.15.12.33:80 tcp SYN
-> network flow 0x34520aa0 , identity unknown->unknown state unknown ifindex ens13f0np0 orig-ip 0.0.0.0: 10.254.0.253:38485 -> 10.15.12.33:80 tcp SYN
-> network flow 0x34520aa0 , identity unknown->unknown state unknown ifindex ens13f0np0 orig-ip 0.0.0.0: 10.254.0.253:38485 -> 10.15.12.33:80 tcp SYN
-> network flow 0x34520aa0 , identity unknown->unknown state unknown ifindex ens13f0np0 orig-ip 0.0.0.0: 10.254.0.253:38485 -> 10.15.12.33:80 tcp SYN
-> network flow 0x34520aa0 , identity unknown->unknown state unknown ifindex ens13f0np0 orig-ip 0.0.0.0: 10.254.0.253:38485 -> 10.15.12.33:80 tcp SYN
cilium-dbg service list | grep 10.15.12.33
507   10.15.12.33:15021                                  LoadBalancer   1 => 172.16.17.215:15021 (active)               
508   10.15.12.33:80                                     LoadBalancer   1 => 172.16.17.215:8080 (active)                
509   10.15.12.33:443                                    LoadBalancer   1 => 172.16.17.215:8443 (active)                
519   10.15.12.33:15021/i                                LoadBalancer   1 => 172.16.17.215:15021 (active)               
520   10.15.12.33:80/i                                   LoadBalancer   1 => 172.16.17.215:8080 (active)                
521   10.15.12.33:443/i                                  LoadBalancer   1 => 172.16.17.215:8443 (active)      
cilium-dbg endpoint list | grep 172.16.17.215
1319       Disabled           Disabled          95584      k8s:app=istio-ingressgateway    PodIpv6   172.16.17.215   ready

There are no drops in cilium-dbg monitor output, but no traffic reaches the pod.

YakhontovYaroslav avatar Jun 17 '24 17:06 YakhontovYaroslav

@borkmann, may I ask you why https://github.com/cilium/cilium/pull/30547 did not get merged? As far as I understand, 'external Cilium in L4LB mode' works exactly like Katran in our case. IPIP termination in the worker node NS is also exactly the use case we are planning to use as part of our flow.

So, the complete flow that we are expecting is:

  • ClientIPv4 -> Katran VIPv4
  • KatranNodeIPv6 -> LoadBalancer IPv6 [inner ClientIPv4 -> VIPv4, where VIPv4 can also be defined in the Service spec if necessary]
  • the k8s node terminates IPIP/IP6IP6/IP6IP and gets ClientIPv4 -> VIPv4
  • traffic flows as it would for a regular LB IP to the Pod and then gets sent back to the client as VIPv4 -> ClientIPv4 (DSR)

Thanks.

YakhontovYaroslav avatar Jun 17 '24 20:06 YakhontovYaroslav

So with a patch like:

        if (ct_buffer.ret < 0)                                                  \
-               return drop_for_direction(ctx, DIR, ct_buffer.ret, ext_err);    \
+               return drop_for_direction(ctx, DIR, -ct_buffer.ret, ext_err);   \
        if (map_update_elem(&CT_TAIL_CALL_BUFFER4, &zero, &ct_buffer, 0) < 0)   \

the issue, it seems, is that ct_buffer.ret is an int while drop_for_direction expects an unsigned value, so the negative (two's-complement) return code gets mangled when it is converted to unsigned.

FYI, I extracted this part into https://github.com/cilium/cilium/pull/33551. Thanks for tracing it down, @tehnerd!

julianwiedmann avatar Jul 03 '24 07:07 julianwiedmann

Thanks. We are mostly done with testing, and I'm planning to upstream our internal patches in the next few weeks. One less thing to do :)

tehnerd avatar Jul 03 '24 14:07 tehnerd

@julianwiedmann @tehnerd Is this issue resolved, should we close it?

joestringer avatar Sep 05 '24 00:09 joestringer

The 119 drop reason is resolved. IPIP is still not supported upstream, but we have some internal patches and are testing them with v4, and things look promising right now. Hopefully I will upstream them at some point. We can close the issue.

tehnerd avatar Sep 07 '24 02:09 tehnerd

Hey, Daniel! In our setup each container (aka pod) has a tunl interface in its namespace, so we terminate IPIP there.

@tehnerd We have a similar setup, by the way - each pod has a tunnel interface, and the termination is done there. We are considering switching from Calico to Cilium, and in the scope of a PoC the same issue was raised - Cilium drops IPIP packets from the external load balancer (IPVS).

xx drop (CT: Unknown L4 protocol) flow 0x0 to endpoint 143, ifindex 7, file bpf_lxc.c:263, , identity remote-node->26173: *.*.*.*:61172 -> *.*.*.*:8080 tcp SYN

Are there any updates on your ongoing work on resolving this issue?

mpelekh avatar Feb 20 '25 14:02 mpelekh

We have a custom patch and it works just fine for us. I just need to find time to do proper upstreaming... Honestly, with the amount of other work, it is most likely not going to happen soon.

tehnerd avatar Feb 20 '25 15:02 tehnerd

We have a custom patch and it works just fine for us. I just need to find time to do proper upstreaming... Honestly, with the amount of other work, it is most likely not going to happen soon.

Gotcha. Thanks for your response, @tehnerd. Do you at least plan to share the patch so we can try it out?

mpelekh avatar Feb 20 '25 16:02 mpelekh

Yeah. I will post it a bit later. We made it work for IPv4, but it is trivial to do the same for v6 if required. The overall idea is that if we receive IPIP, we apply the lookups and FW rules to the inner packet.

tehnerd avatar Feb 20 '25 16:02 tehnerd

https://gist.github.com/tehnerd/f217e4ebf75d08f8d6dc3ffee4392ae4 on top of 1.15 version

tehnerd avatar Feb 20 '25 19:02 tehnerd

https://gist.github.com/tehnerd/f217e4ebf75d08f8d6dc3ffee4392ae4 on top of 1.15 version

Thank you @tehnerd. I will give it a try.

mpelekh avatar Feb 21 '25 08:02 mpelekh

https://gist.github.com/tehnerd/f217e4ebf75d08f8d6dc3ffee4392ae4 on top of 1.15 version

@tehnerd I verified the patch you shared, and it works as expected. With it, all the lookups and firewall rules are applied to the inner packet; the IPIP tunnel is terminated in the Pod's NS.

The only change that caused connections between pods to time out was the caching you added, so I reverted it and everything works well.

@@ -514,6 +514,7 @@ resolve_srcid_ipv4(struct __ctx_buff *ctx, struct iphdr *ip4,
                   const bool from_host)
 {
        __u32 src_id = WORLD_IPV4_ID, srcid_from_ipcache = srcid_from_proxy;
+       bool cache_entry_found = false;
        struct remote_endpoint_info *info = NULL;
 
        /* Packets from the proxy will already have a real identity. */
@@ -531,16 +532,20 @@ resolve_srcid_ipv4(struct __ctx_buff *ctx, struct iphdr *ip4,
                                 * the host. So we can ignore the ipcache if it
                                 * reports the source as HOST_ID.
                                 */
-                               if (*sec_identity != HOST_ID)
+                               if (*sec_identity != HOST_ID) {
+                                       cache_entry_found = true;
                                        srcid_from_ipcache = *sec_identity;
+                               }
                        }
                }
                cilium_dbg(ctx, info ? DBG_IP_ID_MAP_SUCCEED4 : DBG_IP_ID_MAP_FAILED4,
                           ip4->saddr, srcid_from_ipcache);
+
        }
-
-       if (from_host)
+
+       if (from_host || cache_entry_found) {
                src_id = srcid_from_ipcache;
+       }

mpelekh avatar Mar 03 '25 13:03 mpelekh