Host services become available on ClusterIPs
See: https://github.com/kubernetes/kubernetes/issues/72236
It's the same problem in kube-router, and the same solution applies.
I've played a bit with the solution proposed in the kubernetes issue, but unfortunately, when running kube-router with --run-router, routing breaks the second you delete the route from the local routing table.
The upstream router tries to send traffic to the hosts where kube-router is running, but the hosts reply with ICMP redirects.
E.g. given the service IP 10.205.210.10, pinging it from the router:
~# ping 10.205.210.10
...
64 bytes from 10.205.210.10: icmp_seq=208 ttl=64 time=0.141 ms
64 bytes from 10.205.210.10: icmp_seq=209 ttl=64 time=0.075 ms
64 bytes from 10.205.210.10: icmp_seq=210 ttl=64 time=0.101 ms
# running: ip route del table local 10.205.210.10/32
From 10.205.161.10: icmp_seq=212 Redirect Host(New nexthop: 10.205.160.1)
From 10.205.161.10: icmp_seq=213 Redirect Host(New nexthop: 10.205.160.1)
From 10.205.161.10: icmp_seq=214 Redirect Host(New nexthop: 10.205.160.1)
...
In general, even when not using kube-router for BGP, removing the routes from the local table breaks ICMP.
@asteven Good observation.
But I think ICMP echo just happens to work because of the very problem described in this issue, that is: for ICMP echo (ping) it is not the "real servers" that respond but the local machine (which it really shouldn't).
An interesting test would be whether "connection related" ICMPs are handled correctly, like "port unreachable" for instance. I will try to run some tests.
If connection-related ICMPs work (which I think they do), I don't think ICMP echo (ping) has to work. In k8s with kube-proxy in "iptables" mode (the default mode), "ping" does not work either, so this is the "normal" behavior.
@uablrek you are right that testing with ICMP echo is not relevant here. It was just the first thing I tried when I noticed that something was broken.
However, given the following test setup:
+--------+
| router |
+--------+
| |
+--------+ +--------+
| k8s-01 | | client |
+--------+ +--------+
I have an HTTP service (and pod) running on k8s-01 which is accessed by the client host. I can tcpdump the traffic going from the client via the router to k8s-01 and into IPVS; everything works just fine.
E.g. here are the first packets of an HTTP request from client to service to pod (tcpdump'ed on k8s-01):
00:28:59.803058 IP 10.205.4.14.44820 > 10.205.210.10.80: Flags [S], seq 2386296295, win 26880, options [mss 8960,sackOK,TS val 2804202327 ecr 0,nop,wscale 7], length 0
00:28:59.803095 IP 10.205.161.12.44820 > 10.205.240.48.8080: Flags [S], seq 2386296295, win 26880, options [mss 8960,sackOK,TS val 2804202327 ecr 0,nop,wscale 7], length 0
00:28:59.803232 IP 10.205.240.48.8080 > 10.205.161.12.44820: Flags [S.], seq 3783938709, ack 2386296296, win 28960, options [mss 1460,sackOK,TS val 1763007046 ecr 2804202327,nop,wscale 7], length 0
00:28:59.803251 IP 10.205.210.10.80 > 10.205.4.14.44820: Flags [S.], seq 3783938709, ack 2386296296, win 28960, options [mss 1460,sackOK,TS val 1763007046 ecr 2804202327,nop,wscale 7], length 0
If I delete the entry for the service IP from the local table on k8s-01, the incoming traffic still reaches the k8s-01 host, but IPVS no longer handles it.
E.g. the same HTTP request as before from client to service goes nowhere:
00:32:37.191269 IP 10.205.4.14.44824 > 10.205.210.10.80: Flags [S], seq 1817232313, win 26880, options [mss 8960,sackOK,TS val 2804419715 ecr 0,nop,wscale 7], length 0
00:32:37.191297 IP 10.205.4.14.44824 > 10.205.210.10.80: Flags [S], seq 1817232313, win 26880, options [mss 8960,sackOK,TS val 2804419715 ecr 0,nop,wscale 7], length 0
00:32:38.192480 IP 10.205.4.14.44824 > 10.205.210.10.80: Flags [S], seq 1817232313, win 26880, options [mss 8960,sackOK,TS val 2804420716 ecr 0,nop,wscale 7], length 0
00:32:38.192494 IP 10.205.4.14.44824 > 10.205.210.10.80: Flags [S], seq 1817232313, win 26880, options [mss 8960,sackOK,TS val 2804420716 ecr 0,nop,wscale 7], length 0
The client's HTTP request times out.
The same request still works from k8s-01 itself, but not from the client host:
[root@client ~]# curl --connect-timeout 2 10.205.210.10
curl: (28) Connection timed out after 2001 milliseconds
[root@client ~]#
root@k8s-01:~# curl --connect-timeout 2 10.205.210.10
Hello, world!
Version: 1.0.0
Hostname: hello-deployment-576f797599-6ppz2
root@k8s-01:~#
@asteven
Yes, I get the same. I had only tried to access the ClusterIP from the main netns on a node, and as you say, that works. But when the packet comes from another address (or is forwarded?) it does not work.
I re-created the problem using a setup similar to yours, but used externalIPs instead of a ClusterIP. I also suspect that traffic from pods is not forwarded. If so, my proposal will not work.
I will test pod-to-pod as soon as I can.
Pod-to-pod traffic does not work when the local table entry is removed.
So I must withdraw my proposal. <sigh...>
I will try some variations though.
@asteven Thanks for checking up on this
It may be possible to run IPVS with a VIP-less director, using iptables PREROUTING fwmarking and IPVS fwmark services. That might not require having the VIPs on a local interface, which would solve this issue. But I'm not sure about the details, or whether it works in all cases.
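A rough sketch of what that could look like, reusing the service IP and pod address from the examples above (the mark value and scheduler are arbitrary assumptions, untested):
# mark traffic to the VIP in PREROUTING; the VIP itself is not assigned to any local interface
iptables -t mangle -A PREROUTING -d 10.205.210.10/32 -p tcp --dport 80 -j MARK --set-mark 1
# define the IPVS service by firewall mark instead of by VIP, and add the pod as a real server in NAT mode
ipvsadm -A -f 1 -s rr
ipvsadm -a -f 1 -r 10.205.240.48:8080 -m
The packets would still have to be delivered locally for IPVS to pick them up, so this probably needs to be combined with something like the ip rule trick in the experiment below.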
JFYI: I'm working on PR #618 to fix #282, which is basically the same problem.
Another experiment based on a VIP-less director.
on the router
This is just to get the traffic from the client to k8s-01.
root@router:~# vip=10.205.210.99
root@router:~# k8s_01=10.205.161.10
root@router:~# ip route add $vip/32 via $k8s_01
root@router:~# ip route show | grep $vip
10.205.210.99 via 10.205.161.10 dev nic0
root@router:~#
on k8s-01
root@k8s-01:~# vip=10.205.210.99
root@k8s-01:~# ip rule add prio 100 to $vip table 100
root@k8s-01:~# ip route add local 0/0 dev lo table 100
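# traffic to $vip now matches rule 100 and hits this catch-all local route, so it is delivered locally even though the VIP is not assigned to any interface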
# prove that the VIP is not configured locally
root@k8s-01:~# ip addr | grep $vip
root@k8s-01:~# ip route show table local | grep $vip
root@k8s-01:~#
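An IPVS service for the VIP also has to be created manually (see the note about the kube-router service proxy below); a minimal sketch, with the pod address taken from the tcpdump earlier and an arbitrary scheduler:
# define an IPVS service for the VIP and add the pod as a real server in NAT mode
ipvsadm -A -t 10.205.210.99:80 -s rr
ipvsadm -a -t 10.205.210.99:80 -r 10.205.240.48:8080 -m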
on the client
[root@client ~]# vip=10.205.210.99
[root@client ~]# curl --connect-timeout 2 $vip
Hello, world!
Version: 1.0.0
Hostname: hello-deployment-576f797599-6ppz2
[root@client ~]#
[root@client ~]# ssh -o ConnectTimeout=2 $vip
ssh: connect to host 10.205.210.99 port 22: Connection timed out
[root@client ~]#
Unfortunately, with this setup, connecting to the service from k8s-01 itself no longer works.
root@k8s-01:~# vip=10.205.210.99
root@k8s-01:~# curl --connect-timeout 2 $vip
curl: (7) Couldn't connect to server
root@k8s-01:~#
Note that if you're testing this while the kube-router service proxy is running, it will periodically delete the manually created IPVS service.
Not sure where to go from here. Also haven't tested pod-to-pod yet.
A solution based on the above experiments (possibly combined) could save us from writing quite a bit of boilerplate code, as local services would simply never be exposed. As far as I can tell it would work transparently for both IPv4 and IPv6.
But maybe the approach taken in #618 is better, because it is more explicit and less magical, and thus easier to understand/debug. Not sure.
iptables -I INPUT 1 -s [kube-bridge-cidr] -i kube-bridge -m set ! --match-set inet:KUBE-SVC-ALL dst,dst -j DROP
ebtables -I INPUT --logical-in kube-bridge --destination [bridgeip_address] -j DROP
or the IPv6 variant:
iptables -I INPUT 1 -j ACCEPT -p icmp6
iptables -I INPUT 2 -s [kube-bridge-cidr] -i kube-bridge -m set ! --match-set inet6:KUBE-SVC-ALL dst,dst -j DROP
ebtables -I INPUT -p ipv6 --logical-in kube-bridge --ip6-destination [bridgeipv6_address] -j DROP
@asteven Does #618 have any effect on this issue? Or does #618 only prevent access to host services via externalIPs, while ClusterIPs are still left open?
Please see: https://github.com/kubernetes/kubernetes/issues/72236#issuecomment-614122808
This replaces the dummy interface with routes in the "local" table. The beauty IMHO is that the entire ClusterIP range can be defined in just one route. The iptables filter rule will prevent host access.
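A minimal sketch of the idea, assuming a ClusterIP range of 10.205.210.0/24 and reusing the KUBE-SVC-ALL ipset name from the rules above:
# one route in the "local" table covers the entire ClusterIP range, instead of one dummy-interface address per VIP
ip route add local 10.205.210.0/24 dev lo table local
# anything in that range that is not a defined service would now be delivered to the host, so drop it before it reaches host services
iptables -I INPUT -d 10.205.210.0/24 -m set ! --match-set inet:KUBE-SVC-ALL dst,dst -j DROP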
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
This issue was closed because it has been stale for 5 days with no activity.