
Server nodes behind NAT, pod networking is broken

Open NandoTheessen opened this issue 1 year ago • 8 comments

Environmental Info: K3s Version:

k3s version v1.29.3+k3s1 (8aecc26b) go version go1.21.8

Node(s) CPU architecture, OS, and Version: arm64 and amd64, Ubuntu 22.04

Cluster Configuration: 3 servers on a software-defined network behind a NAT with a public IP, plus 3 agents with public IPs

Flannel backend is wireguard-native, the required ports are forwarded through the NAT, and the server nodes are started with the external IP flag set to the NAT's public IP.
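
For this topology, the ports that typically need to be forwarded through the NAT are roughly the following (a sketch based on the standard K3s networking requirements; adjust to your setup):

  6443/tcp       # Kubernetes API server / K3s supervisor (agents -> servers)
  51820/udp      # flannel wireguard-native, IPv4 (all nodes <-> all nodes)
  51821/udp      # flannel wireguard-native, IPv6 (only if dual-stack is used)
  10250/tcp      # kubelet (needed for kubectl logs/exec, all nodes <-> all nodes)
  2379-2380/tcp  # embedded etcd peer traffic (servers <-> servers)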

Describe the bug: Multiple issues:

Pods on the agent nodes can't reach pods on the server nodes. Logs can't be fetched from pods on the server nodes due to:

➜ ~ kubectl -n kube-system logs metallb-speaker-gnc5j Error from server: Get "https://<public-ip-nat>:10250/containerLogs/kube-system/metallb-speaker-gnc5j/metallb-speaker": proxy error from 127.0.0.1:6443 while dialing <public-ip-nat>:10250, code 502: 502 Bad Gateway

Steps To Reproduce:

  • Installed K3s: curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="server" sh -s - --flannel-backend wireguard-native --token <token> --disable servicelb --write-kubeconfig-mode 644 --node-external-ip <public-ip-nat> --flannel-external-ip --disable traefik
  • Installed a Helm chart for MetalLB, which deploys a DaemonSet for the speakers
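
For reference, the same server flags can be expressed as a K3s config file instead of INSTALL_K3S_EXEC (a sketch using the default /etc/rancher/k3s/config.yaml path; the placeholder values are taken from the command above):

  # /etc/rancher/k3s/config.yaml
  flannel-backend: wireguard-native
  flannel-external-ip: true
  token: <token>
  write-kubeconfig-mode: "0644"
  node-external-ip: <public-ip-nat>
  disable:
    - servicelb
    - traefik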

Expected behavior: The pods are able to communicate with each other, and I'm able to get logs from all pods.

Actual behavior: As described in "bug", pods can't talk to each other and I can't get logs from pods on the server nodes.

Additional context / logs: Which logs would help? Happy to supply whatever is needed

NandoTheessen avatar Apr 23 '24 13:04 NandoTheessen

Did you set --node-ip and --node-external-ip to the correct values for each of the agents, or just the servers?

Based on the information you shared, it sounds like the apiserver is trying to connect to the kubelet's external IP to get logs. Normally it would connect to the public IP using the agent tunnel, so I suspect that the internal and external IPs are not being set properly.
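
A quick way to confirm which internal and external addresses each node actually registered (a diagnostic sketch):

  kubectl get nodes -o wide                                       # INTERNAL-IP and EXTERNAL-IP columns
  kubectl get node <node-name> -o jsonpath='{.status.addresses}'  # raw address list for a single node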

brandond avatar Apr 23 '24 19:04 brandond

Thanks for the help Brandon!

For the servers, node-ip defaults to the private IP (I believe), which would be 192.x.x.x. The agents don't have these set, as they only have public IP addresses, which are used as the nodes' IP addresses.

Should I set node-ip and node-external-ip specifically to their public addresses?

This is the current setup:

Server node 1: node-ip not set, node-external-ip set to the NAT gateway
Server node 2: node-ip not set, node-external-ip set to the NAT gateway
Server node 3: node-ip not set, node-external-ip set to the NAT gateway

Agent 1: node-ip and node-external-ip not set; the node only has a public IP
Agent 2: node-ip and node-external-ip not set; the node only has a public IP
Agent 3: node-ip and node-external-ip not set; the node only has a public IP

NandoTheessen avatar Apr 24 '24 07:04 NandoTheessen

Related to #7355, I think. Based on the comment https://github.com/k3s-io/k3s/issues/7355#issuecomment-1523635066, I'm still unable to get it working right.

tdtgit avatar Apr 25 '24 06:04 tdtgit

Thanks for linking that issue, @tdtgit! I don't think it is the same, but it helped me identify the issue a little better. Mind you, I'm not entirely sure whether what I'm trying to achieve is even possible; concretely, this is where I'm doubtful:

Since I only have one NAT gateway, I only have one public IP address. So this is my server config:

MY_EXTERNAL_IP=80.xxx.xxx.xxx
server 1: --node-ip 192.168.88.2 --node-external-ip ${MY_EXTERNAL_IP} --flannel-external-ip
server 2: --node-ip <internal-ip> --node-external-ip ${MY_EXTERNAL_IP} --flannel-external-ip
server 3: --node-ip <internal-ip> --node-external-ip ${MY_EXTERNAL_IP} --flannel-external-ip

My agent config:

agent 1: --node-ip 80.xxx.xxx.xxx --node-external-ip 80.xxx.xxx.xxx --server https://80.xxx.xxx.xxx:6443
agent 2: --node-ip 80.xxx.xxx.xxx --node-external-ip 80.xxx.xxx.xxx --server https://80.xxx.xxx.xxx:6443
agent 3: --node-ip 80.xxx.xxx.xxx --node-external-ip 80.xxx.xxx.xxx --server https://80.xxx.xxx.xxx:6443

Here is some additional information:

  • Pods deployed on server 1 through server 3 are not able to contact services (e.g. the Kubernetes API)
  • When running wg show I get this output (server 3):
interface: flannel-wg
  public key: xxxx
  private key: (hidden)
  listening port: 51820

peer: xxxx
  endpoint: 80.xxx.xxx.xxx:51820
  allowed ips: 10.42.3.0/24
  latest handshake: 41 seconds ago
  transfer: 1.96 KiB received, 764 B sent
  persistent keepalive: every 25 seconds

peer: xxxx
  endpoint: 80.xxx.xxx.xxx:51820
  allowed ips: 10.42.4.0/24
  latest handshake: 1 minute, 27 seconds ago
  transfer: 1.23 KiB received, 1.27 KiB sent
  persistent keepalive: every 25 seconds

peer: xxxx
  endpoint: 80.xxx.xxx.xxx:51820
  allowed ips: 10.42.5.0/24
  latest handshake: 1 minute, 38 seconds ago
  transfer: 556.30 KiB received, 282.04 KiB sent
  persistent keepalive: every 25 seconds

peer: xxxx
  endpoint: 80.xxx.xxx.xxx:51820
  allowed ips: 10.42.0.0/24
  transfer: 0 B received, 17.63 KiB sent
  persistent keepalive: every 25 seconds

What I can see from this is that server 3 has only peered with one other server instead of two. We're missing a peer here, and I assume that is related to the NAT gateway forwarding all traffic to one server (server 1).
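
If it helps, the endpoint that flannel advertises for each node is stored as node annotations, so you can check what each peer is actually told to dial (a diagnostic sketch; the annotation names come from flannel's kube subnet manager):

  # shows flannel.alpha.coreos.com/public-ip, backend-type, backend-data for a node
  kubectl describe node <node-name> | grep -i flannel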

NandoTheessen avatar Apr 26 '24 08:04 NandoTheessen

I've indeed managed to fix this by assigning public IPs to all of my servers. I have one last issue that persists though.

I can read the logs from all pods except the ones on server2 and server3; there I receive a "502: Bad Gateway" error. I'm sure this has been spotted in the wild before, could you give me some pointers?
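
The proxy error indicates the apiserver goes through the local supervisor on 127.0.0.1:6443 and then fails to dial the kubelet on port 10250. A quick reachability check from the server you run kubectl against (a diagnostic sketch):

  kubectl get nodes -o wide              # note the addresses registered for server2 and server3
  curl -vk https://<server2-ip>:10250/   # any HTTP response (e.g. 401) means the kubelet port is reachable; a timeout means it is blocked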

NandoTheessen avatar Apr 26 '24 15:04 NandoTheessen

> I've indeed managed to fix this by assigning public IPs to all of my servers. I have one last issue that persists though.
>
> I can read the logs from all pods except the ones on server2 and server3; there I receive a "502: Bad Gateway" error. I'm sure this has been spotted in the wild before, could you give me some pointers?

But some of my agent nodes are behind NAT and don't have public IPs. What should I do?

liguobao avatar May 31 '24 05:05 liguobao

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 45 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.

github-actions[bot] avatar Jul 15 '24 20:07 github-actions[bot]

My edge nodes do not have public IP addresses; it seems there is a problem with the WireGuard setup that k3s builds for me.

peer: 6e/mB
  endpoint: PubIP:51820
  allowed ips: 10.42.6.0/24
  transfer: 0 B received, 1.02 MiB sent
  persistent keepalive: every 25 seconds

peer: AaX5oW4pt1X
  endpoint: 192.168.60.142:51820
  allowed ips: 10.42.8.0/24
  transfer: 0 B received, 599.95 KiB sent
  persistent keepalive: every 25 seconds
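
For what it's worth, "transfer: 0 B received" on a peer usually means the WireGuard handshake never completes, which points at the advertised endpoint not being reachable from that node. Two quick checks (a diagnostic sketch):

  wg show flannel-wg latest-handshakes   # a 0 timestamp means no handshake with that peer yet
  wg show flannel-wg endpoints           # which address:port this node tries to reach for each peer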

liguobao avatar Jul 28 '24 16:07 liguobao

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 45 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.

github-actions[bot] avatar Sep 12 '24 20:09 github-actions[bot]