Default Pod CIDR seems to be 10.42.0.0/24
Describe the bug
I am running k3s version v0.7.0 (61bdd852) on a beefy machine with an increased pod limit. I reached the maximum number of pods (255), although the Pod CIDR is 10.42.0.0/16 according to the docs, so I'd expect to be able to run more than that.
To Reproduce
k3s server --kubelet-arg max-pods=500
for i in $(seq 300); do kubectl run --image=busybox busybox-$i; done
When creating the 255th Pod I got the following error:
0s Warning FailedCreatePodSandBox pod/tiller-deploy-795bdd79b-msd5l (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "54463c700e8bfad1eb641884bd364bad8c97b8999b4d29de00a7127b4eb9924f" network for pod "tiller-deploy-795bdd79b-msd5l": NetworkPlugin cni failed to set up pod "tiller-deploy-795bdd79b-msd5l" network: failed to allocate for range 0: no IP addresses available in range set: 10.42.0.1-10.42.0.254
And indeed, kubectl describe nodes reveals:
PodCIDR: 10.42.0.0/24
Non-terminated Pods: (255 in total)
Expected behavior
I am able to create more than 255 pods.
Additional context
I looked into the docs and was surprised to see the default is supposedly /16. I modified /etc/systemd/system/k3s.service with an explicit --cluster-cidr=10.42.0.0/16:
ubuntu@bw-dh01:~$ systemctl status k3s
● k3s.service - Lightweight Kubernetes
Loaded: loaded (/etc/systemd/system/k3s.service; enabled; vendor preset: enabled)
Active: active (running) since Thu 2019-08-01 12:55:36 UTC; 3min 36s ago
Docs: https://k3s.io
Process: 19896 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
Process: 19076 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
Main PID: 20490 (k3s-server)
Tasks: 0
CGroup: /system.slice/k3s.service
└─20490 /usr/local/bin/k3s server --cluster-cidr=10.42.0.0/16 --kubelet-arg max-pods=500 --docker
However it does not seem to have an effect; kubectl describe nodes still says the Pod CIDR is a /24. How can I change it for an existing cluster?
IIRC, that is a flannel config; by default it designates a /24 network to each node, hence the 255-pod limit.
Edit: yeah, /24 per node, more info: https://github.com/coreos/flannel/blob/master/Documentation/configuration.md
Right now I don't know if it's possible to edit the k3s flannel config, sorry about that!
k3s should auto-create a config using the cluster cidr:
$ cat /var/lib/rancher/k3s/agent/etc/flannel/net-conf.json
{
"Network": "10.42.0.0/16",
"Backend": {
"Type": "vxlan"
}
}
Ah, sorry, I see: it looks like SubnetLen should also be set.
I don't think it works. Single node, k3s v1.0.1:
/usr/local/bin/k3s server --kubelet-arg=max-pods=500 --no-deploy=traefik,servicelb --flannel-conf=/var/lib/rancher/k3s/agent/etc/flannel/net-conf-local.json --docker
I've added SubnetLen to flannel configuration:
# cat /var/lib/rancher/k3s/agent/etc/flannel/net-conf-local.json
{
  "Network": "10.24.0.0/16",
  "SubnetLen": 22,
  "Backend": {
    "Type": "vxlan"
  }
}
And pod CIDR is still /24:
# kubectl get nodes -o jsonpath='{.items[*].spec.podCIDR}'
10.42.0.0/24
What needs to be set to change the pod CIDR to a network broader than /24, so the max-pods=500 setting can actually be used?
Did you rebuild the cluster from scratch with the new config file? It looks to me like the subnet is registered when the node is created, so changing the config after the fact may not take effect.
You can also cat /run/flannel/subnet.env to see what the actual config is.
Thank you for this information, I didn't know that. It should probably be documented :) Anyway, clean VM, still no luck:
# cat /opt/etc/flannel.json
{
  "Network": "10.24.0.0/16",
  "SubnetLen": 22,
  "Backend": {
    "Type": "vxlan"
  }
}
# curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.0.1" sh -s - --kubelet-arg="max-pods=500" --no-deploy=traefik,servicelb --docker --flannel-conf=/opt/etc/flannel.json
# cat /run/flannel/subnet.env
FLANNEL_NETWORK=10.24.0.0/16
FLANNEL_SUBNET=10.42.0.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=true
OK, so it looks like the k3s flannel implementation actually waits until the node already has a PodCIDR allocated before starting: https://github.com/rancher/k3s/blob/master/pkg/agent/flannel/setup.go#L89
This means that the PodCIDR is getting assigned somewhere else, most likely by the default IPAM that's embedded in the controller-manager: https://github.com/rancher/k3s/blob/master/vendor/k8s.io/kubernetes/pkg/controller/nodeipam/ipam/range_allocator.go
The args for that get built here: https://github.com/rancher/k3s/blob/master/pkg/daemons/control/server.go#L116
There's an extra --node-cidr-mask-size option that can be passed to kube-controller-manager, which defaults to 24. It looks like we can pass this in as --kube-controller-manager-arg=node-cidr-mask-size=22.
Sure enough, after starting with that option I get:
kubectl get nodes -o jsonpath='{.items[*].spec.podCIDR}'
10.42.0.0/22
As far as I can tell, everything else downstream from there seems to be handled properly - you don't need to provide a custom flannel.conf or anything.
tl;dr:
- k3s server --kube-controller-manager-arg=node-cidr-mask-size=22 --kubelet-arg=max-pods=500
- enjoy your beefy nodes
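As a rough sizing sketch (plain CIDR arithmetic; the cluster CIDR is shown explicitly even though 10.42.0.0/16 is already the default):
## node-cidr-mask-size=22 gives each node 2^(32-22) - 2 = 1022 usable pod IPs, enough for max-pods=500
## a 10.42.0.0/16 cluster CIDR split into /22 blocks leaves room for 2^(22-16) = 64 nodes
k3s server \
  --cluster-cidr=10.42.0.0/16 \
  --kube-controller-manager-arg=node-cidr-mask-size=22 \
  --kubelet-arg=max-pods=500
## verify the per-node allocation once the node has registered
kubectl get nodes -o jsonpath='{.items[*].spec.podCIDR}'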
It works after a cluster rebuild, thank you!
Is there a way to change the pod CIDR without rebuilding the cluster?
You could probably play with changing the podCIDR attributes manually, and then restarting k3s on the nodes to regenerate the configuration. I'm not sure how grumpy that would make the integrated IPAM, but finding out should be easy enough.
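For illustration, that manual change might look like the sketch below; the node name and mask are placeholders, and since spec.podCIDR is set once at node registration the API server may well reject the patch, which is what the next comment reports.
## hypothetical attempt; spec.podCIDR is assigned at registration and is likely immutable
kubectl patch node <node-name> --type=merge -p '{"spec":{"podCIDR":"10.42.0.0/22"}}'
## then restart k3s on that node so flannel regenerates its config from the node object
systemctl restart k3s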
@brandond would you like to share how you changed the attributes? It seems like I cannot directly patch it...
What is the best way to change the netmask on a running k3s cluster?
Currently the default cni0 adapter is masked at /24, which means I can set --kubelet-arg max-pods=254 at most.
Currently my way is
## re-apply configs
systemctl stop k3s
curl -sfL https://get.k3s.io | sh -s - --no-deploy local-storage --kube-controller-manager-arg=node-cidr-mask-size=22 --cluster-cidr=10.42.0.0/22 --service-cidr=10.43.0.0/22 --kubelet-arg max-pods=1022 --cluster-init
k3s check-config ## <-- somewhat not checking cni0??
vim /run/flannel/subnet.env ## <-- change FLANNEL_SUBNET to 22
k3s-killall.sh ## This stops k3s and removes cni0 which was masked at /24
sudo systemctl restart k3s
For all the trouble, I'm still getting:
kubectl get nodes -o jsonpath='{.items[*].spec.podCIDR}'
10.42.0.0/24
If the attribute cannot be edited as @stevefan1999-personal reports, then the only way would be to rebuild your k3s cluster and start the server with the arguments listed in https://github.com/rancher/k3s/issues/697#issuecomment-576463182
Hi all, I've been trying to do this myself, but without rebuilding the cluster. I'm running v1.18.9+k3s1.
I get as far as trying to restart, but I keep running into this:
FATA[0000] flag provided but not defined: -kube-controller-manager-arg
when trying to start the nodes back up with:
ExecStart=/usr/local/bin/k3s \
    agent --kubelet-arg=max-pods=1022 --kube-controller-manager-arg=node-cidr-mask-size=22
Any ideas please?
This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 180 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.
Still an important modification to be done for big clusters.
For the issue reported above, only servers run kube-controller-manager, which is why you can't set that arg on agents.
More generally speaking, I'm not sure Kubernetes itself supports changing cidr assignments after the cluster is established.
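A minimal sketch of where each flag belongs, assuming a systemd-installed single server (the server URL and token are placeholders):
## on the server, which runs the embedded kube-controller-manager
k3s server --kube-controller-manager-arg=node-cidr-mask-size=22 --kubelet-arg=max-pods=1022
## on each agent; no controller-manager runs here, so only the kubelet arg applies
k3s agent --server https://<server>:6443 --token <token> --kubelet-arg=max-pods=1022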
There's an extra --node-cidr-mask-size option that can be passed to kube-controller-manager, which defaults to 24. It looks like we can pass this in as --kube-controller-manager-arg=node-cidr-mask-size=22
Is there any documentation covering all the options that can be passed via kube-controller-manager-arg?
@kirbyzhou yes, in the Kubernetes controller-manager documentation.
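Any flag from that reference can be passed through; the k3s argument can be repeated once per controller-manager flag. A sketch (node-monitor-period is just a second illustrative flag):
k3s server \
  --kube-controller-manager-arg=node-cidr-mask-size=22 \
  --kube-controller-manager-arg=node-monitor-period=5s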
I intended to open a new issue, but since this one exactly matches my problem, and it is still open, I'll just comment here:
I just followed the Quick Start guide instructions and installed a control node on a brand new CentOS 7 VM (k3s version: v1.25.6+k3s1), with the following command:
curl -sfL https://get.k3s.io | sh -
This should install a cluster in which, according to the official documentation, the Pod network is 10.42.0.0/16.
Instead, the cluster is created with a Pod CIDR of 10.42.0.0/24 (which of course limits the Pod number to 255 or so).
I tried instead to create the cluster with:
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="server --cluster-cidr=10.42.0.0/16 --service-cidr=10.43.0.0/16" sh -
(According to the documentation, the --service-cidr and --cluster-cidr arguments should not be necessary, since the provided values are the exact defaults.)
But again: the cluster is created with 10.42.0.0/24
I have already read the considerations above on how to create the cluster with the correct CIDR, but I suggest that either the docs are fixed, so that they match what the default configuration does, or the default configuration is changed to match the documentation.
It also seems that the parameters --cluster-cidr and --service-cidr are not being honored...?
Thanks J.
@jorgegv did you read any of the conversation up above?
The default cluster cidr is 10.42.0.0/16. The default node cidr mask is 24, so each node gets a block sub-allocated from the /16 starting at 10.42.0.0/24 which is probably what you're looking at and making assumptions about. None of this can be easily changed once the cluster has been started. This is all discussed in the comments you just replied to.
Update: I managed to create a /16 network cluster with:
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--kubelet-arg=max-pods=65534 --kube-controller-manager-arg=node-cidr-mask-size=16" sh -
With that command line, when I kubectl describe node I get the correct CIDR (apparently).
But after that, if I follow the documentation and try to add a worker node with:
curl -sfL https://get.k3s.io | K3S_URL=https://__REDACTED__:6443 K3S_TOKEN=__MY_TOKEN__ sh -
...the worker nodes get stuck when starting the service. They appear as Ready with get node, but the k3s-agent process is stuck, with these messages:
Jan 27 09:15:07 k8s-worker-1.__REDACTED__ k3s[2015]: E0127 09:15:07.801167 2015 kuberuntime_manager.go:772] "CreatePodSandbox for pod failed" err="rpc error: code = Unknown desc = failed to setup network for sandbox \"ba1025fb23136d071f0345a62ad0169e4e0ddd08a568cbb7fec5b0b4fb4a82a3\": plugin type=\"flannel\" failed (add): open /run/flannel/subnet.env: no such file or directory" pod="kube-system/svclb-traefik-0d1389a6-trsnt"
Jan 27 09:15:07 k8s-worker-1.__REDACTED__ k3s[2015]: E0127 09:15:07.801201 2015 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"svclb-traefik-0d1389a6-trsnt_kube-system(faa2e95d-1a38-4a06-946b-461109d22e68)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"svclb-traefik-0d1389a6-trsnt_kube-system(faa2e95d-1a38-4a06-946b-461109d22e68)\\\": rpc error: code = Unknown desc = failed to setup network for sandbox \\\"ba1025fb23136d071f0345a62ad0169e4e0ddd08a568cbb7fec5b0b4fb4a82a3\\\": plugin type=\\\"flannel\\\" failed (add): open /run/flannel/subnet.env: no such file or directory\"" pod="kube-system/svclb-traefik-0d1389a6-trsnt" podUID=faa2e95d-1a38-4a06-946b-461109d22e68
Yes, you just gave each node a /16 out of the larger /16 range. That gives you room for one node, the server. There are no sub-ranges left for the agent in a /16 sub-divided into smaller /16s. This is kinda basic math.
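Concretely, the number of per-node blocks is 2^(node-cidr-mask-size minus the cluster CIDR prefix length), so a sizing that still leaves room for agents might look like the sketch below (the /20 mask and max-pods value are just example numbers):
## /16 per node out of a /16 cluster: 2^(16-16) = 1 node block, server only
## /20 per node out of a /16 cluster: 2^(20-16) = 16 node blocks, 4094 usable pod IPs each
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--kube-controller-manager-arg=node-cidr-mask-size=20 --kubelet-arg=max-pods=1000" sh -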
@jorgegv did you read any of the conversation up above?
The default cluster cidr is 10.42.0.0/16. The default node cidr mask is 24, so each node gets a block sub-allocated from the /16 starting at 10.42.0.0/24 which is probably what you're looking at and making assumptions about. None of this can be easily changed once the cluster has been started. This is all discussed in the comments you just replied to.
There is no need to be rude. Did you read my comment at the end?
"I have already read the considerations above on how to create the cluster with the correct CIDR, but I suggest that either the docs are fixed, so that they match what the default configuration does, or the default configuration is changed to match the documentation."
So yes, I read it, and yes, I did create a /16 cluster. I'm just suggesting that the published installation procedure be reviewed, because the mentioned defaults did not seem to match the docs.
And again, reading your next comment you wrote while I was writing this one: no need to be rude.
Thanks for explaining, though. I understand it now.
"The default node cidr mask is 24, so each node gets a block sub-allocated from the /16 starting at 10.42.0.0/24"
This was the critical part that was not included in the comments above, before you mentioned it. And thanks again, @brandond .
The docs do not need to be fixed, they are correct. The Service CIDR is 10.43.0.0/16; ClusterIP addresses are allocated out of this range. The Cluster CIDR is 10.42.0.0/16, and each node is allocated a /24 out of this range for their pods. These things can all be configured with the appropriate args. There's even an example at https://github.com/k3s-io/k3s/issues/697#issuecomment-576463182
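Spelled out as a sketch of the default layout (node names and allocation order are illustrative):
## cluster-cidr 10.42.0.0/16  -> pod addresses, sub-divided per node (node-cidr-mask-size=24)
##   node-1 PodCIDR: 10.42.0.0/24   (254 usable pod IPs)
##   node-2 PodCIDR: 10.42.1.0/24
##   ... up to 256 nodes before the /16 is exhausted
## service-cidr 10.43.0.0/16  -> ClusterIP addresses, allocated cluster-wide, not per node
kubectl get nodes -o jsonpath='{.items[*].spec.podCIDR}'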
As I said in my previous comment, "each node gets a block sub-allocated from the /16 starting at 10.42.0.0/24" was the missing information for me. I did not know that.
Oh, and I agree: the docs do not need to be fixed.
Apologies if my responses were a bit heated. It can be frustrating to hear that our documentation is wrong and the software is not working, when it looks like the issue is someone not taking the time to read the discussion and examples given in the year-dead thread that they are responding to.
Ok, never mind. Perhaps your sentence "The Service CIDR is 10.43.0.0/16; ClusterIP addresses are allocated out of this range. The Cluster CIDR is 10.42.0.0/16, and each node is allocated a /24 out of this range for their pods. These things can all be configured with the appropriate args. " or a similar one can be added to the docs as a clarification? I have really spent days reading most of them, but I don't remember seeing this piece of information anywhere. I'd be more than glad to prepare a PR for it.
Of course, now that I know, it seems pretty obvious that this schema or a similar one should be used for IP management inside a cluster.
Just a thought, and thanks again for your time.
Closing, as there appears to be a workaround for expected upstream behavior.