Default Pod CIDR seems to be 10.42.0.0/24

elsbrock opened this issue 4 years ago • 18 comments

Describe the bug
I am running k3s version v0.7.0 (61bdd852) on a beefy machine with an increased pod limit. I reached the maximum number of pods (255), although the Pod CIDR is 10.42.0.0/16 according to the docs, so I'd expect to be able to run more than that.

To Reproduce
k3s server --max-pods=500; for i in $(seq 300); do kubectl run --image=busybox busybox-$i; done

When creating the 255th Pod I got the following error:

0s          Warning   FailedCreatePodSandBox   pod/tiller-deploy-795bdd79b-msd5l    (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "54463c700e8bfad1eb641884bd364bad8c97b8999b4d29de00a7127b4eb9924f" network for pod "tiller-deploy-795bdd79b-msd5l": NetworkPlugin cni failed to set up pod "tiller-deploy-795bdd79b-msd5l" network: failed to allocate for range 0: no IP addresses available in range set: 10.42.0.1-10.42.0.254

And indeed, kubectl describe nodes reveals:

PodCIDR:                     10.42.0.0/24
Non-terminated Pods:         (255 in total)

Expected behavior
I am able to create more than 255 pods.

Additional context

I looked into the docs and was surprised to see the default is supposedly /16. I modified /etc/systemd/system/k3s.service to pass an explicit --cluster-cidr=10.42.0.0/16:

ubuntu@bw-dh01:~$ systemctl status k3s
● k3s.service - Lightweight Kubernetes
   Loaded: loaded (/etc/systemd/system/k3s.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2019-08-01 12:55:36 UTC; 3min 36s ago
     Docs: https://k3s.io
  Process: 19896 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
  Process: 19076 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
 Main PID: 20490 (k3s-server)
    Tasks: 0
   CGroup: /system.slice/k3s.service
           └─20490 /usr/local/bin/k3s server --cluster-cidr=10.42.0.0/16 --kubelet-arg max-pods=500 --docker

However, it does not seem to have any effect; kubectl describe nodes still says the Pod CIDR is a /24. How can I change it for an existing cluster?

elsbrock avatar Aug 01 '19 13:08 elsbrock

IIRC that is flannel's config: by default it assigns a /24 network to each node, hence the 255-pod limit.

Edit: yeah, /24 per node; more info: https://github.com/coreos/flannel/blob/master/Documentation/configuration.md

Right now I don't know if it's possible to edit the k3s flannel config, sorry about that!

vFondevilla avatar Aug 02 '19 18:08 vFondevilla

k3s should auto-create a config using the cluster cidr:

$ cat /var/lib/rancher/k3s/agent/etc/flannel/net-conf.json
{
    "Network": "10.42.0.0/16",
    "Backend": {
        "Type": "vxlan"
    }
}

erikwilson avatar Aug 02 '19 20:08 erikwilson

Ah, sorry, I see now: it looks like SubnetLen should be set as well.

erikwilson avatar Aug 02 '19 20:08 erikwilson

I don't think it works. Single node, k3s v1.0.1:

/usr/local/bin/k3s server --kubelet-arg=max-pods=500 --no-deploy=traefik,servicelb --flannel-conf=/var/lib/rancher/k3s/agent/etc/flannel/net-conf-local.json --docker

I've added SubnetLen to flannel configuration:

# cat /var/lib/rancher/k3s/agent/etc/flannel/net-conf-local.json
{
    "Network": "10.24.0.0/16",
    "SubnetLen": 22,
    "Backend": {
        "Type": "vxlan"
    }
}

And pod CIDR is still /24:

# kubectl get nodes -o jsonpath='{.items[*].spec.podCIDR}'
10.42.0.0/24

What should be set to change the pod CIDR to a network broader than /24, so the max-pods=500 setting can actually be utilized?

IdahoPL avatar Jan 20 '20 21:01 IdahoPL

Did you rebuild the cluster from scratch with the new config file? It looks to me like the subnet is registered when the node is created, so changing the config after the fact may not take effect.

You can also cat /run/flannel/subnet.env to see what the actual config is.

brandond avatar Jan 20 '20 22:01 brandond

Thank you for this information, I didn't know that. It should probably be documented :) Anyway, clean VM, still no luck:

# cat /opt/etc/flannel.json
{
    "Network": "10.24.0.0/16",
    "SubnetLen": 22,
    "Backend": {
        "Type": "vxlan"
    }
}

# curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.0.1" sh -s - --kubelet-arg="max-pods=500" --no-deploy=traefik,servicelb --docker --flannel-conf=/opt/etc/flannel.json

# cat /run/flannel/subnet.env
FLANNEL_NETWORK=10.24.0.0/16
FLANNEL_SUBNET=10.42.0.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=true

IdahoPL avatar Jan 20 '20 23:01 IdahoPL

OK, so it looks like the k3s flannel implementation actually waits until the node already has a PodCIDR allocated before starting: https://github.com/rancher/k3s/blob/master/pkg/agent/flannel/setup.go#L89

This means that the PodCIDR is getting assigned somewhere else, most likely by the default IPAM that's embedded in the controller-manager: https://github.com/rancher/k3s/blob/master/vendor/k8s.io/kubernetes/pkg/controller/nodeipam/ipam/range_allocator.go
The args for that get built here: https://github.com/rancher/k3s/blob/master/pkg/daemons/control/server.go#L116

There's an extra --node-cidr-mask-size option that can be passed to kube-controller-manager, which defaults to 24. It looks like we can pass this in as --kube-controller-manager-arg=node-cidr-mask-size=22

Sure enough, after starting with that option I get:

kubectl get nodes -o jsonpath='{.items[*].spec.podCIDR}'
10.42.0.0/22

As far as I can tell, everything else downstream from there seems to be handled properly - you don't need to provide a custom flannel.conf or anything.
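
For reference, a rough sketch of the capacity math behind those two flags (assuming the default 10.42.0.0/16 cluster CIDR; these lines are plain shell arithmetic, not k3s output):

# node-cidr-mask-size=22 sub-allocates a /22 block to each node out of the /16 cluster CIDR
echo $(( 2 ** (32 - 22) - 2 ))   # 1022 usable pod addresses per node block
echo $(( 2 ** (22 - 16) ))       # 64 node blocks fit in the /16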

tl;dr:

  1. k3s.sh server --kube-controller-manager-arg=node-cidr-mask-size=22 --kubelet-arg=max-pods=500
  2. enjoy your beefy nodes

brandond avatar Jan 20 '20 23:01 brandond

It works after cluster rebuild, thank you!

Is there a way to change pod CIDR without cluster rebuild?

IdahoPL avatar Jan 21 '20 13:01 IdahoPL

You could probably play with changing the podCIDR attributes manually, and then restarting k3s on the nodes to regenerate the configuration. I'm not sure how grumpy that would make the integrated IPAM, but finding out should be easy enough.
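
A sketch of what such a manual attempt might look like (the node name is a placeholder); be aware that the API server may well reject it, since spec.podCIDR is generally treated as immutable once set:

kubectl patch node my-node --type merge -p '{"spec":{"podCIDR":"10.42.0.0/22"}}'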

brandond avatar Jan 21 '20 17:01 brandond

@brandond would you like to share how you changed the attributes? It seems like I cannot directly patch it...

stevefan1999-personal avatar Mar 20 '20 15:03 stevefan1999-personal

What is the best way to change the netmask on a running k3s cluster? Currently the default cni0 adapter is masked at /24, which means I can only set --kubelet-arg max-pods=254 at most.

Currently my way is

## re-apply configs
systemctl stop k3s
curl -sfL https://get.k3s.io |  sh -s - --no-deploy local-storage --kube-controller-manager-arg=node-cidr-mask-size=22 --cluster-cidr=10.42.0.0/22 --service-cidr=10.43.0.0/22 --kubelet-arg max-pods=1022 --cluster-init 
k3s check-config ## <-- somewhat not checking cni0??
vim /run/flannel/subnet.env ## <-- change FLANNEL_SUBNET to 22
k3s-killall.sh ## This stops k3s and removes cni0 which was masked at /24
sudo systemctl restart k3s

For all the trouble, I'm still getting:

kubectl get nodes -o jsonpath='{.items[*].spec.podCIDR}'
10.42.0.0/24

mikeccuk2005 avatar Jul 07 '20 10:07 mikeccuk2005

If the attribute cannot be edited as @stevefan1999-personal reports, then the only way would be to rebuild your k3s cluster and start the server with the arguments listed in https://github.com/rancher/k3s/issues/697#issuecomment-576463182

brandond avatar Jul 07 '20 21:07 brandond

Hi all, I've been trying to do this myself, but without rebuilding the cluster. I'm running v1.18.9+k3s1.

I get as far as trying to restart, but I keep running into FATA[0000] flag provided but not defined: -kube-controller-manager-arg when trying to start the nodes back up with:

ExecStart=/usr/local/bin/k3s \
    agent --kubelet-arg=max-pods=1022 --kube-controller-manager-arg=node-cidr-mask-size=22

Any ideas please?

HarryC145 avatar Nov 08 '20 10:11 HarryC145

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 180 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.

stale[bot] avatar Jul 30 '21 23:07 stale[bot]

Still an important modification to be done for big clusters.

unixfox avatar Jul 31 '21 02:07 unixfox

For the issue reported above, only servers run kube-controller-manager, which is why you can't set that arg on agents.

More generally speaking, I'm not sure Kubernetes itself supports changing CIDR assignments after the cluster is established.
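
As a hedged sketch (unit paths and values are illustrative), the flag belongs in the server's unit, not the agent's:

# systemd unit on a *server* node, e.g. /etc/systemd/system/k3s.service:
ExecStart=/usr/local/bin/k3s \
    server --kube-controller-manager-arg=node-cidr-mask-size=22 --kubelet-arg=max-pods=1022

# agent nodes only take kubelet-level args:
ExecStart=/usr/local/bin/k3s \
    agent --kubelet-arg=max-pods=1022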

brandond avatar Jul 31 '21 17:07 brandond

There's an extra --node-cidr-mask-size option that can be passed to kube-controller-manager, which defaults to 24. It looks like we can pass this in as --kube-controller-manager-arg=node-cidr-mask-size=22

Is there any documentation listing all the options that can be passed via kube-controller-manager-arg?

kirbyzhou avatar Dec 03 '21 10:12 kirbyzhou

@kirbyzhou yes, in the Kubernetes controller-manager documentation.

brandond avatar Dec 03 '21 11:12 brandond

I intended to open a new issue, but since this one exactly matches my problem, and it is still open, I'll just comment here:

I just followed Quick Start guide instructions and installed a control node on a brand new CentOS 7 VM (K3S version: v1.25.6+k3s1), with the following command:

curl -sfL https://get.k3s.io | sh -

This should install a cluster in which, according to the official documentation, the Pod network is 10.42.0.0/16.

Instead, the cluster is created with a Pod CIDR of 10.42.0.0/24 (which of course limits the Pod number to 255 or so).

I tried instead to create the cluster with:

curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="server --cluster-cidr=10.42.0.0/16 --service-cidr=10.43.0.0/16" sh -

(According to documentation, the --service-cidr and --cluster-cidr arguments should not be necessary, since the provided values are the exact defaults)

But again: the cluster is created with 10.42.0.0/24

I have already read the considerations above on how to create the cluster with the correct CIDR, but I suggest that either the docs are fixed, so that they match what the default configuration does, or the default configuration is changed to match the documentation.

It also seems that the parameters --cluster-cidr and --service-cidr are not being honored...?

Thanks J.

jorgegv avatar Jan 27 '23 08:01 jorgegv

@jorgegv did you read any of the conversation up above?

The default cluster cidr is 10.42.0.0/16. The default node cidr mask is 24, so each node gets a block sub-allocated from the /16 starting at 10.42.0.0/24 which is probably what you're looking at and making assumptions about. None of this can be easily changed once the cluster has been started. This is all discussed in the comments you just replied to.
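
A quick way to see the per-node sub-allocations on a running cluster (just a variation of the jsonpath command used earlier in this thread):

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'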

brandond avatar Jan 27 '23 09:01 brandond

Update: I achieved creating a /16 network cluster with:

curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--kubelet-arg=max-pods=65534 --kube-controller-manager-arg=node-cidr-mask-size=16" sh -

With that command line, when I kubectl describe node I get the correct CIDR (apparently).

But after that, if I follow the documentation and try to add a worker node with:

curl -sfL https://get.k3s.io | K3S_URL=https://__REDACTED__:6443 K3S_TOKEN=__MY_TOKEN__ sh -

...the worker nodes get stuck when starting the service. They appear as Ready with kubectl get node, but the k3s-agent process is stuck, with these messages:

Jan 27 09:15:07 k8s-worker-1.__REDACTED__ k3s[2015]: E0127 09:15:07.801167    2015 kuberuntime_manager.go:772] "CreatePodSandbox for pod failed" err="rpc error: code = Unknown desc = failed to setup network for sandbox \"ba1025fb23136d071f0345a62ad0169e4e0ddd08a568cbb7fec5b0b4fb4a82a3\": plugin type=\"flannel\" failed (add): open /run/flannel/subnet.env: no such file or directory" pod="kube-system/svclb-traefik-0d1389a6-trsnt"
Jan 27 09:15:07 k8s-worker-1.__REDACTED__ k3s[2015]: E0127 09:15:07.801201    2015 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"svclb-traefik-0d1389a6-trsnt_kube-system(faa2e95d-1a38-4a06-946b-461109d22e68)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"svclb-traefik-0d1389a6-trsnt_kube-system(faa2e95d-1a38-4a06-946b-461109d22e68)\\\": rpc error: code = Unknown desc = failed to setup network for sandbox \\\"ba1025fb23136d071f0345a62ad0169e4e0ddd08a568cbb7fec5b0b4fb4a82a3\\\": plugin type=\\\"flannel\\\" failed (add): open /run/flannel/subnet.env: no such file or directory\"" pod="kube-system/svclb-traefik-0d1389a6-trsnt" podUID=faa2e95d-1a38-4a06-946b-461109d22e68

jorgegv avatar Jan 27 '23 09:01 jorgegv

Yes, you just gave each node a /16 out of the larger /16 range. That gives you room for one node, the server. There are no sub-ranges left for the agent in a /16 sub-divided into smaller /16s. This is kinda basic math.
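
To spell the arithmetic out (same formula as the /22 example earlier in the thread):

echo $(( 2 ** (16 - 16) ))   # number of /16 node blocks inside a /16 cluster CIDR: exactly 1, taken by the server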

brandond avatar Jan 27 '23 09:01 brandond

@jorgegv did you read any of the conversation up above?

The default cluster cidr is 10.42.0.0/16. The default node cidr mask is 24, so each node gets a block sub-allocated from the /16 starting at 10.42.0.0/24 which is probably what you're looking at and making assumptions about. None of this can be easily changed once the cluster has been started. This is all discussed in the comments you just replied to.

There is no need to be rude. Did you read my comment at the end?

"I have already read the considerations above on how to create the cluster with the correct CIDR, but I suggest that either the docs are fixed, so that they match what the default configuration does, or the default configuration is changed to match the documentation."

So yes, I read it, and yes, I did create a /16 cluster. I'm just suggesting that the published installation procedure be reviewed, because the mentioned defaults do not match the docs.

And again, reading your next comment you wrote while I was writing this one: no need to be rude.

jorgegv avatar Jan 27 '23 09:01 jorgegv

Thanks for explaining, though. I understand it now.

jorgegv avatar Jan 27 '23 09:01 jorgegv

"The default node cidr mask is 24, so each node gets a block sub-allocated from the /16 starting at 10.42.0.0/24"

This was the critical part that was not included in the comments above, before you mentioned it. And thanks again, @brandond .

jorgegv avatar Jan 27 '23 09:01 jorgegv

The docs do not need to be fixed; they are correct. The Service CIDR is 10.43.0.0/16; ClusterIP addresses are allocated out of this range. The Cluster CIDR is 10.42.0.0/16, and each node is allocated a /24 out of this range for their pods. These things can all be configured with the appropriate args. There's even an example at https://github.com/k3s-io/k3s/issues/697#issuecomment-576463182
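
For completeness, a sketch combining those args at install time (values illustrative; the CIDRs shown are just the stock defaults made explicit):

curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="server --cluster-cidr=10.42.0.0/16 --service-cidr=10.43.0.0/16 --kube-controller-manager-arg=node-cidr-mask-size=22 --kubelet-arg=max-pods=1000" sh -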

brandond avatar Jan 27 '23 09:01 brandond

As I said in my previous comment, "each node gets a block sub-allocated from the /16 starting at 10.42.0.0/24" was the missing information for me. I did not know that.

Oh, and I agree: the docs do not need to be fixed.

jorgegv avatar Jan 27 '23 09:01 jorgegv

Apologies if my responses were a bit heated. It can be frustrating to hear that our documentation is wrong and the software is not working, when it looks like the issue is someone not taking the time to read the discussion and examples given in the year-dead thread that they are responding to.

brandond avatar Jan 27 '23 09:01 brandond

Ok, never mind. Perhaps your sentence "The Service CIDR is 10.43.0.0/16; ClusterIP addresses are allocated out of this range. The Cluster CIDR is 10.42.0.0/16, and each node is allocated a /24 out of this range for their pods. These things can all be configured with the appropriate args. " or a similar one can be added to the docs as a clarification? I have really spent days reading most of them, but I don't remember seeing this piece of information anywhere. I'd be more than glad to prepare a PR for it.

Of course, now that I know, it seems pretty obvious that this schema or a similar one should be used for IP management inside a cluster.

Just a thought, and thanks again for your time.

jorgegv avatar Jan 27 '23 09:01 jorgegv

Closing, as there appears to be a workaround for this expected upstream behavior.

caroline-suse-rancher avatar Feb 21 '23 14:02 caroline-suse-rancher