
Configure capacity of the worker nodes

[Open] palade opened this issue 4 years ago • 19 comments

Would it be possible to set the capacity of the worker nodes when the cluster is created?

palade avatar Sep 27 '19 11:09 palade

can you elaborate a bit more? what's your use case?

aojea avatar Sep 27 '19 14:09 aojea

@aojea Doing some scheduler work and would like to consider the CPU and memory capacities of each node. I could use labels for this, but was wondering if it is possible to do this when the cluster is set up? Also, if labels are the only option, would it be possible to tag each node with particular labels from the initialisation script?

palade avatar Sep 27 '19 14:09 palade

Well, that seems interesting. @BenTheElder what do you think? Basically the worker nodes are docker containers, so we should be able to use docker resource constraints to limit them: https://docs.docker.com/config/containers/resource_constraints/ However, I don't know how this will work with nested cgroups :thinking:
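
For quick experiments, docker's resource constraints can be applied from the host to an existing node container; a rough sketch (kind-worker is the default name for a single worker, and the values are only placeholders):

# limit the worker node container to 1 CPU and 100MiB of memory (no extra swap)
docker update --cpus 1 --memory 100m --memory-swap 100m kind-worker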

aojea avatar Sep 27 '19 14:09 aojea

I don't know how this will work with nested cgroups

I might be wrong, but I don't think setting resource upper bounds will impact the current cgroup architecture. I do see performance issues with starving the node of resources, though.

I'm thinking about the UX side of things too; Docker resource constraints are pretty granular. Maybe we only expose some subset of the constraints, or maybe abstract them all together?

WalkerGriggs avatar Sep 28 '19 15:09 WalkerGriggs

Feel free to try this out but IIRC this doesn't work.

Similarly if swap is enabled on the host memory limits won't work on your pods either.
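
To check whether that applies on a given host, swap status can be inspected (and temporarily disabled for a test) with something like:

swapon --show    # lists active swap devices; empty output means no swap
sudo swapoff -a  # disables all swap for the test (re-enable with swapon -a)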

BenTheElder avatar Sep 28 '19 15:09 BenTheElder

I'm working on decoupling us from docker's command line. When that is complete and we experiment again with support for ignite and other backends, some of those can actually limit things, because while they are based around running container images they use VMs :+)

BenTheElder avatar Sep 28 '19 15:09 BenTheElder

Docker resource constraints are working for me with swap; I'll send a PR implementing it. I have one node limited to 100M in this example:

[screenshot]

aojea avatar Oct 01 '19 12:10 aojea

/assign

aojea avatar Oct 01 '19 12:10 aojea

Docker resource constraints are working for me with swap; I'll send a PR implementing it. I have one node limited to 100M in this example

That of course works but ... does it actually limit everything on the node? Have you deployed a pod trying to use more? What does kubelet report?
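
One way to see what the kubelet reports is to look at the node's capacity and allocatable, e.g. (assuming the default node names):

# capacity/allocatable as seen by the kubelet
kubectl describe node kind-worker | grep -A 6 -E 'Capacity|Allocatable'
# or just the raw memory capacity
kubectl get node kind-worker -o jsonpath='{.status.capacity.memory}'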

BenTheElder avatar Oct 01 '19 15:10 BenTheElder

kind: Cluster
apiVersion: kind.sigs.k8s.io/v1alpha3
nodes:
# the control plane node
- role: control-plane
- role: worker
  constraints:
    memory: "100m"
    cpu: "1"

from https://kubernetes.io/docs/tasks/configure-pod-container/assign-memory-resource/#specify-a-memory-request-and-a-memory-limit

I modified it to try to use 1.5G of memory directly:

apiVersion: v1
kind: Pod
metadata:
  name: memory-demo
  namespace: mem-example
spec:
  containers:
  - name: memory-demo-ctr
    image: polinux/stress
    command: ["stress"]
    args: ["--vm", "1", "--vm-bytes", "1500M", "--vm-hang", "1"]

The pod takes more than 4 minutes to be created. It doesn't seem to be a hard limit, so maybe we should tweak something on cgroups, but checking inside the node it really does seem to be limiting the memory:

Tasks:  19 total,   1 running,  18 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.5 us,  2.5 sy,  0.0 ni, 16.7 id, 80.3 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  32147.3 total,  16816.6 free,   1885.6 used,  13445.2 buff/cache
MiB Swap:   2055.0 total,    901.4 free,   1153.6 used.  29866.1 avail Mem

USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                
root      20   0  140504   4916      0 S   4.3   0.0   1:16.80 kube-proxy
root      20   0  130236   1720      0 D   3.7   0.0   0:30.99 kindnetd
root      20   0 2214724  70912  60684 S   3.3   0.2   0:37.25 kubelet
root      20   0 1587948  37516     24 D   3.0   0.1   0:36.98 stress
root      20   0 2210024  30812  23940 S   2.7   0.1   0:34.11 containerd
root      20   0    9336   4180   4180 S   1.3   0.0   0:01.93 containerd-shim
root      20   0   10744   4180   4180 S   0.7   0.0   0:01.70 containerd-shim
root      19  -1   22656   6684   6508 S   0.3   0.0   0:01.78 systemd-journal
root      20   0    6024   2756   2648 R   0.3   0.0   0:00.11 top                    
root      20   0   17524   7688   7688 S   0.0   0.0   0:00.53 systemd
root      20   0   10744   4180   4180 S   0.0   0.0   0:02.67 containerd-shim
root      20   0    1024      0      0 S   0.0   0.0   0:00.00 pause
root      20   0    9336   4180   4180 S   0.0   0.0   0:02.23 containerd-shim
root      20   0    1024      0      0 S   0.0   0.0   0:00.00 pause
root      20   0   10744   4608   4564 S   0.0   0.0   0:00.81 containerd-shim
root      20   0    1024      0      0 S   0.0   0.0   0:00.00 pause
root      20   0   10744   3980   3980 S   0.0   0.0   0:00.91 containerd-shim
root      20   0     744      0      0 S   0.0   0.0   0:00.06 stress
root      20   0    4052   2936   2936 S   0.0   0.0   0:00.05 bash
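
For what it's worth, the limit docker applied to the node container can also be read directly from the node's cgroup (cgroup v1 path, assuming the default kind-worker name):

docker exec kind-worker cat /sys/fs/cgroup/memory/memory.limit_in_bytes
# roughly 104857600 (100MiB) is expected for a 100M limit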

aojea avatar Oct 01 '19 16:10 aojea

Looking at the kernel docs, it seems that this is throttling (https://www.kernel.org/doc/Documentation/cgroup-v1/blkio-controller.txt); check the block I/O stats:

CONTAINER ID        NAME                 CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
1698a9d1be92        kind-worker          14.64%              99.42MiB / 100MiB     99.42%              4.34MB / 361kB      1.91GB / 1.04GB     155                         
1a1a6fb0f69a        kind-control-plane   6.75%               1.268GiB / 31.39GiB   4.04%               512kB / 2.03MB      0B / 81.7MB         392       

Do we want this? Or is the idea to fail if it overcommits?
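
For reference, output like the stats above can be reproduced with docker stats, assuming the default container names:

docker stats --no-stream kind-worker kind-control-plane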

aojea avatar Oct 01 '19 16:10 aojea

I think that there are several options:

  • use a provider that uses VMs for the nodes
  • implement something like lxcfs to "fake" the resources and trick cAdvisor and the kubelet

Otherwise you can set the limit manually as explained here: https://github.com/kubernetes-sigs/kind/issues/1524 (a sketch of that approach is below).

using container constraints (cgroups) is only valid for limiting the resources, but kubelet keeps using the whole host memory and cpu resources for its calculations.
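
If I remember right, the manual approach in that issue amounts to reserving resources through the kubelet, so the node advertises less allocatable capacity than the host really has. A rough sketch with placeholder values (note this reserves capacity rather than enforcing a hard limit):

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4   # older kind releases used v1alpha3
nodes:
- role: control-plane
- role: worker
  kubeadmConfigPatches:
  - |
    kind: JoinConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        # hide most of a 32Gi / 8-CPU host from the scheduler
        system-reserved: memory=28Gi,cpu=7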

aojea avatar Sep 09 '20 07:09 aojea

using container constraints (cgroups) is only valid for limiting the resources, but kubelet keeps using the whole host memory and cpu resources for its calculations.

Hello @aojea, this PR on cAdvisor addresses this point. I hope this will help. Thanks

louiznk avatar Oct 14 '20 09:10 louiznk

using container constraints (cgroups) is only valid for limiting the resources, but kubelet keeps using the whole host memory and cpu resources for its calculations.

Hello @aojea, this PR on cAdvisor addresses this point. I hope this will help. Thanks

That sounds nice, do you think it has a chance to be approved?

aojea avatar Oct 14 '20 10:10 aojea

using container constraints (cgroups) is only valid for limiting the resources, but kubelet keeps using the whole host memory and cpu resources for its calculations.

Hello @aojea, this PR on cAdvisor addresses this point. I hope this will help. Thanks

That sounds nice, do you think it has a chance to be approved?

I hope 🤷🏻‍♂️

louiznk avatar Oct 14 '20 10:10 louiznk

Sadly no re: cAdvisor. This doesn't leave us with spectacular options. Maybe we can trick the kubelet into reading our own "vfs" or something (like lxcfs?) 😬 Semi-related: #2318's solution.
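
If anyone wants to experiment with the lxcfs route, a very rough (untested) sketch could be to bind-mount lxcfs's proc files over the node's via extraMounts, assuming lxcfs is running on the host at /var/lib/lxcfs:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: worker
  extraMounts:
  # lxcfs serves per-container views of these files based on cgroup limits
  - hostPath: /var/lib/lxcfs/proc/meminfo
    containerPath: /proc/meminfo
  - hostPath: /var/lib/lxcfs/proc/cpuinfo
    containerPath: /proc/cpuinfo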

BenTheElder avatar Jun 24 '21 08:06 BenTheElder

Doing some scheduler work and would like to consider the CPU and memory capacities of each node. I could use labels for this...

@palade Did you mean we can limit a node's CPU and memory capacities provided to the kubernetes cluster by assigning some labels to the node? Which labels did you use? Can you give me an example? Thanks a lot.
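
For the label part, recent kind releases appear to let you declare node labels directly in the cluster config (if I'm not mistaken), e.g.:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
  labels:
    # arbitrary example label that a custom scheduler could key off
    capacity-tier: small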

LambertZhaglog avatar Apr 13 '22 06:04 LambertZhaglog

Any progress? Will we still be able to do this?

hwdef avatar Oct 20 '23 08:10 hwdef

https://github.com/kubernetes/kubernetes/issues/120832

BenTheElder avatar Dec 07 '23 16:12 BenTheElder