
[BUG] too many open files

dennis-ge opened this issue 3 years ago

What did you do

  • How was the cluster created?

    • k3d registry create own-registry.localhost -p 5000
    • k3d cluster create c1 --kubeconfig-update-default --timeout 300s --agents 1 --k3s-arg --disable=traefik@server:0 --image rancher/k3s:v1.20.11-k3s1 --registry-use own-registry.localhost:5000 --port 80:80@loadbalancer --port 443:443@loadbalancer --verbose
  • What did you do afterwards?

    • delete the cluster

What did you expect to happen

I expected the cluster to be created successfully. I know this is related to the OS/Docker file descriptor limits. Increasing the limit with ulimit -n 512 (or any higher number) fixes the issue. However, I am wondering if there is a more elegant way of handling this, without needing to manually adjust the ulimit?

The current workaround is to create the cluster without agents (so executing k3d cluster create c1 --kubeconfig-update-default --timeout 300s --k3s-arg --disable=traefik@server:0 --image rancher/k3s:v1.20.11-k3s1 --registry-use own-registry.localhost:5000 --port 80:80@loadbalancer --port 443:443@loadbalancer). By doing so, no ulimit changes are required.
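For reference, the ulimit workaround can be scripted so the limit is only raised when it is actually too low. This is a minimal sketch assuming a POSIX shell; the target value 4096 is illustrative, not a recommendation from this thread:

```shell
# Raise the per-process open-file (nofile) soft limit for this shell
# session before running k3d, but only if it is below a target value.
target=4096
soft=$(ulimit -Sn)
echo "current soft nofile limit: $soft"
# Only attempt to raise if the current value is numeric and below target
# ("unlimited" is also possible on some systems).
if [ "$soft" != "unlimited" ] && [ "$soft" -lt "$target" ]; then
  # This can fail if target exceeds the hard limit (check: ulimit -Hn)
  ulimit -n "$target" 2>/dev/null || echo "could not raise limit to $target"
fi
# k3d cluster create ... run afterwards inherits the raised limit
```

Since ulimit only affects the current shell and its children, this has to run in the same session (or script) that invokes k3d.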

Screenshots or terminal output

DEBU[0000] Runtime Info:
&{Name:docker Endpoint:/var/run/docker.sock Version:20.10.8 OSType:linux OS:Docker Desktop Arch:x86_64 CgroupVersion:1 CgroupDriver:cgroupfs Filesystem:extfs}
DEBU[0000] Additional CLI Configuration:
cli:
  api-port: ""
  env: []
  k3s-node-labels: []
  k3sargs:
  - --disable=traefik@server:0
  ports:
  - 80:80@loadbalancer
  - 443:443@loadbalancer
  registries:
    create: ""
  runtime-labels: []
  volumes: []
DEBU[0000] Configuration:
agents: 1
image: rancher/k3s:v1.20.11-k3s1
network: ""
options:
  k3d:
    disableimagevolume: false
    disableloadbalancer: false
    disablerollback: false
    loadbalancer:
      configoverrides: []
    timeout: 5m0s
    wait: true
  kubeconfig:
    switchcurrentcontext: true
    updatedefaultkubeconfig: true
  runtime:
    agentsmemory: ""
    gpurequest: ""
    serversmemory: ""
registries:
  config: ""
  use:
  - own-registry.localhost:5000
servers: 1
subnet: ""
token: ""
DEBU[0000] ========== Simple Config ==========
{TypeMeta:{Kind:Simple APIVersion:k3d.io/v1alpha3} Name: Servers:1 Agents:1 ExposeAPI:{Host: HostIP: HostPort:} Image:rancher/k3s:v1.20.11-k3s1 Network: Subnet: ClusterToken: Volumes:[] Ports:[] Options:{K3dOptions:{Wait:true Timeout:5m0s DisableLoadbalancer:false DisableImageVolume:false NoRollback:false NodeHookActions:[] Loadbalancer:{ConfigOverrides:[]}} K3sOptions:{ExtraArgs:[] NodeLabels:[]} KubeconfigOptions:{UpdateDefaultKubeconfig:true SwitchCurrentContext:true} Runtime:{GPURequest: ServersMemory: AgentsMemory: Labels:[]}} Env:[] Registries:{Use:[own-registry.localhost:5000] Create:<nil> Config:}}
==========================
DEBU[0000] ========== Merged Simple Config ==========
{TypeMeta:{Kind:Simple APIVersion:k3d.io/v1alpha3} Name: Servers:1 Agents:1 ExposeAPI:{Host: HostIP: HostPort:53947} Image:rancher/k3s:v1.20.11-k3s1 Network: Subnet: ClusterToken: Volumes:[] Ports:[{Port:443:443 NodeFilters:[loadbalancer]} {Port:80:80 NodeFilters:[loadbalancer]}] Options:{K3dOptions:{Wait:true Timeout:5m0s DisableLoadbalancer:false DisableImageVolume:false NoRollback:false NodeHookActions:[] Loadbalancer:{ConfigOverrides:[]}} K3sOptions:{ExtraArgs:[{Arg:--disable=traefik NodeFilters:[server:0]}] NodeLabels:[]} KubeconfigOptions:{UpdateDefaultKubeconfig:true SwitchCurrentContext:true} Runtime:{GPURequest: ServersMemory: AgentsMemory: Labels:[]}} Env:[] Registries:{Use:[own-registry.localhost:5000] Create:<nil> Config:}}
==========================
INFO[0000] portmapping '443:443' targets the loadbalancer: defaulting to [servers:*:proxy agents:*:proxy]
INFO[0000] portmapping '80:80' targets the loadbalancer: defaulting to [servers:*:proxy agents:*:proxy]
DEBU[0000] generated loadbalancer config:
ports:
  80.tcp:
  - k3d-c1-server-0
  - k3d-c1-agent-0
  443.tcp:
  - k3d-c1-server-0
  - k3d-c1-agent-0
  6443.tcp:
  - k3d-c1-server-0
settings:
  workerConnections: 1024
DEBU[0000] ===== Merged Cluster Config =====
&{TypeMeta:{Kind: APIVersion:} Cluster:{Name:c1 Network:{Name:k3d-c1 ID: External:false IPAM:{IPPrefix:zero IPPrefix IPsUsed:[] Managed:false} Members:[]} Token: Nodes:[0xc00019c600 0xc00019d500 0xc00019d680] InitNode:<nil> ExternalDatastore:<nil> KubeAPI:0xc00033eec0 ServerLoadBalancer:0xc000319630 ImageVolume:} ClusterCreateOpts:{DisableImageVolume:false WaitForServer:true Timeout:5m0s DisableLoadBalancer:false GPURequest: ServersMemory: AgentsMemory: NodeHooks:[] GlobalLabels:map[app:k3d] GlobalEnv:[] Registries:{Create:<nil> Use:[0xc000379980] Config:<nil>}} KubeconfigOpts:{UpdateDefaultKubeconfig:true SwitchCurrentContext:true}}
===== ===== =====
DEBU[0000] ===== Processed Cluster Config =====
&{TypeMeta:{Kind: APIVersion:} Cluster:{Name:c1 Network:{Name:k3d-c1 ID: External:false IPAM:{IPPrefix:zero IPPrefix IPsUsed:[] Managed:false} Members:[]} Token: Nodes:[0xc00019c600 0xc00019d500 0xc00019d680] InitNode:<nil> ExternalDatastore:<nil> KubeAPI:0xc00033eec0 ServerLoadBalancer:0xc000319630 ImageVolume:} ClusterCreateOpts:{DisableImageVolume:false WaitForServer:true Timeout:5m0s DisableLoadBalancer:false GPURequest: ServersMemory: AgentsMemory: NodeHooks:[] GlobalLabels:map[app:k3d] GlobalEnv:[] Registries:{Create:<nil> Use:[0xc000379980] Config:<nil>}} KubeconfigOpts:{UpdateDefaultKubeconfig:true SwitchCurrentContext:true}}
===== ===== =====
DEBU[0000] '--kubeconfig-update-default set: enabling wait-for-server
INFO[0000] Prep: Network
INFO[0000] Created network 'k3d-c1'
INFO[0000] Created volume 'k3d-c1-images'
DEBU[0000] Trying to find registry own-registry.localhost
DEBU[0000] no netlabel present on container /k3d-own-registry.localhost
DEBU[0000] failed to get IP for container /k3d-own-registry.localhost as we couldn't find the cluster network
DEBU[0000] no netlabel present on container /k3d-own-registry.localhost
DEBU[0000] failed to get IP for container /k3d-own-registry.localhost as we couldn't find the cluster network
DEBU[0000] no netlabel present on container /k3d-own-registry.localhost
DEBU[0000] failed to get IP for container /k3d-own-registry.localhost as we couldn't find the cluster network
INFO[0000] Starting new tools node...
DEBU[0000] Created container k3d-c1-tools (ID: 35d9102fe7fe369e35fe52b31a0b21c6bef7f931ec39a6a62182c171a4b40de5)
DEBU[0000] Node k3d-c1-tools Start Time: 2021-10-14 14:40:37.536194 +0200 CEST m=+0.417687421
INFO[0000] Starting Node 'k3d-c1-tools'
DEBU[0000] Truncated 2021-10-14 12:40:38.066550848 +0000 UTC to 2021-10-14 12:40:38 +0000 UTC
INFO[0001] Creating node 'k3d-c1-server-0'
DEBU[0001] DockerHost:
DEBU[0001] Created container k3d-c1-server-0 (ID: 58fc2f76961e17f0f5d6f943ac6436b0602a235384b9dede9cc9991be87d3521)
DEBU[0001] Created node 'k3d-c1-server-0'
INFO[0001] Creating node 'k3d-c1-agent-0'
DEBU[0001] Created container k3d-c1-agent-0 (ID: 4d8369abe37882d549cba2fd88dab49068620460a1812fdaf1edd3a94ed106ad)
DEBU[0001] Created node 'k3d-c1-agent-0'
INFO[0001] Creating LoadBalancer 'k3d-c1-serverlb'
DEBU[0001] Created container k3d-c1-serverlb (ID: bb544230c74c2ec0ab4515f5c981a6ba8df5fd71f4dfd5455e098abd87cbb144)
DEBU[0001] Created loadbalancer 'k3d-c1-serverlb'
INFO[0001] Using the k3d-tools node to gather environment information
DEBU[0001] no netlabel present on container /k3d-c1-tools
DEBU[0001] failed to get IP for container /k3d-c1-tools as we couldn't find the cluster network
DEBU[0001] no netlabel present on container /k3d-c1-tools
DEBU[0001] failed to get IP for container /k3d-c1-tools as we couldn't find the cluster network
DEBU[0001] Executing command '[sh -c getent ahostsv4 'host.docker.internal']' in node 'k3d-c1-tools'
DEBU[0002] Exec process in node 'k3d-c1-tools' exited with '0'
DEBU[0002] Hostname 'host.docker.internal' -> Address '192.168.65.2'
INFO[0002] Starting cluster 'c1'
INFO[0002] Starting servers...
DEBU[0002] Deleting node k3d-c1-tools ...
DEBU[0002] No fix enabled.
DEBU[0002] Node k3d-c1-server-0 Start Time: 2021-10-14 14:40:39.880629 +0200 CEST m=+2.762233983
INFO[0002] Deleted k3d-c1-tools
INFO[0002] Starting Node 'k3d-c1-server-0'
DEBU[0003] Truncated 2021-10-14 12:40:40.507381917 +0000 UTC to 2021-10-14 12:40:40 +0000 UTC
DEBU[0003] Waiting for node k3d-c1-server-0 to get ready (Log: 'k3s is up and running')
DEBU[0009] Finished waiting for log message 'k3s is up and running' from node 'k3d-c1-server-0'
INFO[0009] Starting agents...
DEBU[0009] No fix enabled.
DEBU[0009] Node k3d-c1-agent-0 Start Time: 2021-10-14 14:40:46.958177 +0200 CEST m=+9.840119863
INFO[0010] Starting Node 'k3d-c1-agent-0'
DEBU[0010] Truncated 2021-10-14 12:40:47.582785575 +0000 UTC to 2021-10-14 12:40:47 +0000 UTC
DEBU[0010] Waiting for node k3d-c1-agent-0 to get ready (Log: 'Successfully registered node')
DEBU[0022] Finished waiting for log message 'Successfully registered node' from node 'k3d-c1-agent-0'
INFO[0022] Starting helpers...
DEBU[0022] Node k3d-c1-serverlb Start Time: 2021-10-14 14:40:59.83467 +0200 CEST m=+22.717227947
INFO[0022] Starting Node 'k3d-c1-serverlb'
DEBU[0023] Truncated 2021-10-14 12:41:00.478078007 +0000 UTC to 2021-10-14 12:41:00 +0000 UTC
DEBU[0023] Waiting for node k3d-c1-serverlb to get ready (Log: 'start worker processes')
DEBU[0029] FIXME: Got an status-code for which error does not match any expected type!!!: -1  module=api status_code=-1
ERRO[0029] Failed Cluster Start: Failed to add one or more helper nodes: Node k3d-c1-serverlb failed to get ready: Failed waiting for log message 'start worker processes' from node 'k3d-c1-serverlb': failed ton inspect container 'bb544230c74c2ec0ab4515f5c981a6ba8df5fd71f4dfd5455e098abd87cbb144': error during connect: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/containers/bb544230c74c2ec0ab4515f5c981a6ba8df5fd71f4dfd5455e098abd87cbb144/json": dial unix /var/run/docker.sock: socket: too many open files
ERRO[0029] Failed to create cluster >>> Rolling Back
INFO[0029] Deleting cluster 'c1'
WARNING: Error loading config file: /Users/D073497/.docker/config.json: open /Users/D073497/.docker/config.json: too many open files
DEBU[0029] FIXME: Got an status-code for which error does not match any expected type!!!: -1  module=api status_code=-1
ERRO[0029] Failed to get nodes for cluster 'c1': docker failed to get containers with labels 'map[k3d.cluster:c1]': failed to list containers: error during connect: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/containers/json?all=1&filters=%7B%22label%22%3A%7B%22app%3Dk3d%22%3Atrue%2C%22k3d.cluster%3Dc1%22%3Atrue%7D%7D&limit=0": dial unix /var/run/docker.sock: socket: too many open files
ERRO[0029] failed to get cluster: No nodes found for given cluster
FATA[0029] Cluster creation FAILED, also FAILED to rollback changes!

Which OS & Architecture

macOS

Which version of k3d

$ k3d version
k3d version v5.0.1
k3s version v1.21.5-k3s2 (default)

Which version of docker

$ docker version
Client:
 Cloud integration: 1.0.17
 Version:           20.10.8
 API version:       1.41
 Go version:        go1.16.6
 Git commit:        3967b7d
 Built:             Fri Jul 30 19:55:20 2021
 OS/Arch:           darwin/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.8
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.16.6
  Git commit:       75249d8
  Built:            Fri Jul 30 19:52:10 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.9
  GitCommit:        e25210fe30a0a703442421b0f60afac609f950a3
 runc:
  Version:          1.0.1
  GitCommit:        v1.0.1-0-g4144b63
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

$ docker info
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Build with BuildKit (Docker Inc., v0.6.1-docker)
  compose: Docker Compose (Docker Inc., v2.0.0-rc.3)
  scan: Docker Scan (Docker Inc., v0.8.0)

Server:
 Containers: 38
  Running: 37
  Paused: 0
  Stopped: 1
 Images: 29
 Server Version: 20.10.8
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc io.containerd.runc.v2 io.containerd.runtime.v1.linux
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: e25210fe30a0a703442421b0f60afac609f950a3
 runc version: v1.0.1-0-g4144b63
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 5.10.47-linuxkit
 Operating System: Docker Desktop
 OSType: linux
 Architecture: x86_64
 CPUs: 6
 Total Memory: 8.746GiB
 Name: docker-desktop
 ID: OOQ5:YAWJ:WD54:2OKW:K2QS:WU72:R5RR:YBPB:OFUV:NDYI:OYC3:TIUA
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 HTTP Proxy: http.docker.internal:3128
 HTTPS Proxy: http.docker.internal:3128
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

dennis-ge avatar Oct 15 '21 11:10 dennis-ge

All of my colleagues running on macOS who tried to install v5 also encountered this error.

I'm using Ubuntu 20.04, but there the installation succeeded, so I'm guessing this is maybe tied to some resource limits imposed in Docker Desktop?

agustingomes avatar Oct 22 '21 07:10 agustingomes

Happens to me too, on Fedora 34:

❯ docker -v
Docker version 20.10.9, build c2ea9bc
❯ uname -r
5.13.12-200.fc34.x86_64
❯ cat /etc/os-release
NAME=Fedora
VERSION="34 (Workstation Edition)"
ID=fedora
VERSION_ID=34
VERSION_CODENAME=""
PLATFORM_ID="platform:f34"
PRETTY_NAME="Fedora 34 (Workstation Edition)"
ANSI_COLOR="0;38;2;60;110;180"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:34"
HOME_URL="https://fedoraproject.org/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora/f34/system-administrators-guide/"
SUPPORT_URL="https://fedoraproject.org/wiki/Communicating_and_getting_help"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=34
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=34
PRIVACY_POLICY_URL="https://fedoraproject.org/wiki/Legal:PrivacyPolicy"
VARIANT="Workstation Edition"
VARIANT_ID=workstation
❯ k3d cluster create --agents 5 --servers 3
INFO[0000] Prep: Network                                
INFO[0000] Created network 'k3d-k3s-default'            
INFO[0000] Created volume 'k3d-k3s-default-images'      
INFO[0000] Creating initializing server node            
INFO[0000] Creating node 'k3d-k3s-default-server-0'     
INFO[0000] Starting new tools node...                   
INFO[0000] Starting Node 'k3d-k3s-default-tools'        
INFO[0001] Creating node 'k3d-k3s-default-server-1'     
INFO[0002] Creating node 'k3d-k3s-default-server-2'     
INFO[0002] Creating node 'k3d-k3s-default-agent-0'      
INFO[0002] Creating node 'k3d-k3s-default-agent-1'      
INFO[0002] Creating node 'k3d-k3s-default-agent-2'      
INFO[0002] Creating node 'k3d-k3s-default-agent-3'      
INFO[0002] Creating node 'k3d-k3s-default-agent-4'      
INFO[0002] Creating LoadBalancer 'k3d-k3s-default-serverlb' 
INFO[0002] Using the k3d-tools node to gather environment information 
INFO[0002] HostIP: using network gateway...             
INFO[0002] Starting cluster 'k3s-default'               
INFO[0002] Starting the initializing server...          
INFO[0002] Starting Node 'k3d-k3s-default-server-0'     
INFO[0003] Deleted k3d-k3s-default-tools                
INFO[0003] Starting servers...                          
INFO[0003] Starting Node 'k3d-k3s-default-server-1'     
INFO[0060] Starting Node 'k3d-k3s-default-server-2'     
WARNING: Error loading config file: /home/matul/.docker/config.json: open /home/matul/.docker/config.json: too many open files
ERRO[0126] Failed Cluster Start: Failed to start server k3d-k3s-default-server-2: Node k3d-k3s-default-server-2 failed to get ready: Failed waiting for log message 'k3s is up and running' from node 'k3d-k3s-default-server-2': failed to get container for node 'k3d-k3s-default-server-2': Failed to list containers: error during connect: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/containers/json?all=1&filters=%7B%22label%22%3A%7B%22app%3Dk3d%22%3Atrue%2C%22k3d.cluster.imageVolume%3Dk3d-k3s-default-images%22%3Atrue%2C%22k3d.cluster.network.external%3Dfalse%22%3Atrue%2C%22k3d.cluster.network.id%3Dfb0a2002461808fb629611edd8a31e8832e4d0657b440b2884504b2417904918%22%3Atrue%2C%22k3d.cluster.network.iprange%3D172.19.0.0%2F16%22%3Atrue%2C%22k3d.cluster.network%3Dk3d-k3s-default%22%3Atrue%2C%22k3d.cluster.token%3DEIyWVswHHtjRvqtsbudj%22%3Atrue%2C%22k3d.cluster.url%3Dhttps%3A%2F%2Fk3d-k3s-default-server-0%3A6443%22%3Atrue%2C%22k3d.cluster%3Dk3s-default%22%3Atrue%2C%22k3d.role%3Dserver%22%3Atrue%2C%22k3d.server.api.host%3D0.0.0.0%22%3Atrue%2C%22k3d.server.api.hostIP%3D0.0.0.0%22%3Atrue%2C%22k3d.server.api.port%3D42217%22%3Atrue%2C%22k3d.server.init%3Dfalse%22%3Atrue%2C%22k3d.version%3Dv5.0.1%22%3Atrue%7D%2C%22name%22%3A%7B%22%5E%2F%3F%28k3d-%29%3Fk3d-k3s-default-server-2%24%22%3Atrue%7D%7D&limit=0": dial unix /var/run/docker.sock: socket: too many open files 
ERRO[0126] Failed to create cluster >>> Rolling Back    
INFO[0126] Deleting cluster 'k3s-default'               
WARNING: Error loading config file: /home/matul/.docker/config.json: open /home/matul/.docker/config.json: too many open files
ERRO[0126] Failed to get nodes for cluster 'k3s-default': docker failed to get containers with labels 'map[k3d.cluster:k3s-default]': failed to list containers: error during connect: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/containers/json?all=1&filters=%7B%22label%22%3A%7B%22app%3Dk3d%22%3Atrue%2C%22k3d.cluster%3Dk3s-default%22%3Atrue%7D%7D&limit=0": dial unix /var/run/docker.sock: socket: too many open files 
ERRO[0126] failed to get cluster: No nodes found for given cluster 
FATA[0126] Cluster creation FAILED, also FAILED to rollback changes!

matulek avatar Oct 22 '21 12:10 matulek

Hi @dennis-ge , thanks for opening this issue! Unfortunately, there's not much we can do here. I just went over the code again to ensure that k3d properly closes all connections (which require file descriptors) as soon as they're no longer needed, but there are probably still a lot of them open, especially when using multiple nodes (as k3d e.g. needs to follow the logs of every node to get status information). On Linux hosts, you can increase the limits permanently via sysctl.conf (as per https://www.ibm.com/support/pages/increasing-maximum-number-open-files-linux-host) :thinking:
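As a concrete sketch of making the limit permanent on a Linux host (complementing the linked article), entries like the following in /etc/security/limits.conf raise the per-user file-descriptor limits after the next login; the values are illustrative placeholders, not recommendations from this thread:

```
# /etc/security/limits.conf (illustrative values)
# <domain>  <type>  <item>   <value>
*           soft    nofile   65535
*           hard    nofile   65535
```

Note this requires pam_limits to be active for the login path used, and it affects host processes (including the docker client and k3d), not the limits inside the node containers.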

iwilltry42 avatar Oct 25 '21 10:10 iwilltry42

FWIW - I recently found that in order for docker containers to go beyond the default 1024 limit, you must pass in a ulimit argument. Since k3d is built on top of docker, would this apply/help here?

For context, I too am running into this: when deploying too many applications in k3d (e.g. a company app plus Loki), Loki cannot start due to too many open files.

lgass avatar Aug 26 '22 21:08 lgass

We even faced this in our GitHub Actions pipelines. It indeed has to be "fixed" at the host level, not the k3d level, unfortunately.

@lgass can you elaborate please? Increasing the limit with ulimit was mentioned before already. Do you have additional insights? Do you mean setting something like docker's --ulimit flag for the k3d node containers? That would be a different issue then.

iwilltry42 avatar Aug 29 '22 10:08 iwilltry42

Yes: even after setting it at the host kernel level, I (outside of k3d) have needed to tell docker to respect a higher limit than the default (1,024) using the --ulimit parameter. Since I have faced similar issues with applications inside k3d, I was thinking that this parameter could perhaps also be useful when initializing the k3d node containers, if that has not already been taken into account.
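To make the distinction concrete: docker's --ulimit flag sets the limit inside a single container (k3d does not currently expose this for its node containers, per this thread), while ulimit on the host only affects host processes such as the k3d CLI itself. A minimal sketch, with the docker invocation shown as a comment since it assumes a running daemon:

```shell
# Per-container limit via docker run (assumes a running Docker daemon):
#   docker run --rm --ulimit nofile=65535:65535 alpine sh -c 'ulimit -n'

# Host side: the soft limit is what processes get by default; an
# unprivileged `ulimit -n <value>` can raise it only up to the hard limit.
soft=$(ulimit -Sn)
hard=$(ulimit -Hn)
echo "host soft=$soft hard=$hard"
```

So the two limits are independent: raising the host's ulimit fixes the k3d CLI hitting the socket/file-descriptor cap, while --ulimit (or the Docker daemon's default-ulimits setting) would be needed for workloads inside the containers.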

lgass avatar Aug 29 '22 13:08 lgass