Several pods do not start, encounter "too many open files" error
While setting up a Kubeflow cluster using the master branch at commit 3dad839f, four pods encounter a "too many open files" error.
For the k8s cluster, I'm using a local k3d cluster (https://k3d.io) on macOS 11.6.1.
At the end of deploying Kubeflow, these are the statuses of the 4 failing pods.
kubectl get pod -A | grep -v Run | grep -v NAME
kubeflow ml-pipeline-8c4b99589-gcvmz 1/2 CrashLoopBackOff 15 63m
kubeflow kfserving-controller-manager-0 1/2 CrashLoopBackOff 15 63m
kubeflow profiles-deployment-89f7d88b-hp697 1/2 CrashLoopBackOff 15 63m
kubeflow katib-controller-68c47fbf8b-d6mpj 0/1 CrashLoopBackOff 16 63m
The cluster has been torn down and rebuilt several times. Each time, the same 4 pods encounter the "too many open files" error. All other pods successfully reach Running status.
According to ulimit -n on the nodes, the nodes have a very high setting for that limit: 1048576. Since this is run on macOS, I configured launchctl to increase maxfiles from 256 to 524288.
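For reference, the limit was checked and raised with roughly the following launchctl invocations (a sketch; on macOS the change only lasts for the current session unless it is persisted, e.g. via a LaunchDaemon plist):
launchctl limit maxfiles                        # show current soft/hard maxfiles limits
sudo launchctl limit maxfiles 524288 524288     # raise both limits for the current session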
I'm new to kubeflow, so any guidance offered will be appreciated.
The following diagnostic data were collected:
Log extract from failed pods
kubectl logs ml-pipeline-8c4b99589-gcvmz
Using deprecated annotation `kubectl.kubernetes.io/default-logs-container` in pod/ml-pipeline-8c4b99589-gcvmz. Please use `kubectl.kubernetes.io/default-container` instead
2021/12/11 13:01:59 too many open files
kubectl logs kfserving-controller-manager-0 -c manager
<<<< deleted info level messages>>>>
{"level":"error","ts":1639227716.1910038,"logger":"controller-runtime.manager","msg":"error received after stop sequence was engaged","error":"Timeout: failed waiting for *v1beta1.InferenceService Informer to sync","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:132\nsigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/internal.go:547"}
{"level":"error","ts":1639227716.1911373,"logger":"controller-runtime.manager","msg":"error received after stop sequence was engaged","error":"Timeout: failed waiting for *v1alpha1.TrainedModel Informer to sync","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:132\nsigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/internal.go:547"}
{"level":"error","ts":1639227716.1912212,"logger":"entrypoint","msg":"unable to run the manager","error":"too many open files","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:132\nmain.main\n\t/go/src/github.com/kubeflow/kfserving/cmd/manager/main.go:183\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203"}
kubectl logs profiles-deployment-89f7d88b-hp697 -c manager
I1211 13:02:40.188855 1 request.go:645] Throttling request took 1.036224909s, request: GET:https://10.43.0.1:443/apis/flows.knative.dev/v1?timeout=32s
2021-12-11T13:02:41.646Z INFO controller-runtime.metrics metrics server is starting to listen {"addr": ":8080"}
2021-12-11T13:02:41.646Z ERROR setup unable to create controller {"controller": "Profile", "error": "Failed to start file watcher: too many open files", "errorVerbose": "too many open files\nFailed to start file watcher\ngithub.com/kubeflow/kubeflow/components/profile-controller/controllers.(*ProfileReconciler).SetupWithManager\n\t/workspace/controllers/profile_controller.go:381\nmain.main\n\t/workspace/main.go:93\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:204\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1374"}
runtime.main
/usr/local/go/src/runtime/proc.go:204
kubectl logs katib-controller-68c47fbf8b-d6mpj
<<<<<<<< removed info level messages >>>>>>>>>>>>
{"level":"error","ts":1639227826.322595,"logger":"controller-runtime.manager","msg":"error received after stop sequence was engaged","error":"Timeout: failed waiting for *v1beta1.Suggestion Informer to sync","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:132\nsigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/internal.go:529"}
{"level":"error","ts":1639227826.3227415,"logger":"controller-runtime.manager","msg":"error received after stop sequence was engaged","error":"Timeout: failed waiting for *v1beta1.Experiment Informer to sync","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:132\nsigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/internal.go:529"}
{"level":"error","ts":1639227826.32281,"logger":"entrypoint","msg":"Unable to run the manager","error":"too many open files","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:132\nmain.main\n\t/go/src/github.com/kubeflow/katib/cmd/katib-controller/v1beta1/main.go:128\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:255"}
Kubeflow was deployed by running kustomize build ${component} | kubectl apply -f - on each of the following components, in the order shown (a scripted version is sketched after the list):
# cert manager
common/cert-manager/cert-manager/base \
common/cert-manager/kubeflow-issuer/base \
# istio
common/istio-1-9/istio-crds/base \
common/istio-1-9/istio-namespace/base \
common/istio-1-9/istio-install/base \
#DEX
common/dex/overlays/istio \
# OIDC Auth Service
common/oidc-authservice/base \
# knative serving
common/knative/knative-serving/base \
common/istio-1-9/cluster-local-gateway/base \
# inference event logging
common/knative/knative-eventing/base \
# kubeflow namespace
common/kubeflow-namespace/base \
# kubeflow roles
common/kubeflow-roles/base \
# kubeflow istio resources
common/istio-1-9/kubeflow-istio-resources/base \
# kubeflow pipelines
apps/pipeline/upstream/env/platform-agnostic-multi-user-pns \
# KFServing
apps/kfserving/upstream/overlays/kubeflow \
# Katib
apps/katib/upstream/installs/katib-with-kubeflow \
# Central Dashboard
apps/centraldashboard/upstream/overlays/istio \
# Admission Controler
apps/admission-webhook/upstream/overlays/cert-manager \
# Notebooks
apps/jupyter/notebook-controller/upstream/overlays/kubeflow \
# Jupyter web app
apps/jupyter/jupyter-web-app/upstream/overlays/istio \
# Profiles + KFAM
apps/profiles/upstream/overlays/kubeflow \
# Volumes Web app
apps/volumes-web-app/upstream/overlays/istio \
# Tensorboard
apps/tensorboard/tensorboards-web-app/upstream/overlays/istio \
# Training Operator
apps/training-operator/upstream/overlays/kubeflow \
# User Namespace
common/user-namespace/base \
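For reference, a minimal sketch of how the per-component apply can be scripted, assuming it is run from the root of the manifests checkout and that COMPONENTS (an illustrative variable name) is a shell array holding the paths above, in order:
# COMPONENTS=( common/cert-manager/cert-manager/base ... common/user-namespace/base )
for component in "${COMPONENTS[@]}"; do
  kustomize build "$component" | kubectl apply -f -
done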
Platform
macOS: 11.6.1
MacBook Pro 2019 (Intel), 16 GB RAM
Software Versions:
k3d version
k3d version v5.1.0
k3s version v1.21.5-k3s2 (default)
docker version
Client:
Cloud integration: v1.0.22
Version: 20.10.11
API version: 1.41
Go version: go1.16.10
Git commit: dea9396
Built: Thu Nov 18 00:36:09 2021
OS/Arch: darwin/amd64
Context: default
Experimental: true
Server: Docker Engine - Community
Engine:
Version: 20.10.11
API version: 1.41 (minimum version 1.12)
Go version: go1.16.9
Git commit: 847da18
Built: Thu Nov 18 00:35:39 2021
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.4.12
GitCommit: 7b11cfaabd73bb80907dd23182b9347b4245eb5d
runc:
Version: 1.0.2
GitCommit: v1.0.2-0-g52b36a2
docker-init:
Version: 0.19.0
GitCommit: de40ad0
kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.4", GitCommit:"b695d79d4f967c403a96986f1750a35eb75e75f1", GitTreeState:"clean", BuildDate:"2021-11-17T15:48:33Z", GoVersion:"go1.16.10", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.5+k3s2", GitCommit:"724ef700bab896ff252a75e2be996d5f4ff1b842", GitTreeState:"clean", BuildDate:"2021-10-05T19:59:14Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}
kustomize version
Version: {KustomizeVersion:3.2.0 GitCommit:a3103f1e62ddb5b696daa3fd359bb6f2e8333b49 BuildDate:2019-09-18T16:26:36Z GoOs:darwin GoArch:amd64}
k3d cluster nodes
kubectl get node -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
k3d-kubeflow-server-0 Ready control-plane,master 79m v1.21.5+k3s2 172.19.0.2 <none> Unknown 5.10.76-linuxkit containerd://1.4.11-k3s1
k3d-kubeflow-agent-0 Ready <none> 79m v1.21.5+k3s2 172.19.0.3 <none> Unknown 5.10.76-linuxkit containerd://1.4.11-k3s1
ulimit for the two nodes
ulimit -a # on server node
core file size (blocks) (-c) 0
data seg size (kb) (-d) unlimited
scheduling priority (-e) 0
file size (blocks) (-f) unlimited
pending signals (-i) 51481
max locked memory (kb) (-l) 64
max memory size (kb) (-m) unlimited
open files (-n) 1048576
POSIX message queues (bytes) (-q) 819200
real-time priority (-r) 0
stack size (kb) (-s) 8192
cpu time (seconds) (-t) unlimited
max user processes (-u) unlimited
virtual memory (kb) (-v) unlimited
file locks (-x) unlimited
ulimit -a # on worker node
core file size (blocks) (-c) 0
data seg size (kb) (-d) unlimited
scheduling priority (-e) 0
file size (blocks) (-f) unlimited
pending signals (-i) 51481
max locked memory (kb) (-l) 64
max memory size (kb) (-m) unlimited
open files (-n) 1048576
POSIX message queues (bytes) (-q) 819200
real-time priority (-r) 0
stack size (kb) (-s) 8192
cpu time (seconds) (-t) unlimited
max user processes (-u) unlimited
virtual memory (kb) (-v) unlimited
file locks (-x) unlimited
@jimthompson5802 I've also seen this happening in a KinD cluster I had, for the same Deployments. In my case I mitigated these errors by increasing my laptop's fs.inotify.max_user_{watches,instances}
settings.
Not sure if this will also work for k3s though.
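For comparison, the current values on a Linux node (or inside the k3d/KinD node container) can be checked with something like:
sysctl fs.inotify.max_user_watches fs.inotify.max_user_instances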
@kimwnasptd thank you for the suggestion.
In my case I mitigated these errors by increasing my laptop's
fs.inotify.max_user_{watches,instances}
settings.
I just want to confirm the parameter names cited are from Linux. If this is correct, then I believe the equivalent parameters in macOS are these:
$ sysctl -a | grep "kern.maxfiles"
kern.maxfiles: 16777216
kern.maxfilesperproc: 524288
My belief about the parameter names comes from this posting.
If this is the case, then the change did not seem to work. What values did you use to get KinD to work?
Again, thank you for taking the time to respond to my question.
I am also facing a similar issue on KinD. Some of the pods are going into CrashLoopBackOff state. The error is as below:
Error starting filewatcher: 'too many open files'. Configuration changes will not be detected!
@kimwnasptd Can you please share the equivalent settings (fs.inotify.max_user_{watches,instances}) we can apply on Mac? Thanks.
@jimthompson5802 @skothawa-tibco
I'm also using Mac, Docker Desktop and k3d.
What worked for me was to open Docker Preferences -> Docker Engine and add to the config:
"default-ulimits": {
"nofile": {
"Soft": 640000,
"Hard": 640000,
"Name": "nofile"
}
},
This is simply 10x the defaults from the Docker daemon configuration.
Restart Docker.
After killing all crashing pods, they got recreated successfully.
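To confirm the raised limit actually reaches the cluster nodes after the restart, the node containers can be checked directly, e.g. (k3d-kubeflow-server-0 is the server node name from the kubectl get node output above; use whatever docker ps shows for your cluster):
docker exec -it k3d-kubeflow-server-0 sh -c 'ulimit -n'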
Not sure if related, because I also made another change before restarting Docker. Check whether your mysql pod is failing due to too many open files or something else. Mine was crashing because of the --initialize specified but the data directory has files in it error. I simply deleted both the mysql PV and PVC and recreated them from the manifests.
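A sketch of what such a delete/recreate can look like, assuming the pipelines mysql PVC lives in the kubeflow namespace (the names below are only illustrative placeholders, not necessarily what was used here; the manifest path is the pipelines entry from the deployment list above):
kubectl get pvc,pv -n kubeflow                      # find the actual mysql PVC/PV names
kubectl delete pvc <mysql-pvc-name> -n kubeflow     # illustrative placeholder
kubectl delete pv <pv-bound-to-that-claim>          # illustrative placeholder
kustomize build apps/pipeline/upstream/env/platform-agnostic-multi-user-pns | kubectl apply -f -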
I am using Mac, Docker Desktop and KinD. @jimthompson5802 I tried the above settings but had no luck. The settings are updated both on the host machine and in the Docker daemon configuration, but the issue remains the same. Can someone please look into this?
"default-ulimits": {
"nofile": {
"Soft": 640000,
"Hard": 640000,
"Name": "nofile"
}
},
ulimit -a
-t: cpu time (seconds) unlimited
-f: file size (blocks) unlimited
-d: data seg size (kbytes) unlimited
-s: stack size (kbytes) 8192
-c: core file size (blocks) 0
-v: address space (kbytes) unlimited
-l: locked-in-memory size (kbytes) unlimited
-u: processes 2048
-n: file descriptors 524288
@skothawa-tibco
Out of curiosity, could you try running launchctl limit maxfiles 200000 in a terminal, restart Docker, kill the failing containers, and see if that helps?
On the host machine terminal we can see the values below:
launchctl limit maxfiles
maxfiles 524288 5242880
After exec'ing into the worker node, I get the error below:
docker exec -it 85754ed34564 bash
bash-5.0# launchctl limit maxfiles
bash: launchctl: command not found
bash-5.0#
Below are the running KinD containers:
docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
85754ed34564 kindest/node:v1.20.7 "/usr/local/bin/entr…" 32 minutes ago Up 32 minutes worker
f072d9bc99b1 kindest/node:v1.20.7 "/usr/local/bin/entr…" 32 minutes ago Up 32 minutes 0.0.0.0:80->80/tcp, 0.0.0.0:443->443/tcp, 127.0.0.1:6443->6443/tcp control-plane
5800621e01fd rpardini/docker-registry-proxy:0.6.3 "/entrypoint.sh" 32 minutes ago Up 32 minutes 80/tcp, 3128/tcp, 8081-8082/tcp registry-proxy
ulimit values inside the worker node:
docker exec -it 85754ed34564 bash
root@tibco-cic-worker:/# ulimit -a
real-time non-blocking time (microseconds, -R) unlimited
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 95734
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 640000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
OS details: macOS Monterey 12.1
@bartgras We already have values higher than you suggested. Let me know if there are any other pointers that can be tried.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in one week if no further activity occurs. Thank you for your contributions.
@jimthompson5802 I've also seen this happening in a KinD cluster I had, for the same Deployments. In my case I mitigated these errors by increasing my laptop's
fs.inotify.max_user_{watches,instances}
settings. Not sure if this will also work for k3s though.
This worked for me too:
sudo sysctl fs.inotify.max_user_instances=1280
sudo sysctl fs.inotify.max_user_watches=655360
(10x the previous values) solved this problem on a k0s instance.
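On a Linux host these can be made persistent across reboots with a sysctl drop-in, e.g. (file name is illustrative):
# /etc/sysctl.d/99-inotify.conf
fs.inotify.max_user_instances = 1280
fs.inotify.max_user_watches = 655360
# apply without rebooting: sudo sysctl --system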
@jimthompson5802 I've also seen this happening in a KinD cluster I had, for the same Deployments. In my case I mitigated these errors by increasing my laptop's
fs.inotify.max_user_{watches,instances}
settings. Not sure if this will also work for k3s though.
This worked for me too:
sudo sysctl fs.inotify.max_user_instances=1280
sudo sysctl fs.inotify.max_user_watches=655360
(10x previous values) solved this problem on k0s instance
Thanks for saving my time. I solved my issue with the above commands.
I've hit this issue today while testing 1.6 on microk8s. The pods affected were: katib-controller, kubeflow-profiles, kfp-api and kfp-persistence.
@mstopa's workaround did fix it, but I'm wondering if we are doing something wrong in these components for this to occur; could we possibly be more efficient in the way we lease API watchers?
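For anyone digging into that, a rough way to count how many inotify instances are currently held on a node (plain /proc inspection on a Linux node or inside the node container, nothing Kubeflow-specific):
find /proc/*/fd -lname anon_inode:inotify 2>/dev/null | wc -l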
/close
There has been no activity for a long time. Please reopen if necessary.
@juliusvonkohout: Closing this issue.
In response to this:
/close
There has been no activity for a long time. Please reopen if necessary.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.