Several pods do not start, encounter "too many open files" error
While setting up a Kubeflow cluster using the master branch at commit 3dad839f, four pods encounter a "too many open files" error.
For the k8s cluster, I'm using a local k3d cluster (https://k3d.io) on macOS 11.6.1.
At the end of deploying Kubeflow, these are the statuses of the 4 failing pods.
kubectl get pod -A | grep -v Run | grep -v NAME
kubeflow ml-pipeline-8c4b99589-gcvmz 1/2 CrashLoopBackOff 15 63m
kubeflow kfserving-controller-manager-0 1/2 CrashLoopBackOff 15 63m
kubeflow profiles-deployment-89f7d88b-hp697 1/2 CrashLoopBackOff 15 63m
kubeflow katib-controller-68c47fbf8b-d6mpj 0/1 CrashLoopBackOff 16 63m
The cluster has been torn down and rebuilt several times. Each time, the same 4 pods encounter the "too many open files" error. All other pods successfully reach Running status.
According to ulimit -n on the nodes, the nodes have a very high setting for that limit: 1048576. Since this is run on macOS, I configured launchctl to increase maxfiles from 256 to 524288.
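For reference, the limit was checked and raised with roughly the following launchctl invocations (a sketch; on macOS the change only lasts for the current session unless it is persisted, e.g. via a LaunchDaemon plist):
launchctl limit maxfiles                        # show current soft/hard maxfiles limits
sudo launchctl limit maxfiles 524288 524288     # raise both limits for the current session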
I'm new to kubeflow, so any guidance offered will be appreciated.
The following diagnostic data were collected:
Log extract from failed pods
kubectl logs ml-pipeline-8c4b99589-gcvmz
Using deprecated annotation `kubectl.kubernetes.io/default-logs-container` in pod/ml-pipeline-8c4b99589-gcvmz. Please use `kubectl.kubernetes.io/default-container` instead
2021/12/11 13:01:59 too many open files
kubectl logs kfserving-controller-manager-0 -c manager
<<<< deleted info level messages>>>>
{"level":"error","ts":1639227716.1910038,"logger":"controller-runtime.manager","msg":"error received after stop sequence was engaged","error":"Timeout: failed waiting for *v1beta1.InferenceService Informer to sync","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:132\nsigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/internal.go:547"}
{"level":"error","ts":1639227716.1911373,"logger":"controller-runtime.manager","msg":"error received after stop sequence was engaged","error":"Timeout: failed waiting for *v1alpha1.TrainedModel Informer to sync","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:132\nsigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/internal.go:547"}
{"level":"error","ts":1639227716.1912212,"logger":"entrypoint","msg":"unable to run the manager","error":"too many open files","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:132\nmain.main\n\t/go/src/github.com/kubeflow/kfserving/cmd/manager/main.go:183\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203"}
kubectl logs profiles-deployment-89f7d88b-hp697 -c manager
I1211 13:02:40.188855 1 request.go:645] Throttling request took 1.036224909s, request: GET:https://10.43.0.1:443/apis/flows.knative.dev/v1?timeout=32s
2021-12-11T13:02:41.646Z INFO controller-runtime.metrics metrics server is starting to listen {"addr": ":8080"}
2021-12-11T13:02:41.646Z ERROR setup unable to create controller {"controller": "Profile", "error": "Failed to start file watcher: too many open files", "errorVerbose": "too many open files\nFailed to start file watcher\ngithub.com/kubeflow/kubeflow/components/profile-controller/controllers.(*ProfileReconciler).SetupWithManager\n\t/workspace/controllers/profile_controller.go:381\nmain.main\n\t/workspace/main.go:93\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:204\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1374"}
runtime.main
/usr/local/go/src/runtime/proc.go:204
kubectl logs katib-controller-68c47fbf8b-d6mpj
<<<<<<<< removed info level messages >>>>>>>>>>>>
{"level":"error","ts":1639227826.322595,"logger":"controller-runtime.manager","msg":"error received after stop sequence was engaged","error":"Timeout: failed waiting for *v1beta1.Suggestion Informer to sync","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:132\nsigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/internal.go:529"}
{"level":"error","ts":1639227826.3227415,"logger":"controller-runtime.manager","msg":"error received after stop sequence was engaged","error":"Timeout: failed waiting for *v1beta1.Experiment Informer to sync","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:132\nsigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/internal.go:529"}
{"level":"error","ts":1639227826.32281,"logger":"entrypoint","msg":"Unable to run the manager","error":"too many open files","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:132\nmain.main\n\t/go/src/github.com/kubeflow/katib/cmd/katib-controller/v1beta1/main.go:128\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:255"}
Kubeflow was deployed by running kustomize build ${component} | kubectl apply -f - on each of the following components, in the order shown (a scripted version is sketched after the list):
# cert manager
common/cert-manager/cert-manager/base \
common/cert-manager/kubeflow-issuer/base \
# istio
common/istio-1-9/istio-crds/base \
common/istio-1-9/istio-namespace/base \
common/istio-1-9/istio-install/base \
#DEX
common/dex/overlays/istio \
# OIDC Auth Service
common/oidc-authservice/base \
# knative serving
common/knative/knative-serving/base \
common/istio-1-9/cluster-local-gateway/base \
# inference event logging
common/knative/knative-eventing/base \
# kubeflow namespace
common/kubeflow-namespace/base \
# kubeflow roles
common/kubeflow-roles/base \
# kubeflow istio resources
common/istio-1-9/kubeflow-istio-resources/base \
# kubeflow pipelines
apps/pipeline/upstream/env/platform-agnostic-multi-user-pns \
# KFServing
apps/kfserving/upstream/overlays/kubeflow \
# Katib
apps/katib/upstream/installs/katib-with-kubeflow \
# Central Dashboard
apps/centraldashboard/upstream/overlays/istio \
# Admission Controler
apps/admission-webhook/upstream/overlays/cert-manager \
# Notebooks
apps/jupyter/notebook-controller/upstream/overlays/kubeflow \
# Jupyter web app
apps/jupyter/jupyter-web-app/upstream/overlays/istio \
# Profiles + KFAM
apps/profiles/upstream/overlays/kubeflow \
# Volumes Web app
apps/volumes-web-app/upstream/overlays/istio \
# Tensorboard
apps/tensorboard/tensorboards-web-app/upstream/overlays/istio \
# Training Operator
apps/training-operator/upstream/overlays/kubeflow \
# User Namespace
common/user-namespace/base \
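For reference, a minimal sketch of how the per-component apply can be scripted, assuming it is run from the root of the manifests checkout and that COMPONENTS (an illustrative variable name) is a shell array holding the paths above, in order:
# COMPONENTS=( common/cert-manager/cert-manager/base ... common/user-namespace/base )
for component in "${COMPONENTS[@]}"; do
  kustomize build "$component" | kubectl apply -f -
done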
Platform
macOS: 11.6.1
MacBook Pro 2019 (Intel), 16 GB RAM
Software Versions:
k3d version
k3d version v5.1.0
k3s version v1.21.5-k3s2 (default)
docker version
Client:
Cloud integration: v1.0.22
Version: 20.10.11
API version: 1.41
Go version: go1.16.10
Git commit: dea9396
Built: Thu Nov 18 00:36:09 2021
OS/Arch: darwin/amd64
Context: default
Experimental: true
Server: Docker Engine - Community
Engine:
Version: 20.10.11
API version: 1.41 (minimum version 1.12)
Go version: go1.16.9
Git commit: 847da18
Built: Thu Nov 18 00:35:39 2021
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.4.12
GitCommit: 7b11cfaabd73bb80907dd23182b9347b4245eb5d
runc:
Version: 1.0.2
GitCommit: v1.0.2-0-g52b36a2
docker-init:
Version: 0.19.0
GitCommit: de40ad0
kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.4", GitCommit:"b695d79d4f967c403a96986f1750a35eb75e75f1", GitTreeState:"clean", BuildDate:"2021-11-17T15:48:33Z", GoVersion:"go1.16.10", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.5+k3s2", GitCommit:"724ef700bab896ff252a75e2be996d5f4ff1b842", GitTreeState:"clean", BuildDate:"2021-10-05T19:59:14Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}
kustomize version
Version: {KustomizeVersion:3.2.0 GitCommit:a3103f1e62ddb5b696daa3fd359bb6f2e8333b49 BuildDate:2019-09-18T16:26:36Z GoOs:darwin GoArch:amd64}
k3d cluster nodes
kubectl get node -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
k3d-kubeflow-server-0 Ready control-plane,master 79m v1.21.5+k3s2 172.19.0.2 <none> Unknown 5.10.76-linuxkit containerd://1.4.11-k3s1
k3d-kubeflow-agent-0 Ready <none> 79m v1.21.5+k3s2 172.19.0.3 <none> Unknown 5.10.76-linuxkit containerd://1.4.11-k3s1
ulimit for the two nodes
ulimit -a # on server node
core file size (blocks) (-c) 0
data seg size (kb) (-d) unlimited
scheduling priority (-e) 0
file size (blocks) (-f) unlimited
pending signals (-i) 51481
max locked memory (kb) (-l) 64
max memory size (kb) (-m) unlimited
open files (-n) 1048576
POSIX message queues (bytes) (-q) 819200
real-time priority (-r) 0
stack size (kb) (-s) 8192
cpu time (seconds) (-t) unlimited
max user processes (-u) unlimited
virtual memory (kb) (-v) unlimited
file locks (-x) unlimited
ulimit -a # on worker node
core file size (blocks) (-c) 0
data seg size (kb) (-d) unlimited
scheduling priority (-e) 0
file size (blocks) (-f) unlimited
pending signals (-i) 51481
max locked memory (kb) (-l) 64
max memory size (kb) (-m) unlimited
open files (-n) 1048576
POSIX message queues (bytes) (-q) 819200
real-time priority (-r) 0
stack size (kb) (-s) 8192
cpu time (seconds) (-t) unlimited
max user processes (-u) unlimited
virtual memory (kb) (-v) unlimited
file locks (-x) unlimited
@jimthompson5802 I've also seen this happening in a KinD cluster I had, for the same Deployments. In my case I mitigated these errors by increasing my laptop's fs.inotify.max_user_{watches,instances}
settings.
Not sure if this will also work for k3s though.
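For comparison, the current values on a Linux node (or inside the k3d/KinD node container) can be checked with something like:
sysctl fs.inotify.max_user_watches fs.inotify.max_user_instances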
@kimwnasptd thank you for the suggestion.
In my case I mitigated these errors by increasing my laptop's
fs.inotify.max_user_{watches,instances}
settings.
I just want to confirm the parameter names cited are from Linux. If this is correct, then I believe the equivalent parameters in macOS are these:
$ sysctl -a | grep "kern.maxfiles"
kern.maxfiles: 16777216
kern.maxfilesperproc: 524288
My belief about the parameter names comes from this posting.
If this is the case, then the change did not seem to work. What values did you use to get KinD to work?
Again, thank you for taking the time to respond to my question.
I am also facing a similar issue on KinD. Some of the pods are going into CrashLoopBackOff state. The error is as below:
Error starting filewatcher: 'too many open files'. Configuration changes will not be detected!
@kimwnasptd Can you please share the equivalent settings (fs.inotify.max_user_{watches,instances}) we can apply on Mac? Thanks.
@jimthompson5802 @skothawa-tibco
I'm also using Mac, Docker Desktop and k3d.
What worked for me was to open Docker Preferences -> Docker Engine and add to the config:
"default-ulimits": {
"nofile": {
"Soft": 640000,
"Hard": 640000,
"Name": "nofile"
}
},
This is simply 10x the defaults from the Docker daemon configuration.
Restart Docker.
After killing all crashing pods, they got recreated successfully.
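To confirm the raised limit actually reaches the cluster nodes after the restart, the node containers can be checked directly, e.g. (k3d-kubeflow-server-0 is the server node name from the kubectl get node output above; use whatever docker ps shows for your cluster):
docker exec -it k3d-kubeflow-server-0 sh -c 'ulimit -n'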
Not sure if related, because I also made another change before restarting Docker. Check whether your mysql pod is failing due to too many open files or something else. Mine was crashing because of the --initialize specified but the data directory has files in it error. I simply deleted both the mysql PV and PVC and recreated them from the manifests.
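A sketch of what such a delete/recreate can look like, assuming the pipelines mysql PVC lives in the kubeflow namespace (the names below are only illustrative placeholders, not necessarily what was used here; the manifest path is the pipelines entry from the deployment list above):
kubectl get pvc,pv -n kubeflow                      # find the actual mysql PVC/PV names
kubectl delete pvc <mysql-pvc-name> -n kubeflow     # illustrative placeholder
kubectl delete pv <pv-bound-to-that-claim>          # illustrative placeholder
kustomize build apps/pipeline/upstream/env/platform-agnostic-multi-user-pns | kubectl apply -f -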
I am using Mac, Docker Desktop and KinD. @jimthompson5802 I tried the above settings but had no luck. The settings are updated both on the host machine and in the Docker daemon configuration, but the issue remains the same. Can someone please look into this?
"default-ulimits": {
"nofile": {
"Soft": 640000,
"Hard": 640000,
"Name": "nofile"
}
},
ulimit -a
-t: cpu time (seconds) unlimited
-f: file size (blocks) unlimited
-d: data seg size (kbytes) unlimited
-s: stack size (kbytes) 8192
-c: core file size (blocks) 0
-v: address space (kbytes) unlimited
-l: locked-in-memory size (kbytes) unlimited
-u: processes 2048
-n: file descriptors 524288
@skothawa-tibco
Out of curiosity, could you try running launchctl limit maxfiles 200000 in a terminal, restart Docker, kill the failing containers, and see if that helps?
On the host machine terminal we can see the values below:
launchctl limit maxfiles
maxfiles 524288 5242880
After exec'ing into the worker node, I get the error below:
docker exec -it 85754ed34564 bash
bash-5.0# launchctl limit maxfiles
bash: launchctl: command not found
bash-5.0#
Below are the running KinD containers:
docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
85754ed34564 kindest/node:v1.20.7 "/usr/local/bin/entr…" 32 minutes ago Up 32 minutes worker
f072d9bc99b1 kindest/node:v1.20.7 "/usr/local/bin/entr…" 32 minutes ago Up 32 minutes 0.0.0.0:80->80/tcp, 0.0.0.0:443->443/tcp, 127.0.0.1:6443->6443/tcp control-plane
5800621e01fd rpardini/docker-registry-proxy:0.6.3 "/entrypoint.sh" 32 minutes ago Up 32 minutes 80/tcp, 3128/tcp, 8081-8082/tcp registry-proxy
ulimit values inside the worker node:
docker exec -it 85754ed34564 bash
root@tibco-cic-worker:/# ulimit -a
real-time non-blocking time (microseconds, -R) unlimited
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 95734
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 640000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
OS details: macOS Monterey 12.1
@bartgras We already have values higher than you suggested. Let me know if there are any other pointers that can be tried.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in one week if no further activity occurs. Thank you for your contributions.
@jimthompson5802 I've also seen this happening in a KinD cluster I had, for the same Deployments. In my case I mitigated these errors by increasing my laptop's
fs.inotify.max_user_{watches,instances}
settings. Not sure if this will also work for k3s though.
This worked for me too:
sudo sysctl fs.inotify.max_user_instances=1280
sudo sysctl fs.inotify.max_user_watches=655360
(10x the previous values) solved this problem on a k0s instance.
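On a Linux host these can be made persistent across reboots with a sysctl drop-in, e.g. (file name is illustrative):
# /etc/sysctl.d/99-inotify.conf
fs.inotify.max_user_instances = 1280
fs.inotify.max_user_watches = 655360
# apply without rebooting: sudo sysctl --system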
@jimthompson5802 I've also seen this happening in a KinD cluster I had, for the same Deployments. In my case I mitigated these errors by increasing my laptop's
fs.inotify.max_user_{watches,instances}
settings. Not sure if this will also work for k3s though.
This worked for me too:
sudo sysctl fs.inotify.max_user_instances=1280
sudo sysctl fs.inotify.max_user_watches=655360
(10x previous values) solved this problem on k0s instance
Thanks for saving my time. I solved my issue with the above commands.
I've hit this issue today while testing 1.6 on microk8s. The pods affected were: katib-controller, kubeflow-profiles, kfp-api and kfp-persistence.
@mstopa's workaround did fix it, but I'm wondering if we are doing something wrong in these components for this to occur; could we possibly be more efficient in the way we lease API watchers?
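For anyone digging into that, a rough way to count how many inotify instances are currently held on a node (plain /proc inspection on a Linux node or inside the node container, nothing Kubeflow-specific):
find /proc/*/fd -lname anon_inode:inotify 2>/dev/null | wc -l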
/close
There has been no activity for a long time. Please reopen if necessary.
@juliusvonkohout: Closing this issue.
In response to this:
/close
There has been no activity for a long time. Please reopen if necessary.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.