argo-events
too many open files - Error
Describe the bug
When a Sensor starts, it always throws the following error and then exits with exit code 1:
{"level":"info","ts":1648741781.4416764,"logger":"argo-events.sensor","caller":"cmd/start.go:73","msg":"starting sensor server","sensorName":"kafka","version":"v1.6.0"}
{"level":"info","ts":1648741781.4422603,"logger":"argo-events.sensor","caller":"metrics/metrics.go:172","msg":"starting metrics server","sensorName":"kafka"}
2022/03/31 15:49:41 too many open files
Unfortunately there is no additional information. I already checked the nodes it runs on for file descriptor exhaustion, but everything looks fine there.
When the Sensor runs on a fresh node, though, it works fine. But we can't always start fresh nodes, and the affected nodes look fine in terms of overall resource utilization.
To Reproduce
Steps to reproduce the behavior:
- Start any sensor with a kafka EventSource (I did not test if it also happens with other sources)
Expected behavior
It starts up normally.
Environment (please complete the following information):
- Kubernetes: [e.g. v1.19.15-eks-9c63c4]
- Argo: v3.2.9
- Argo Events: 1.6.0
Can you please help us by identifying the open file handles? This will help us diagnose the issue. I think you can run lsof to list open file handles. You will need to use an ephemeral container to do this. I do not know if EKS supports this yet (AWS is slow on upgrades).
Note to self: it looks like it does not start up correctly; the metrics server starts and then we get "too many open files".
@mrkwtz - Can you check the node's maximum allowed open files (cat /proc/sys/fs/file-max) and also the current number of open files (lsof | wc -l)?
Yeah, it seems like even the latest EKS version doesn't support ephemeral containers yet. Do you have another idea how to debug that?
Is there a way to log in to the EKS node?
This issue has been automatically marked as stale because it has not had any activity in the last 60 days. It will be closed if no further activity occurs. Thank you for your contributions.
@whynowy @alexec I've experienced this bug when I tried to switch our EKS nodes from AL2 to Bottlerocket.
Exec'ing into a random pod and issuing ulimit -n gives me:
- 1048576 on an AL2 instance
- 65536 on a Bottlerocket instance
If I dive into the Bottlerocket instance and issue containerd config dump, I see this part:
[plugins."io.containerd.grpc.v1.cri"]
...
process_rlimit_no_file_hard = 1048576
process_rlimit_no_file_soft = 65536
So as I understand this, the process inside a pod can extend its soft limit up to the hard limit, and the sensor pod just doesn't do that.
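For context on the soft/hard distinction: a process can raise its own soft RLIMIT_NOFILE up to the hard ceiling at startup. A minimal Go sketch of what that looks like (illustrative only, not taken from the argo-events codebase):

package main

import (
    "fmt"
    "syscall"
)

func main() {
    // Read the current open-file limits for this process.
    var rl syscall.Rlimit
    if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
        panic(err)
    }
    fmt.Printf("before: soft=%d hard=%d\n", rl.Cur, rl.Max)

    // Bump the soft limit up to the hard ceiling; only the soft limit changes.
    rl.Cur = rl.Max
    if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
        panic(err)
    }

    if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
        panic(err)
    }
    fmt.Printf("after:  soft=%d hard=%d\n", rl.Cur, rl.Max)
}

Note that Go 1.19+ already raises the soft RLIMIT_NOFILE to the hard limit automatically (in the os package), so depending on which Go version a given argo-events release was built with, this may already happen, which would point away from NOFILE and towards some other limit.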
@vadimgusev-codefresh - Is this something that can be done in the Dockerfile?
I am also hitting this issue. I see this on sensors and event sources. Rescheduling onto a 'fresh' box does seem to help for a while, but it does eventually start failing. I see this with sqs and webhook (the two that I have tried). The log:
{"level":"info","ts":1659067244.7226589,"logger":"argo-events.eventsource","caller":"cmd/start.go:63","msg":"starting eventsource server","eventSourceName":"sqs-argo-event-source","version":"v1.7.1"}
{"level":"info","ts":1659067244.7227285,"logger":"argo-events.eventsource","caller":"eventsources/eventing.go:403","msg":"Starting event source server...","eventSourceName":"sqs-argo-event-source"}
{"level":"info","ts":1659067244.7228265,"logger":"argo-events.eventsource","caller":"metrics/metrics.go:172","msg":"starting metrics server","eventSourceName":"sqs-argo-event-source"}
2022/07/29 04:00:44 too many open files
From the node:
root@box:/# lsof |wc -l
534122
root@box:/# cat /proc/sys/fs/file-max
9223372036854775807
I currently cannot attach an ephemeral container because the pod is in 'Error' - it cannot start the main container... just keeps crashing. After forcing it to reschedule on a different node, here is what the ephemeral container says once it has started up:
/ # cat /proc/sys/fs/file-max
9223372036854775807
/ # lsof |wc -l
21
Thanks for sharing all of these!
In the EventSource and Sensor spec, there's a way to configure the pod securityContext:
spec:
  template:
    securityContext:
      sysctls:
        - name: fs.file-max
          value: "your-value"
Try it out.
To my knowledge, fs.file-max is not namespaced, and so cannot be set this way.
We are finally able to ssh into our nodes. Unfortunately, because we're out of inotify user watches / instances (we don't know which yet), we're not :D
It turns out the fs.inotify.max_user_instances setting is pretty low on our nodes, although it should be higher. You can take a look here: https://github.com/awslabs/amazon-eks-ami/issues/1065. So first we're going to investigate why that's the case, and if the problem still persists I'll post again.
We found the culprit. The privileged promtail pods (<= v3.0.3) are setting the fs.inotify.max_user_instances to 128. We'll upgrade to at least v3.0.4 and then that error should be gone. Closed for now.
Thank you for your patience with us :)
@mrkwtz that's an interesting finding! We are also using loki/promtail, so it might affect us as well; will investigate!
@vadimgusev-codefresh See here, https://github.com/grafana/helm-charts/commit/bd6e0805e0b14e0d5a39fc9b0dada0ee0042ed90
Nice! @mrkwtz
The issue still exists for us. We're running argo-events 1.8.0, argo-cd 2.6.7, and Promtail (helm chart version 6.11.3, app version 2.8.2).
The nodes look healthy and the issues appear randomly in sensors and/or event sources.
{"level":"info","ts":1686750556.961388,"logger":"argo-events.eventsource","caller":"cmd/start.go:63","msg":"starting eventsource server","eventSourceName":"web-apps","version":"v1.8.0"}
{"level":"info","ts":1686750556.9620998,"logger":"argo-events.eventsource","caller":"metrics/metrics.go:175","msg":"starting metrics server","eventSourceName":"web-apps"}
2023/06/14 13:49:16 too many open files
From the node I get:
/ # cat /proc/sys/fs/file-max
9223372036854775807
/ # lsof |wc -l
5410
/ # lsof |wc -l
5411
/ # lsof |wc -l
5412
/ # lsof |wc -l
5410
The issue appeared some weeks ago all of a sudden and now randomly comes and goes.
I added a fresh node pool and moved about 30 sensors to the empty node and got the same error in some of the sensors. Also, the sensor pod never auto-recovers. After deleting the pod multiple times, it sometimes starts properly, but often it doesn't.
If multiple sensors are created at the same time, the chance that they fail seems higher. Likewise, the more sensors there are on the same node, the higher the chance of failure seems to be.
The logs differ sometimes. Occasionally it is:
2023/06/15 08:20:39 too many open files
{"level":"info","ts":1686817239.5714653,"logger":"argo-events.sensor","caller":"cmd/start.go:84","msg":"starting sensor server","sensorName":"pengine-test","version":"v1.8.0"}
Stream closed EOF for argo/pengine-test-sensor-jt82s-7b84f7b8c7-z28hs (main)
So the metrics server isn't necessarily even started before the failure.
I think this ticket should be reopened; in this state Argo Events isn't working reliably at all.
I finally found a way to fix it.
Execute on node(s) where sensors/event sources are deployed:
sysctl fs.inotify.max_user_instances=1280
sysctl fs.inotify.max_user_watches=655360
https://github.com/kubeflow/manifests/issues/2087#issuecomment-1101482095
I used the inotify-consumers script to check the max_user_instances and max_user_watches.
The default values on Ubuntu 22.04.2 LTS at the time of this writing are:
- fs.inotify.max_user_instances=128
- fs.inotify.max_user_watches=249493
The result of the script shows user_watches < 1000, so we are nowhere near that limit, but user_instances is 127 for root. I increased this to 256 and confirmed the issue is resolved after restarting the sensor ReplicaSet. I ran sudo sysctl fs.inotify.max_user_instances=256 on all worker nodes.
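If the inotify-consumers script isn't handy, roughly the same check can be done by counting anon_inode:inotify descriptors under /proc. A rough Go sketch (Linux only, run as root to see every process; illustrative, not part of argo-events):

package main

import (
    "fmt"
    "os"
    "path/filepath"
    "strconv"
)

func main() {
    total := 0
    procs, err := os.ReadDir("/proc")
    if err != nil {
        panic(err)
    }
    for _, p := range procs {
        pid, err := strconv.Atoi(p.Name())
        if err != nil {
            continue // not a PID directory
        }
        fdDir := filepath.Join("/proc", p.Name(), "fd")
        fds, err := os.ReadDir(fdDir)
        if err != nil {
            continue // process exited or permission denied
        }
        count := 0
        for _, fd := range fds {
            // Each inotify instance shows up as an fd whose link target is "anon_inode:inotify".
            target, err := os.Readlink(filepath.Join(fdDir, fd.Name()))
            if err == nil && target == "anon_inode:inotify" {
                count++
            }
        }
        if count > 0 {
            fmt.Printf("pid %d: %d inotify instance(s)\n", pid, count)
            total += count
        }
    }
    fmt.Printf("total: %d (compare against fs.inotify.max_user_instances, a per-user limit)\n", total)
}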
Why is this closed? The issue still exists with argo-events 1.8.1. All of the above ideas for checking the nodes were not successful.
{"level":"info","ts":1695890455.5424418,"logger":"argo-events.eventsource","caller":"cmd/start.go:63","msg":"starting eventsource server","eventSourceName":"webhook","version":"v1.8.1"}
{"level":"info","ts":1695890455.5425568,"logger":"argo-events.eventsource","caller":"eventsources/eventing.go:443","msg":"Starting event source server...","eventSourceName":"webhook"}
ERROR 2023/09/28 08:40:55 failed to create watcher: too many open files
Further, even if that works, it has to be set manually on the nodes, which still makes this issue a big pain.
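For anyone still debugging this: the "failed to create watcher: too many open files" message is consistent with inotify_init failing with EMFILE because the per-user fs.inotify.max_user_instances limit has been reached, not because fs.file-max is exhausted, which is why the node-level open-file counts above all look fine. A minimal Go sketch that reproduces the symptom (illustrative only, not from argo-events):

package main

import (
    "fmt"
    "syscall"
)

func main() {
    var fds []int
    defer func() {
        // Release the instances so the node isn't left exhausted.
        for _, fd := range fds {
            syscall.Close(fd)
        }
    }()

    for {
        // Every inotify instance counts against fs.inotify.max_user_instances
        // for the calling UID (and against the process's open-file rlimit).
        fd, err := syscall.InotifyInit1(0)
        if err != nil {
            // With the common default of fs.inotify.max_user_instances=128,
            // this fails with EMFILE ("too many open files") after roughly
            // 128 instances, no matter how large fs.file-max is.
            fmt.Printf("inotify_init failed after %d instances: %v\n", len(fds), err)
            return
        }
        fds = append(fds, fd)
    }
}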