
too many open files - Error

mrkwtz opened this issue 2 years ago • 12 comments

Describe the bug: When a Sensor starts, it always throws the following error and then exits with exit code 1:

{"level":"info","ts":1648741781.4416764,"logger":"argo-events.sensor","caller":"cmd/start.go:73","msg":"starting sensor server","sensorName":"kafka","version":"v1.6.0"}                                          
{"level":"info","ts":1648741781.4422603,"logger":"argo-events.sensor","caller":"metrics/metrics.go:172","msg":"starting metrics server","sensorName":"kafka"}                                                     
2022/03/31 15:49:41 too many open files   

Unfortunately there is no additional information. I already looked at the nodes it's running on for file descriptor exhaustion, but everything looks fine there.

When the Sensor runs on a fresh node, though, it works fine. But we can't always spin up fresh nodes, and the affected nodes look fine in terms of overall resource utilization.

To Reproduce: Steps to reproduce the behavior:

  1. Start any sensor with a kafka EventSource (I did not test whether it also happens with other sources)

Expected behavior: It starts up normally.

Environment (please complete the following information):

  • Kubernetes: [e.g. v1.19.15-eks-9c63c4]
  • Argo: v3.2.9
  • Argo Events: 1.6.0

Message from the maintainers:

If you wish to see this enhancement implemented please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.

mrkwtz avatar Mar 31 '22 15:03 mrkwtz

Can you please help us by identifying the open file handles? That would help us diagnose this issue. I think you can run lsof to list open file handles. You will need to use an ephemeral container to do this. I do not know if EKS supports this yet (AWS are slow on upgrades).

alexec avatar Mar 31 '22 16:03 alexec
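
For reference, if the cluster does support ephemeral debug containers, a check along these lines is one option; the namespace, pod name, and target container name below are placeholders, not taken from this issue:

kubectl -n <namespace> debug -it <sensor-pod> --image=busybox:1.36 --target=<sensor-container> \
  -- sh -c 'for p in /proc/[0-9]*; do echo "$p $(ls $p/fd 2>/dev/null | wc -l)"; done'

Because --target shares the process namespace of the targeted container, the loop prints a per-process count of open file descriptors as seen from the debug container.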

Note to self - looks like it does not start up correctly: the metrics server starts and then we get "too many open files".

alexec avatar Mar 31 '22 16:03 alexec

@mrkwtz - Can you check the node's maximum allowed open files (cat /proc/sys/fs/file-max) and also the current number of open files (lsof | wc -l)?

whynowy avatar Mar 31 '22 19:03 whynowy

> You will need to use an ephemeral container to do this. I do not know if EKS supports this yet (AWS are slow on upgrades).

Yeah, it seems like even the latest EKS version doesn't support this feature yet. Do you have another idea how to debug that?

mrkwtz avatar Apr 01 '22 11:04 mrkwtz

Is there a way to log in to the EKS node?

whynowy avatar Apr 04 '22 16:04 whynowy

This issue has been automatically marked as stale because it has not had any activity in the last 60 days. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Jun 04 '22 02:06 github-actions[bot]

@whynowy @alexec I've experienced this bug when I tried to switch our EKS nodes from AL2 to Bottlerocket. Exec'ing into a random pod and running ulimit -n gives me:

  • 1048576 on AL2 instance
  • 65536 on Bottlerocket instance

If I dive into the Bottlerocket instance and run containerd config dump, I see this part:

  [plugins."io.containerd.grpc.v1.cri"]
    ...
    process_rlimit_no_file_hard = 1048576
    process_rlimit_no_file_soft = 65536

So as I understand it, a process inside a pod can raise its soft limit up to the hard limit, and the sensor pod just doesn't do that.

vadimgusev-codefresh avatar Jun 09 '22 10:06 vadimgusev-codefresh
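
One way to see the soft/hard split described above is to read the limits straight from procfs; the "Max open files" row shows both values side by side (the PID below is a placeholder for the sensor process on the node):

# inside the pod, for the container's main process
grep 'Max open files' /proc/1/limits
# on the node, for a specific process
grep 'Max open files' /proc/<sensor-pid>/limits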

@vadimgusev-codefresh - Is this something that can be done in the Dockerfile?

whynowy avatar Jun 10 '22 06:06 whynowy

I am also hitting this issue, on both sensors and event sources. Rescheduling onto a 'fresh' box does seem to help for a while, but it eventually starts failing again. I see this with sqs and webhook (the two that I have tried). The log:

{"level":"info","ts":1659067244.7226589,"logger":"argo-events.eventsource","caller":"cmd/start.go:63","msg":"starting eventsource server","eventSourceName":"sqs-argo-event-source","version":"v1.7.1"}
{"level":"info","ts":1659067244.7227285,"logger":"argo-events.eventsource","caller":"eventsources/eventing.go:403","msg":"Starting event source server...","eventSourceName":"sqs-argo-event-source"}
{"level":"info","ts":1659067244.7228265,"logger":"argo-events.eventsource","caller":"metrics/metrics.go:172","msg":"starting metrics server","eventSourceName":"sqs-argo-event-source"}
2022/07/29 04:00:44 too many open files

From the node:

root@box:/# lsof |wc -l
534122
root@box:/# cat /proc/sys/fs/file-max
9223372036854775807

I currently cannot attach an ephemeral container because the pod is in 'Error' - it cannot start the main container and just keeps crashing. After forcing it to reschedule onto a different node, here is what the ephemeral container shows once it has started up:

/ # cat /proc/sys/fs/file-max
9223372036854775807
/ # lsof |wc -l
21

drpebcak avatar Jul 29 '22 04:07 drpebcak

> I am also hitting this issue, on both sensors and event sources. [...]

Thanks for sharing all of these!

whynowy avatar Jul 29 '22 07:07 whynowy

In EventSource and Sensor spec, there's a way to configure pod SecurityContext:

spec:
  template:
    securityContext:
      sysctls:
      - name: fs.file-max
        value: "your-value"

Try it out.

whynowy avatar Jul 29 '22 07:07 whynowy

> In EventSource and Sensor spec, there's a way to configure pod SecurityContext: [...] Try it out.

To my knowledge, fs.file-max is not namespaced, and so cannot be set this way.

drpebcak avatar Jul 29 '22 07:07 drpebcak

We're finally able to SSH into our nodes. Unfortunately, it turns out we're running out of inotify user watches / instances (we don't know which yet), so the nodes aren't fine after all :D

It turns out the fs.inotify.max_user_instances setting is pretty low on our nodes even though it should be higher (see https://github.com/awslabs/amazon-eks-ami/issues/1065). We'll first investigate why that's the case, and if the problem still persists I'll post again.

mrkwtz avatar Oct 28 '22 10:10 mrkwtz
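
For anyone checking the same thing, the current limits and a rough per-process count of inotify instances can be read on the node like this (standard procfs paths; every fd pointing at anon_inode:inotify is one instance):

sysctl fs.inotify.max_user_instances fs.inotify.max_user_watches
find /proc/[0-9]*/fd -lname 'anon_inode:inotify' 2>/dev/null | cut -d/ -f3 | sort | uniq -c | sort -rn | head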

We found the culprit: the privileged promtail pods (<= v3.0.3) set fs.inotify.max_user_instances to 128. We'll upgrade to at least v3.0.4, and then the error should be gone. Closing for now.

Thank you for your patience with us :)

mrkwtz avatar Nov 02 '22 08:11 mrkwtz

@mrkwtz that's an interesting finding! We are also using loki/promtail, so it might affect us too; we'll investigate!

vadimgusev-codefresh avatar Nov 02 '22 08:11 vadimgusev-codefresh

@vadimgusev-codefresh See here, https://github.com/grafana/helm-charts/commit/bd6e0805e0b14e0d5a39fc9b0dada0ee0042ed90

mrkwtz avatar Nov 02 '22 10:11 mrkwtz

Nice! @mrkwtz

whynowy avatar Nov 03 '22 01:11 whynowy

The issue still exists for us. We're running argo-events 1.8.0, Argo CD 2.6.7, and Promtail (Helm chart 6.11.3, app version 2.8.2). The nodes look healthy, and the issue appears randomly in sensors and/or event sources.

{"level":"info","ts":1686750556.961388,"logger":"argo-events.eventsource","caller":"cmd/start.go:63","msg":"starting eventsource server","eventSourceName":"web-apps","version":"v1.8.0"}
{"level":"info","ts":1686750556.9620998,"logger":"argo-events.eventsource","caller":"metrics/metrics.go:175","msg":"starting metrics server","eventSourceName":"web-apps"}
2023/06/14 13:49:16 too many open files

From the node I get:

/ # cat /proc/sys/fs/file-max
9223372036854775807
/ # lsof |wc -l
5410
/ # lsof |wc -l
5411
/ # lsof |wc -l
5412
/ # lsof |wc -l
5410

The issue appeared a few weeks ago out of nowhere and now randomly comes and goes.

nice-pink avatar Jun 14 '23 13:06 nice-pink

I added a fresh node pool and moved about 30 sensors to the empty node, and got the same error in some of the sensors. The sensor pod also never auto-recovers. After deleting the pod multiple times it may eventually start properly, but often it doesn't.

If multiple sensors are created at the same time, the chance of them failing seems higher. The chance of failure also seems higher when there are more sensors on the same node.

The logs sometimes differ. Occasionally they look like this:

2023/06/15 08:20:39 too many open files
{"level":"info","ts":1686817239.5714653,"logger":"argo-events.sensor","caller":"cmd/start.go:84","msg":"starting sensor server","sensorName":"pengine-test","version":"v1.8.0"}
Stream closed EOF for argo/pengine-test-sensor-jt82s-7b84f7b8c7-z28hs (main)

So the metrics server isn't necessarily even started.

I think this ticket should be reopened; in this state Argo Events isn't working reliably at all.

nice-pink avatar Jun 15 '23 08:06 nice-pink

I finally found a way to fix it.

Execute on the node(s) where sensors/event sources are deployed:

sysctl fs.inotify.max_user_instances=1280
sysctl fs.inotify.max_user_watches=655360

https://github.com/kubeflow/manifests/issues/2087#issuecomment-1101482095

nice-pink avatar Jun 16 '23 09:06 nice-pink
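
A rough sketch of applying those same sysctls automatically on every node instead of by hand, assuming a privileged DaemonSet is acceptable in the cluster (the names, namespace, and values below are illustrative, not part of argo-events):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: inotify-sysctls        # illustrative name
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: inotify-sysctls
  template:
    metadata:
      labels:
        app: inotify-sysctls
    spec:
      initContainers:
      - name: set-sysctls
        image: busybox:1.36
        securityContext:
          privileged: true     # required to write node-level sysctls
        command:
        - sh
        - -c
        - sysctl -w fs.inotify.max_user_instances=1280 && sysctl -w fs.inotify.max_user_watches=655360
      containers:
      - name: pause            # keeps the pod alive after the init container has applied the sysctls
        image: registry.k8s.io/pause:3.9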

I used the inotify-consumers script to check the max_user_instances and max_user_watches.

The default values on Ubuntu 22.04.2 LTS at the time of this writing are:

  • fs.inotify.max_user_instances=128
  • fs.inotify.max_user_watches=249493

The result of the script shows user_watches < 1000, so we are nowhere near that limit, but user_instances is 127 for root. I increased it to 256 and confirmed the issue was resolved after restarting the sensor ReplicaSet. I ran sudo sysctl fs.inotify.max_user_instances=256 on all worker nodes.

jameshearttech avatar Aug 04 '23 20:08 jameshearttech

Why is this closed? The issue still exists with argo-events 1.8.1. None of the above ideas for checking the nodes were successful.

{"level":"info","ts":1695890455.5424418,"logger":"argo-events.eventsource","caller":"cmd/start.go:63","msg":"starting eventsource server","eventSourceName":"webhook","version":"v1.8.1"}
{"level":"info","ts":1695890455.5425568,"logger":"argo-events.eventsource","caller":"eventsources/eventing.go:443","msg":"Starting event source server...","eventSourceName":"webhook"}
ERROR 2023/09/28 08:40:55 failed to create watcher: too many open files

AlexanderWiechert avatar Sep 28 '23 08:09 AlexanderWiechert

Further, even if it works, the values have to be set manually on the nodes, which still makes this issue a big pain.

nice-pink avatar Sep 28 '23 08:09 nice-pink