fedmsg icon indicating copy to clipboard operation
fedmsg copied to clipboard

Leaking file descriptors?

Open msimacek opened this issue 7 years ago • 13 comments

In koschei we have a simple tail_messages loop. Recently it started dying (approximately every hour) with:

IOError: [Errno 24] Too many open files: '/etc/pki/fedmsg/ca.crt'

Looking at /proc/PID/fd when it's still running, it contains mostly the following repeated many times:

lrwx------. 1 koschei koschei 64 Nov 30 18:19 27 -> socket:[3118780725]
lrwx------. 1 koschei koschei 64 Nov 30 18:21 270 -> anon_inode:[eventfd]
lrwx------. 1 koschei koschei 64 Nov 30 18:21 271 -> anon_inode:[eventfd]
lrwx------. 1 koschei koschei 64 Nov 30 18:21 272 -> anon_inode:[eventfd]
lrwx------. 1 koschei koschei 64 Nov 30 18:21 273 -> anon_inode:[eventfd]
lrwx------. 1 koschei koschei 64 Nov 30 18:21 274 -> anon_inode:[eventfd]
lrwx------. 1 koschei koschei 64 Nov 30 18:21 275 -> anon_inode:[eventfd]
lrwx------. 1 koschei koschei 64 Nov 30 18:21 276 -> anon_inode:[eventfd]
lrwx------. 1 koschei koschei 64 Nov 30 18:21 277 -> anon_inode:[eventfd]
lrwx------. 1 koschei koschei 64 Nov 30 18:21 278 -> anon_inode:[eventfd]
lrwx------. 1 koschei koschei 64 Nov 30 18:21 279 -> anon_inode:[eventfd]

CC @mizdebsk

msimacek avatar Dec 01 '17 08:12 msimacek

Out of curiosity, which fedmsg version do you run?

pypingou avatar Dec 01 '17 09:12 pypingou

fedmsg-0.19.1-1.fc26.noarch

msimacek avatar Dec 01 '17 09:12 msimacek

That seems old, I though infra repos always backported the current version

msimacek avatar Dec 01 '17 09:12 msimacek

@msimacek not if the normal stable repos (like, in this case f26) contain newer.

puiterwijk avatar Dec 01 '17 09:12 puiterwijk

Is this still present in the latest release?

jeremycline avatar Dec 07 '17 19:12 jeremycline

Yes, it is still reproducible. In 2 hours it leaked several hundreds file descriptors.

[root@koschei-backend01 ~][PROD]# date
Fri Dec  8 10:54:04 UTC 2017
[root@koschei-backend01 ~][PROD]# ps -p 7093 -wo pid,lstart,cmd
  PID                  STARTED CMD
 7093 Fri Dec  8 08:33:28 2017 /usr/libexec/koschei/koschei-watcher -m koschei.backend.main watcher
[root@koschei-backend01 ~][PROD]# ls /proc/7093/fd | wc -l
629
[root@koschei-backend01 ~][PROD]# rpm -q fedmsg
fedmsg-1.0.1-4.fc26.noarch

mizdebsk avatar Dec 08 '17 10:12 mizdebsk

Interestingly, fedmsg-tail-3 opens only 11 fds when ran locally, but it immediately has ~250 of them open when I run it from koschei-backend01.stg. fedmsg-tail-2 is much worse - 12 fds locally vs ~600 when ran from koschei-backend01 (prod). Could this be a problem with our fedmsg config (/etc/fedmsg.d/) in Fedora infra?

mizdebsk avatar Dec 08 '17 11:12 mizdebsk

Ahh, I realized https://github.com/fedora-infra/fedmsg/pull/484 never made it into a release. It rewrites the CRL and CA loading. I'm guessing that will fix this problem (assuming what's leaking is the CRL/CA, which wouldn't surprise me). If it's easy for you to try the tip of the current development branch, that'd be great to confirm.

Interestingly, fedmsg-tail-3 opens only 11 fds when ran locally, but it immediately has ~250 of them open when I run it from koschei-backend01.stg.

I'm guessing if you counted up the "endpoints" in the various configurations it'd come out to 230 or something. zmq will open a TCP connection to each of those. I'm surprised there's such a big difference between Python 2 and 3, though. Off the top of my head I can't think of what would account for that.

jeremycline avatar Dec 08 '17 15:12 jeremycline

We have upgraded our production to Fedora 27, Python 3 and fedmsg-1.1.0-1.fc27.noarch. The problem still occurs.

msimacek avatar Feb 09 '18 09:02 msimacek

Can you provide a minimal reproducer + config? Figuring out when this started would be helpful, as well. Does it occur using m2crypto for validation, or only for cryptography+pyopenssl? Only in Python 3?

jeremycline avatar Feb 20 '18 20:02 jeremycline

Config: https://infrastructure.fedoraproject.org/cgit/ansible.git/tree/roles/fedmsg/base/templates

Reproducer:

#!/usr/bin/python3
import fedmsg
import os
for _, _, topic, msg in fedmsg.tail_messages():
    nopen = len(os.listdir('/proc/self/fd'))
    print("Received msg topic: {}, number open files: {}".format(topic, nopen))

Example output:

Received msg topic: org.fedoraproject.stg.buildsys.task.state.change, number open files: 281
Received msg topic: org.fedoraproject.stg.buildsys.task.state.change, number open files: 282
Received msg topic: org.fedoraproject.stg.buildsys.task.state.change, number open files: 284
Received msg topic: org.fedoraproject.stg.buildsys.task.state.change, number open files: 284
Received msg topic: org.fedoraproject.stg.buildsys.task.state.change, number open files: 285

mizdebsk avatar Feb 22 '18 15:02 mizdebsk

I left the reproducer script running and after some time it died with:

Traceback (most recent call last):
  File "/tmp/fm-reproducer-494.py", line 5, in <module>
    nopen = len(os.listdir('/proc/self/fd'))
OSError: [Errno 24] Too many open files: '/proc/self/fd'

mizdebsk avatar Feb 22 '18 21:02 mizdebsk

I was hoping for a config file rather than a set of ansible templates. I cannot reproduce this with the reproducer script and a minimal configuration that points to 'tcp://fedoraproject.org:9940', so having the exact configuration you're running with in a single file would be helpful.

jeremycline avatar Mar 05 '18 15:03 jeremycline