fedmsg
fedmsg copied to clipboard
Leaking file descriptors?
In koschei we have a simple tail_messages
loop. Recently it started dying (approximately every hour) with:
IOError: [Errno 24] Too many open files: '/etc/pki/fedmsg/ca.crt'
Looking at /proc/PID/fd
when it's still running, it contains mostly the following repeated many times:
lrwx------. 1 koschei koschei 64 Nov 30 18:19 27 -> socket:[3118780725]
lrwx------. 1 koschei koschei 64 Nov 30 18:21 270 -> anon_inode:[eventfd]
lrwx------. 1 koschei koschei 64 Nov 30 18:21 271 -> anon_inode:[eventfd]
lrwx------. 1 koschei koschei 64 Nov 30 18:21 272 -> anon_inode:[eventfd]
lrwx------. 1 koschei koschei 64 Nov 30 18:21 273 -> anon_inode:[eventfd]
lrwx------. 1 koschei koschei 64 Nov 30 18:21 274 -> anon_inode:[eventfd]
lrwx------. 1 koschei koschei 64 Nov 30 18:21 275 -> anon_inode:[eventfd]
lrwx------. 1 koschei koschei 64 Nov 30 18:21 276 -> anon_inode:[eventfd]
lrwx------. 1 koschei koschei 64 Nov 30 18:21 277 -> anon_inode:[eventfd]
lrwx------. 1 koschei koschei 64 Nov 30 18:21 278 -> anon_inode:[eventfd]
lrwx------. 1 koschei koschei 64 Nov 30 18:21 279 -> anon_inode:[eventfd]
CC @mizdebsk
Out of curiosity, which fedmsg version do you run?
fedmsg-0.19.1-1.fc26.noarch
That seems old, I though infra repos always backported the current version
@msimacek not if the normal stable repos (like, in this case f26) contain newer.
Is this still present in the latest release?
Yes, it is still reproducible. In 2 hours it leaked several hundreds file descriptors.
[root@koschei-backend01 ~][PROD]# date
Fri Dec 8 10:54:04 UTC 2017
[root@koschei-backend01 ~][PROD]# ps -p 7093 -wo pid,lstart,cmd
PID STARTED CMD
7093 Fri Dec 8 08:33:28 2017 /usr/libexec/koschei/koschei-watcher -m koschei.backend.main watcher
[root@koschei-backend01 ~][PROD]# ls /proc/7093/fd | wc -l
629
[root@koschei-backend01 ~][PROD]# rpm -q fedmsg
fedmsg-1.0.1-4.fc26.noarch
Interestingly, fedmsg-tail-3
opens only 11 fds when ran locally, but it immediately has ~250 of them open when I run it from koschei-backend01.stg. fedmsg-tail-2
is much worse - 12 fds locally vs ~600 when ran from koschei-backend01 (prod). Could this be a problem with our fedmsg config (/etc/fedmsg.d/) in Fedora infra?
Ahh, I realized https://github.com/fedora-infra/fedmsg/pull/484 never made it into a release. It rewrites the CRL and CA loading. I'm guessing that will fix this problem (assuming what's leaking is the CRL/CA, which wouldn't surprise me). If it's easy for you to try the tip of the current development branch, that'd be great to confirm.
Interestingly, fedmsg-tail-3 opens only 11 fds when ran locally, but it immediately has ~250 of them open when I run it from koschei-backend01.stg.
I'm guessing if you counted up the "endpoints" in the various configurations it'd come out to 230 or something. zmq will open a TCP connection to each of those. I'm surprised there's such a big difference between Python 2 and 3, though. Off the top of my head I can't think of what would account for that.
We have upgraded our production to Fedora 27, Python 3 and fedmsg-1.1.0-1.fc27.noarch
. The problem still occurs.
Can you provide a minimal reproducer + config? Figuring out when this started would be helpful, as well. Does it occur using m2crypto for validation, or only for cryptography+pyopenssl? Only in Python 3?
Config: https://infrastructure.fedoraproject.org/cgit/ansible.git/tree/roles/fedmsg/base/templates
Reproducer:
#!/usr/bin/python3
import fedmsg
import os
for _, _, topic, msg in fedmsg.tail_messages():
nopen = len(os.listdir('/proc/self/fd'))
print("Received msg topic: {}, number open files: {}".format(topic, nopen))
Example output:
Received msg topic: org.fedoraproject.stg.buildsys.task.state.change, number open files: 281
Received msg topic: org.fedoraproject.stg.buildsys.task.state.change, number open files: 282
Received msg topic: org.fedoraproject.stg.buildsys.task.state.change, number open files: 284
Received msg topic: org.fedoraproject.stg.buildsys.task.state.change, number open files: 284
Received msg topic: org.fedoraproject.stg.buildsys.task.state.change, number open files: 285
I left the reproducer script running and after some time it died with:
Traceback (most recent call last):
File "/tmp/fm-reproducer-494.py", line 5, in <module>
nopen = len(os.listdir('/proc/self/fd'))
OSError: [Errno 24] Too many open files: '/proc/self/fd'
I was hoping for a config file rather than a set of ansible templates. I cannot reproduce this with the reproducer script and a minimal configuration that points to 'tcp://fedoraproject.org:9940', so having the exact configuration you're running with in a single file would be helpful.