OpenDKIM
OpenDKIM copied to clipboard
Large number of file descriptors used in RHEL 9
Currently experiencing an interesting issue where opendkim is exhausting all its file descriptors leading to postfix milter reject errors. This recently started happening after having the systems running for several months in RHEL 9. From what I've been able to tell there seems to be some issue with opendkim when on a system that's using sssd 2.9.1-4.el9_3.1 or greater. I have already opened a case with Red Hat but since opendkim is not part of their repo's they can't say if it's a bug with sssd or opendkim.
For some background: We're using several domains to send from with around 50+ keys. Using, ss -anp | grep 'opendkim' | sed 's/ *$//', we can see streams grow after each email that get's signed. Below is a small example. Generally, we only really see maybe 11 streams at a time and they don't stay open long.
` u_str ESTAB 0 0 * 53817391 * 0 users:(("opendkim",pid=1856103,fd=77)) u_str ESTAB 0 0 * 53832275 * 0 users:(("opendkim",pid=1856103,fd=218)) u_str ESTAB 0 0 * 53824876 * 0 users:(("opendkim",pid=1856103,fd=124)) u_str ESTAB 0 0 * 53824756 * 0 users:(("opendkim",pid=1856103,fd=133)) u_str ESTAB 0 0 * 53836020 * 0 users:(("opendkim",pid=1856103,fd=255)) u_str ESTAB 0 0 * 53838661 * 0 users:(("opendkim",pid=1856103,fd=285)) u_str ESTAB 0 0 * 53808986 * 0 users:(("opendkim",pid=1856103,fd=30)) u_str ESTAB 0 0 * 53835631 * 0 users:(("opendkim",pid=1856103,fd=270)) u_str ESTAB 0 0 * 53832887 * 0 users:(("opendkim",pid=1856103,fd=245)) u_str ESTAB 0 0 * 53838157 * 0 users:(("opendkim",pid=1856103,fd=283)) u_str ESTAB 0 0 * 53840434 * 0 users:(("opendkim",pid=1856103,fd=305)) u_str ESTAB 0 0 * 53830151 * 0 users:(("opendkim",pid=1856103,fd=182)) u_str ESTAB 0 0 * 53830225 * 0 users:(("opendkim",pid=1856103,fd=187)) u_str ESTAB 0 0 * 53830723 * 0 users:(("opendkim",pid=1856103,fd=191)) u_str ESTAB 0 0 * 53828693 * 0 users:(("opendkim",pid=1856103,fd=176)) u_str ESTAB 0 0 * 53827800 * 0 users:(("opendkim",pid=1856103,fd=171)) u_str ESTAB 0 0 * 53825776 * 0 users:(("opendkim",pid=1856103,fd=146)) u_str ESTAB 0 0 * 53834975 * 0 users:(("opendkim",pid=1856103,fd=259)) u_str ESTAB 0 0 * 53808916 * 0 users:(("opendkim",pid=1856103,fd=23)) u_str ESTAB 0 0 * 53810481 * 0 users:(("opendkim",pid=1856103,fd=33)) u_str ESTAB 0 0 * 53832456 * 0 users:(("opendkim",pid=1856103,fd=234)) u_str ESTAB 0 0 * 53810634 * 0 users:(("opendkim",pid=1856103,fd=47)) u_str ESTAB 0 0 * 53835245 * 0 users:(("opendkim",pid=1856103,fd=260)) u_str ESTAB 0 0 * 53836391 * 0 users:(("opendkim",pid=1856103,fd=265)) u_str ESTAB 0 0 * 53827932 * 0 users:(("opendkim",pid=1856103,fd=175)) u_str ESTAB 0 0 * 53826487 * 0 users:(("opendkim",pid=1856103,fd=163)) tcp LISTEN 0 4096 127.0.0.1:8891 0.0.0.0:* users:(("opendkim",pid=1856103,fd=3))
`
After some time there will be several hundred (if not more) open streams. Eventually it'll grow and we'll start seeing milter-rejects in postfix logs. We can get things working again by restarting opendkim.
warning: milter inet:127.0.0.1:8891: can't read SMFIC_OPTNEG reply packet header: Connection reset by peer warning: milter inet:127.0.0.1:8891: read error in initial handshake NOQUEUE: milter-reject: CONNECT from server [IP]: 451 4.7.1 Service unavailable - try again later; proto=SMTP
This only started after we updated sssd to version 2.9.1-4.el9_3.1. When we downgrade to 2.9.1-4.el9_3 it doesn't happen. Like stated before Redhat is not able to confirm if this is a bug with sssd or opendkim since opendkim is not from their repos. We use the version from EPEL.
There is a similar post in the Rocky Linux forums with some additional information that seems to match the same issue we are seeing, https://forums.rockylinux.org/t/rocky-9-3-opendkim-latest-sssd-packages/12327.
I believe this is tied to nss. When enabling debug logging with sssd I can see that opendkim shows up as a client.
(2024-01-31 10:16:01): [nss] [accept_fd_handler] (0x0400): [CID#3] Client [cmd /usr/sbin/opendkim][uid 34188][0x562038d3c500][23] connected! (2024-01-31 10:16:01): [nss] [sss_cmd_get_version] (0x0200): [CID#3] Received client version [1]. (2024-01-31 10:16:01): [nss] [sss_cmd_get_version] (0x0200): [CID#3] Offered version [1]. (2024-01-31 10:16:01): [nss] [cache_req_set_plugin] (0x2000): [CID#3] CR #2: Setting "Enumerate users" plugin (2024-01-31 10:16:01): [nss] [cache_req_send] (0x0400): [CID#3] CR #2: REQ_TRACE: New request [CID #3] 'Enumerate users' (2024-01-31 10:16:01): [nss] [cache_req_select_domains] (0x0400): [CID#3] CR #2: Performing a multi-domain search (2024-01-31 10:16:01): [nss] [cache_req_search_domains] (0x0400): [CID#3] CR #2: Search will bypass the cache and check the data provider (2024-01-31 10:16:01): [nss] [cache_req_validate_domain_enumeration] (0x0400): [CID#3] CR #2: Domain mydomain.com does not support enumeration, skipping... (2024-01-31 10:16:01): [nss] [cache_req_validate_domain_enumeration] (0x0400): [CID#3] CR #2: Enumeration requested but not enabled (2024-01-31 10:16:01): [nss] [cache_req_global_ncache_add] (0x2000): [CID#3] CR #2: This request type does not support global negative cache (2024-01-31 10:16:01): [nss] [cache_req_process_result] (0x0400): [CID#3] CR #2: Finished: Not found (2024-01-31 10:16:01): [nss] [sss_nss_protocol_done] (0x4000): [CID#3] Sending reply: success (2024-01-31 10:16:01): [nss] [cache_req_set_plugin] (0x2000): [CID#3] CR #3: Setting "Enumerate users" plugin (2024-01-31 10:16:01): [nss] [cache_req_send] (0x0400): [CID#3] CR #3: REQ_TRACE: New request [CID #3] 'Enumerate users' (2024-01-31 10:16:01): [nss] [cache_req_select_domains] (0x0400): [CID#3] CR #3: Performing a multi-domain search (2024-01-31 10:16:01): [nss] [cache_req_search_domains] (0x0400): [CID#3] CR #3: Search will bypass the cache and check the data provider (2024-01-31 10:16:01): [nss] [cache_req_validate_domain_enumeration] (0x0400): [CID#3] CR #3: Domain mydomain.com does not support enumeration, skipping... (2024-01-31 10:16:01): [nss] [cache_req_global_ncache_add] (0x2000): [CID#3] CR #3: This request type does not support global negative cache (2024-01-31 10:16:01): [nss] [cache_req_process_result] (0x0400): [CID#3] CR #3: Finished: Not found (2024-01-31 10:16:01): [nss] [sss_nss_protocol_done] (0x4000): [CID#3] Sending reply: not found (2024-01-31 10:16:01): [nss] [sss_nss_endent] (0x0100): [CID#3] Resetting enumeration state (2024-01-31 10:16:01): [nss] [sss_nss_protocol_done] (0x4000): [CID#3] Sending reply: success (2024-01-31 10:16:02): [nss] [get_client_cred] (0x4000): Client [0x562038cec2c0][24] creds: euid[34188] egid[34188] pid[1884642] cmd_line['/usr/sbin/opendkim']. (2024-01-31 10:16:02): [nss] [setup_client_idle_timer] (0x4000): Idle timer re-set for client [0x562038cec2c0][24]
Hi, I have similar issue on Amazon Linux 2023 and SSSD 2.9.1 abd 2.9.4. I've got to downgrade to SSSD 2.5.0 to get it works fine.
We are considering an OS upgrade to Rocky9, wondering if this is still an issue.
We encountered this issue for the first time a few days ago (May 30, 2025) after introducing our first RHEL 9 mail server the day before. The problem manifested itself when the number of open file descriptors exceeded 1024 (which is the default limit on number of open files per process). The majority of open files were related to name resolution, e.g.:
opendkim 1488 opendkim 1216u IPv4 49218967 0t0 UDP [HOSTNAME]:44592->[NAMESERVER]:domain
We are running the following OS, opendkim, and sssd versions:
OS: 5.14.0-503.40.1.el9_5.x86_64 opendkim: 2.11.0 (Compiled with OpenSSL 3.2.2 4 Jun 2024) sssd: 2.9.5
Note: We've been using this exact version of opendkim in RHEL 7 without this issue for several years. This particular server only does DKIM verification, not DKIM signing.
Has there been any progress on this issue?
This appears to be a classic file descriptor leak. In Linux, these are almost always an internal problem with the application itself.
The team should troubleshoot the program that is leaking the descriptors.
In Linux, file descriptors are managed by the kernel on a per-process basis. Each process has its own private table of open file descriptors, and one process cannot directly manipulate another's resources. This process isolation is a core security and stability feature of the OS.
Should you think this is SSSD issue, please take a look on:
https://sssd.io/troubleshooting/basics.html
Then file an issue in sssd, with relevant log entries: https://sssd.io/docs/reporting-bugs.html
We currently have to restart opendkim daily due to the FD leak. We've tried eliminating sssd from use by opendkim to no avail (since there were reports that the sssd libraries may have introduced the problem). We did NOT try backing out the sssd libraries as that seems extreme.
I'm open to suggestions to assist in fixing this FD leak which has only occurred after upgrading from RHEL 7 to RHEL 9 (where opendkim was unchanged at v2.11.0)