Two interface disconnects in a short time cause BIND to hang
Important notices
Before you add a new report, we ask you kindly to acknowledge the following:
- [x] I have read the contributing guide lines at https://github.com/opnsense/core/blob/master/CONTRIBUTING.md
- [x] I am convinced that my issue is new after having checked both open and closed issues at https://github.com/opnsense/core/issues?q=is%3Aissue
Describe the bug
I noticed a hung BIND process in my system today. Closer analysis of the logs revealed the following:
- One of my interfaces,
re1, went down at 07:41:18 and came back up at 07:41:21, three seconds later. - At 07:41:23, a bind reconfiguration was triggered by
rc.linkup. - BIND received a
stopcommand and initiated shutdown at 07:41:24, removing all listeners. - During BIND shutdown, two queries were still received but were answered with SERVFAIL.
- At 07:41:30, the same interface went down again, coming up again a second later at 07:41:31.
- At 07:43:21, a socket timeout was reported (120 seconds after
re1came back for the first time). - This was repeated at 07:45:22, 121 seconds after the first event.
The interface in question, re1, is part of a bridge, bridge1. Neither re1 nor bridge1 have an IP address configured.
The web GUI reported bind as running. Any attempt to stop or restart BIND simply gave me a “Please wait” message which would not go away until I reloaded the page.
On the shell I noticed there was still a running named process, along with a rndc process, several pwait instances for the named process as well as several instances of /bin/sh /usr/local/etc/rc.d/named restart and /bin/sh /usr/local/etc/rc.d/named stop. Apparently each stop/restart attempt just created another stuck instance of the rc.d script (the system also has a cron job that performs a DNS lookup every 30 minutes and issues service named restart if the lookup fails).
I was able to recover by kill -9ing, one after another, the stuck rc.d scripts, then rndc and finally named. After that, the web GUI displayed bind as stopped and I was able to start it, after which it resumed to work normally.
I can only speculate as to why the Ethernet link went down; the cable goes to a wifi bridge with an integrated switch. I am not aware of any power issues (which presumably would have impacted other systems as well, as there is no UPS). However, a flapping LAN connection should not have such an impact on any components running on OPNsense.
To Reproduce
Steps to reproduce the behavior:
- Configure OPNsense with BIND and with an interface which is not configured for IP.
- Simulate a flapping connection by briefly pulling the Ethernet cable, then plugging it back in, repeating several times (duration of connected/disconnected states roughly as described above).
- Observe BIND’s behavior.
Note: Somewhat speculative, as the issue was encountered on a live system and no attempt was made to reproduce it (which likely requires a dedicated test system running on real hardware). Timing appears to be a factor, so it might take several attempts to get this right.
Expected behavior
First of all, I would have expected BIND (or any other service) to deal with such conditions (brief flapping state of an interface, a second reconfiguration attempt while reconfiguration is still in progress) gracefully. I can only speculate that the brief flapping state of the interface triggered two bind reconfigurations in a short time, and the second one came along before the first had finished. If this turns out to be the issue, bind needs to be modified to handle this gracefully – at the worst, the rc.d script could be modified to prevent multiple instances of itself running (exit with an error if another instance is discovered).
Besides, any reconfiguration of IP-related services (IPsec, DHCP, DNS) seem pretty pointless if the interface is not configured to have an IP address of any kind, whether static or dynamic, whether bound directly to the interface or indirectly (as via a bridge). If OPNsense had simply skipped these parts for interfaces not configured for IP, there would not have been any issues in this particular case. However, this would have somewhat alleviated the issue but not avoided it entirely.
Describe alternatives you considered
See above: skipping reconfiguration of IP-related services when a non-IP interface changes state would alleviate the issue somewhat. If such an interface were to enter a flapping state, the issue mentioned above would not arise – this should definitely be considered, as even when everything works as it should, these reconfiguration hooks would still burn unnecessary CPU cycles. However, an IP interface entering a flapping state could still cause issues, so getting services to handle a new reconfig request (while a previous one is still in process) is definitely a must.
Screenshots
ps ax output for the stuck processes:
32357 - S 0:00.03 /usr/local/sbin/rndc -c /usr/local/etc/namedb/rndc.conf stop
81019 - Ss 9:55.37 /usr/local/sbin/named -u bind -c /usr/local/etc/namedb/named.conf
Relevant log files
System log:
2024-05-02T07:45:22 Error configctl error in configd communication Traceback (most recent call last): File "/usr/local/sbin/configctl", line 65, in exec_config_cmd line = sock.recv(65536).decode() socket.timeout: timed out
2024-05-02T07:43:24 Notice opnsense /usr/local/etc/rc.linkup: plugins_configure dns (execute task : bind_configure_do())
2024-05-02T07:43:24 Notice opnsense /usr/local/etc/rc.linkup: plugins_configure dns ()
2024-05-02T07:43:23 Notice opnsense /usr/local/etc/rc.linkup: plugins_configure dhcp (execute task : dhcpd_dhcp_configure())
2024-05-02T07:43:23 Notice opnsense /usr/local/etc/rc.linkup: plugins_configure dhcp ()
2024-05-02T07:43:23 Notice opnsense /usr/local/etc/rc.linkup: plugins_configure ipsec (execute task : ipsec_configure_do(,opt2))
2024-05-02T07:43:23 Notice opnsense /usr/local/etc/rc.linkup: plugins_configure ipsec (,opt2)
2024-05-02T07:43:23 Notice opnsense /usr/local/etc/rc.linkup: ROUTING: entering configure using 'opt2'
2024-05-02T07:43:23 Error opnsense /usr/local/etc/rc.linkup: The command `/sbin/ifconfig 'bridge1' addm 're1'' failed to execute
2024-05-02T07:43:22 Notice opnsense /usr/local/etc/rc.linkup: DEVD: Ethernet attached event for opt2(re1)
2024-05-02T07:43:22 Notice opnsense /usr/local/etc/rc.linkup: DEVD: Ethernet detached event for opt2(re1)
2024-05-02T07:43:21 Error configctl error in configd communication Traceback (most recent call last): File "/usr/local/sbin/configctl", line 65, in exec_config_cmd line = sock.recv(65536).decode() socket.timeout: timed out
2024-05-02T07:41:31 Notice kernel <6>re1: link state changed to UP
2024-05-02T07:41:30 Notice kernel <6>re1: link state changed to DOWN
2024-05-02T07:41:23 Notice opnsense /usr/local/etc/rc.linkup: plugins_configure dns (execute task : bind_configure_do())
2024-05-02T07:41:23 Notice opnsense /usr/local/etc/rc.linkup: plugins_configure dns ()
2024-05-02T07:41:22 Notice opnsense /usr/local/etc/rc.linkup: plugins_configure dhcp (execute task : dhcpd_dhcp_configure())
2024-05-02T07:41:22 Notice opnsense /usr/local/etc/rc.linkup: plugins_configure dhcp ()
2024-05-02T07:41:22 Notice opnsense /usr/local/etc/rc.linkup: plugins_configure ipsec (execute task : ipsec_configure_do(,opt2))
2024-05-02T07:41:22 Notice opnsense /usr/local/etc/rc.linkup: plugins_configure ipsec (,opt2)
2024-05-02T07:41:22 Notice opnsense /usr/local/etc/rc.linkup: ROUTING: entering configure using 'opt2'
2024-05-02T07:41:22 Error opnsense /usr/local/etc/rc.linkup: The command `/sbin/ifconfig 'bridge1' addm 're1'' failed to execute
2024-05-02T07:41:21 Notice opnsense /usr/local/etc/rc.linkup: DEVD: Ethernet attached event for opt2(re1)
2024-05-02T07:41:21 Notice kernel <6>re1: link state changed to UP
2024-05-02T07:41:18 Notice opnsense /usr/local/etc/rc.linkup: DEVD: Ethernet detached event for opt2(re1)
2024-05-02T07:41:18 Notice kernel <6>re1: link state changed to DOWN
BIND log:
2024-05-02T07:41:24 Informational named 02-May-2024 07:41:24.994 query-errors: info: client @0x840c57160 IP_REDACTED#58639 (3.debian.pool.ntp.org): query failed (SERVFAIL) for 3.debian.pool.ntp.org/IN/AAAA at query.c:6378
2024-05-02T07:41:24 Informational named 02-May-2024 07:41:24.994 query-errors: info: client @0x840c10b60 IP_REDACTED#58639 (3.debian.pool.ntp.org): query failed (SERVFAIL) for 3.debian.pool.ntp.org/IN/A at query.c:6378
2024-05-02T07:41:24 Notice named 02-May-2024 07:41:24.991 general: notice: stopping command channel on 127.0.0.1#9530
2024-05-02T07:41:24 Informational named 02-May-2024 07:41:24.991 general: info: shutting down: flushing changes
2024-05-02T07:41:24 Informational named 02-May-2024 07:41:24.990 network: info: no longer listening on IP_REDACTED#53
2024-05-02T07:41:24 Informational named 02-May-2024 07:41:24.989 network: info: no longer listening on IP_REDACTED#53
2024-05-02T07:41:24 Informational named 02-May-2024 07:41:24.987 network: info: no longer listening on IP_REDACTED#53
2024-05-02T07:41:24 Informational named 02-May-2024 07:41:24.985 general: info: received control channel command 'stop'
Additional context
Add any other context about the problem here.
Environment
OPNsense 24.1.5_3-amd64 FreeBSD 13.2-RELEASE-p11 OpenSSL 3.0.13