plugins icon indicating copy to clipboard operation
plugins copied to clipboard

Two interface disconnects in a short time cause BIND to hang

Open mvglasow opened this issue 1 year ago • 0 comments

Important notices

Before you add a new report, we ask you kindly to acknowledge the following:

  • [x] I have read the contributing guide lines at https://github.com/opnsense/core/blob/master/CONTRIBUTING.md
  • [x] I am convinced that my issue is new after having checked both open and closed issues at https://github.com/opnsense/core/issues?q=is%3Aissue

Describe the bug

I noticed a hung BIND process in my system today. Closer analysis of the logs revealed the following:

  • One of my interfaces, re1, went down at 07:41:18 and came back up at 07:41:21, three seconds later.
  • At 07:41:23, a bind reconfiguration was triggered by rc.linkup.
  • BIND received a stop command and initiated shutdown at 07:41:24, removing all listeners.
  • During BIND shutdown, two queries were still received but were answered with SERVFAIL.
  • At 07:41:30, the same interface went down again, coming up again a second later at 07:41:31.
  • At 07:43:21, a socket timeout was reported (120 seconds after re1 came back for the first time).
  • This was repeated at 07:45:22, 121 seconds after the first event.

The interface in question, re1, is part of a bridge, bridge1. Neither re1 nor bridge1 have an IP address configured.

The web GUI reported bind as running. Any attempt to stop or restart BIND simply gave me a “Please wait” message which would not go away until I reloaded the page.

On the shell I noticed there was still a running named process, along with a rndc process, several pwait instances for the named process as well as several instances of /bin/sh /usr/local/etc/rc.d/named restart and /bin/sh /usr/local/etc/rc.d/named stop. Apparently each stop/restart attempt just created another stuck instance of the rc.d script (the system also has a cron job that performs a DNS lookup every 30 minutes and issues service named restart if the lookup fails).

I was able to recover by kill -9ing, one after another, the stuck rc.d scripts, then rndc and finally named. After that, the web GUI displayed bind as stopped and I was able to start it, after which it resumed to work normally.

I can only speculate as to why the Ethernet link went down; the cable goes to a wifi bridge with an integrated switch. I am not aware of any power issues (which presumably would have impacted other systems as well, as there is no UPS). However, a flapping LAN connection should not have such an impact on any components running on OPNsense.

To Reproduce

Steps to reproduce the behavior:

  1. Configure OPNsense with BIND and with an interface which is not configured for IP.
  2. Simulate a flapping connection by briefly pulling the Ethernet cable, then plugging it back in, repeating several times (duration of connected/disconnected states roughly as described above).
  3. Observe BIND’s behavior.

Note: Somewhat speculative, as the issue was encountered on a live system and no attempt was made to reproduce it (which likely requires a dedicated test system running on real hardware). Timing appears to be a factor, so it might take several attempts to get this right.

Expected behavior

First of all, I would have expected BIND (or any other service) to deal with such conditions (brief flapping state of an interface, a second reconfiguration attempt while reconfiguration is still in progress) gracefully. I can only speculate that the brief flapping state of the interface triggered two bind reconfigurations in a short time, and the second one came along before the first had finished. If this turns out to be the issue, bind needs to be modified to handle this gracefully – at the worst, the rc.d script could be modified to prevent multiple instances of itself running (exit with an error if another instance is discovered).

Besides, any reconfiguration of IP-related services (IPsec, DHCP, DNS) seem pretty pointless if the interface is not configured to have an IP address of any kind, whether static or dynamic, whether bound directly to the interface or indirectly (as via a bridge). If OPNsense had simply skipped these parts for interfaces not configured for IP, there would not have been any issues in this particular case. However, this would have somewhat alleviated the issue but not avoided it entirely.

Describe alternatives you considered

See above: skipping reconfiguration of IP-related services when a non-IP interface changes state would alleviate the issue somewhat. If such an interface were to enter a flapping state, the issue mentioned above would not arise – this should definitely be considered, as even when everything works as it should, these reconfiguration hooks would still burn unnecessary CPU cycles. However, an IP interface entering a flapping state could still cause issues, so getting services to handle a new reconfig request (while a previous one is still in process) is definitely a must.

Screenshots

ps ax output for the stuck processes:

32357  -  S        0:00.03 /usr/local/sbin/rndc -c /usr/local/etc/namedb/rndc.conf stop
81019  -  Ss       9:55.37 /usr/local/sbin/named -u bind -c /usr/local/etc/namedb/named.conf

Relevant log files

System log:

2024-05-02T07:45:22	Error	configctl	 error in configd communication  Traceback (most recent call last):   File "/usr/local/sbin/configctl", line 65, in exec_config_cmd     line = sock.recv(65536).decode() socket.timeout: timed out
2024-05-02T07:43:24	Notice	opnsense	 /usr/local/etc/rc.linkup: plugins_configure dns (execute task : bind_configure_do())
2024-05-02T07:43:24	Notice	opnsense	 /usr/local/etc/rc.linkup: plugins_configure dns ()
2024-05-02T07:43:23	Notice	opnsense	 /usr/local/etc/rc.linkup: plugins_configure dhcp (execute task : dhcpd_dhcp_configure())
2024-05-02T07:43:23	Notice	opnsense	 /usr/local/etc/rc.linkup: plugins_configure dhcp ()
2024-05-02T07:43:23	Notice	opnsense	 /usr/local/etc/rc.linkup: plugins_configure ipsec (execute task : ipsec_configure_do(,opt2))
2024-05-02T07:43:23	Notice	opnsense	 /usr/local/etc/rc.linkup: plugins_configure ipsec (,opt2)
2024-05-02T07:43:23	Notice	opnsense	 /usr/local/etc/rc.linkup: ROUTING: entering configure using 'opt2'
2024-05-02T07:43:23	Error	opnsense	 /usr/local/etc/rc.linkup: The command `/sbin/ifconfig 'bridge1' addm 're1'' failed to execute
2024-05-02T07:43:22	Notice	opnsense	 /usr/local/etc/rc.linkup: DEVD: Ethernet attached event for opt2(re1)
2024-05-02T07:43:22	Notice	opnsense	 /usr/local/etc/rc.linkup: DEVD: Ethernet detached event for opt2(re1)
2024-05-02T07:43:21	Error	configctl	 error in configd communication  Traceback (most recent call last):   File "/usr/local/sbin/configctl", line 65, in exec_config_cmd     line = sock.recv(65536).decode() socket.timeout: timed out
2024-05-02T07:41:31	Notice	kernel	 <6>re1: link state changed to UP
2024-05-02T07:41:30	Notice	kernel	 <6>re1: link state changed to DOWN
2024-05-02T07:41:23	Notice	opnsense	 /usr/local/etc/rc.linkup: plugins_configure dns (execute task : bind_configure_do())
2024-05-02T07:41:23	Notice	opnsense	 /usr/local/etc/rc.linkup: plugins_configure dns ()
2024-05-02T07:41:22	Notice	opnsense	 /usr/local/etc/rc.linkup: plugins_configure dhcp (execute task : dhcpd_dhcp_configure())
2024-05-02T07:41:22	Notice	opnsense	 /usr/local/etc/rc.linkup: plugins_configure dhcp ()
2024-05-02T07:41:22	Notice	opnsense	 /usr/local/etc/rc.linkup: plugins_configure ipsec (execute task : ipsec_configure_do(,opt2))
2024-05-02T07:41:22	Notice	opnsense	 /usr/local/etc/rc.linkup: plugins_configure ipsec (,opt2)
2024-05-02T07:41:22	Notice	opnsense	 /usr/local/etc/rc.linkup: ROUTING: entering configure using 'opt2'
2024-05-02T07:41:22	Error	opnsense	 /usr/local/etc/rc.linkup: The command `/sbin/ifconfig 'bridge1' addm 're1'' failed to execute
2024-05-02T07:41:21	Notice	opnsense	 /usr/local/etc/rc.linkup: DEVD: Ethernet attached event for opt2(re1)
2024-05-02T07:41:21	Notice	kernel	 <6>re1: link state changed to UP
2024-05-02T07:41:18	Notice	opnsense	 /usr/local/etc/rc.linkup: DEVD: Ethernet detached event for opt2(re1)
2024-05-02T07:41:18	Notice	kernel	 <6>re1: link state changed to DOWN

BIND log:

2024-05-02T07:41:24	Informational	named	 02-May-2024 07:41:24.994 query-errors: info: client @0x840c57160 IP_REDACTED#58639 (3.debian.pool.ntp.org): query failed (SERVFAIL) for 3.debian.pool.ntp.org/IN/AAAA at query.c:6378
2024-05-02T07:41:24	Informational	named	 02-May-2024 07:41:24.994 query-errors: info: client @0x840c10b60 IP_REDACTED#58639 (3.debian.pool.ntp.org): query failed (SERVFAIL) for 3.debian.pool.ntp.org/IN/A at query.c:6378
2024-05-02T07:41:24	Notice	named	 02-May-2024 07:41:24.991 general: notice: stopping command channel on 127.0.0.1#9530
2024-05-02T07:41:24	Informational	named	 02-May-2024 07:41:24.991 general: info: shutting down: flushing changes
2024-05-02T07:41:24	Informational	named	 02-May-2024 07:41:24.990 network: info: no longer listening on IP_REDACTED#53
2024-05-02T07:41:24	Informational	named	 02-May-2024 07:41:24.989 network: info: no longer listening on IP_REDACTED#53
2024-05-02T07:41:24	Informational	named	 02-May-2024 07:41:24.987 network: info: no longer listening on IP_REDACTED#53
2024-05-02T07:41:24	Informational	named	 02-May-2024 07:41:24.985 general: info: received control channel command 'stop'

Additional context

Add any other context about the problem here.

Environment

OPNsense 24.1.5_3-amd64 FreeBSD 13.2-RELEASE-p11 OpenSSL 3.0.13

mvglasow avatar May 02 '24 19:05 mvglasow