sonic-buildimage
[healthd] fix healthd shutdown race
Why I did it
To fix errors that happen when writing to the queue:
Jun 5 23:04:41.798613 r-leopard-56 NOTICE healthd: Caught SIGTERM - exiting...
Jun 5 23:04:41.798985 r-leopard-56 NOTICE healthd: Caught SIGTERM - exiting...
Jun 5 23:04:41.799535 r-leopard-56 NOTICE healthd: Caught SIGTERM - exiting...
Jun 5 23:04:41.806010 r-leopard-56 NOTICE healthd: Caught SIGTERM - exiting...
Jun 5 23:04:41.814075 r-leopard-56 ERR healthd: system_service[Errno 104] Connection reset by peer
Jun 5 23:04:41.824135 r-leopard-56 ERR healthd: Traceback (most recent call last):#012 File "/usr/local/lib/python3.9/dist-packages/health_checker/sysmonitor.py", line 484, in system_service#012 msg = self.myQ.get(timeout=QUEUE_TIMEOUT)#012 File "<string>", line 2, in get#012 File "/usr/lib/python3.9/multiprocessing/managers.py", line 809, in _callmethod#012 kind, result = conn.recv()#012 File "/usr/lib/python3.9/multiprocessing/connection.py", line 255, in recv#012 buf = self._recv_bytes()#012 File "/usr/lib/python3.9/multiprocessing/connection.py", line 419, in _recv_bytes#012 buf = self._recv(4)#012 File "/usr/lib/python3.9/multiprocessing/connection.py", line 384, in _recv#012 chunk = read(handle, remaining)#012ConnectionResetError: [Errno 104] Connection reset by peer
Jun 5 23:04:41.826489 r-leopard-56 INFO healthd[8494]: ERROR:dbus.connection:Exception in handler for D-Bus signal:
Jun 5 23:04:41.826591 r-leopard-56 INFO healthd[8494]: Traceback (most recent call last):
Jun 5 23:04:41.826640 r-leopard-56 INFO healthd[8494]: File "/usr/lib/python3/dist-packages/dbus/connection.py", line 232, in maybe_handle_message
Jun 5 23:04:41.826686 r-leopard-56 INFO healthd[8494]: self._handler(*args, **kwargs)
Jun 5 23:04:41.826738 r-leopard-56 INFO healthd[8494]: File "/usr/local/lib/python3.9/dist-packages/health_checker/sysmonitor.py", line 82, in on_job_removed
Jun 5 23:04:41.826785 r-leopard-56 INFO healthd[8494]: self.task_notify(msg)
Jun 5 23:04:41.826831 r-leopard-56 INFO healthd[8494]: File "/usr/local/lib/python3.9/dist-packages/health_checker/sysmonitor.py", line 110, in task_notify
Jun 5 23:04:41.826877 r-leopard-56 INFO healthd[8494]: self.task_queue.put(msg)
Jun 5 23:04:41.826923 r-leopard-56 INFO healthd[8494]: File "<string>", line 2, in put
Jun 5 23:04:41.826973 r-leopard-56 INFO healthd[8494]: File "/usr/lib/python3.9/multiprocessing/managers.py", line 808, in _callmethod
Jun 5 23:04:41.827018 r-leopard-56 INFO healthd[8494]: conn.send((self._id, methodname, args, kwds))
Jun 5 23:04:41.827065 r-leopard-56 INFO healthd[8494]: File "/usr/lib/python3.9/multiprocessing/connection.py", line 211, in send
Jun 5 23:04:41.827115 r-leopard-56 INFO healthd[8494]: self._send_bytes(_ForkingPickler.dumps(obj))
Jun 5 23:04:41.827158 r-leopard-56 INFO healthd[8494]: File "/usr/lib/python3.9/multiprocessing/connection.py", line 416, in _send_bytes
Jun 5 23:04:41.827199 r-leopard-56 INFO healthd[8494]: self._send(header + buf)
Jun 5 23:04:41.827254 r-leopard-56 INFO healthd[8494]: File "/usr/lib/python3.9/multiprocessing/connection.py", line 373, in _send
Jun 5 23:04:41.827322 r-leopard-56 INFO healthd[8494]: n = write(self._handle, buf)
Jun 5 23:04:41.827368 r-leopard-56 INFO healthd[8494]: BrokenPipeError: [Errno 32] Broken pipe
Jun 5 23:04:42.800216 r-leopard-56 NOTICE healthd: Caught SIGTERM - exiting...
When the multiprocessing.Manager is shut down, the queue raises the errors above. This happens during shutdown (fast-reboot, warm-reboot).
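A standalone way to reproduce this failure mode outside healthd (illustrative sketch only, not sysmonitor.py code; the child typically sees ConnectionResetError or EOFError, matching the traceback above):

import multiprocessing
import time

def reader(q):
    try:
        # Blocked on the manager's proxy connection, like system_service().
        q.get(timeout=15)
    except Exception as e:
        # Typically ConnectionResetError / EOFError once the manager is gone.
        print('reader failed:', repr(e))

if __name__ == '__main__':
    mgr = multiprocessing.Manager()
    q = mgr.Queue()
    p = multiprocessing.Process(target=reader, args=(q,))
    p.start()
    time.sleep(1)
    mgr.shutdown()   # tears the proxy connection out from under the reader
    p.join()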
Work item tracking
- Microsoft ADO (number only):
How I did it
Remove the call to shutdown(); cleanup happens automatically when the manager is garbage-collected, as per the documentation - https://docs.python.org/3/library/multiprocessing.html
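The intent, as a minimal sketch (mpmgr/myQ follow sysmonitor.py naming; this is not the exact diff):

import multiprocessing

class SysmonitorSketch:
    # Illustrative only: shows the idea of the change, not the real class.

    def __init__(self):
        self.mpmgr = multiprocessing.Manager()
        self.myQ = self.mpmgr.Queue()

    def cleanup(self):
        # Before: an explicit self.mpmgr.shutdown() here tore down the manager
        # while other subprocesses could still be blocked on myQ.get()/put(),
        # producing the ConnectionResetError / BrokenPipeError above.
        # After: no explicit call; the manager process is reaped when this
        # object is garbage-collected, as the multiprocessing docs describe.
        pass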
How to verify it
Run warm-reboot, fast-reboot multiple times and verify no errors in the log.
Which release branch to backport (provide reason below if selected)
- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111
- [x] 202205
- [x] 202311
- [x] 202405
Tested branch (Please provide the tested image version)
- [ ]
- [ ]
Description for the changelog
Link to config_db schema for YANG module changes
A picture of a cute animal (not mandatory but encouraged)
@Junchao-Mellanox could you please help to review? @stepanblyschak please followup on the pr checker failure
@lguohan @yxieca please help to review, and let's make sure to take this to 202311 and 202405
/azpw run Azure.sonic-buildimage
/AzurePipelines run Azure.sonic-buildimage
Azure Pipelines successfully started running 1 pipeline(s).
This change is causing some other side effects.
During the shutdown flow, system-health takes > 90 sec to shut down.
root@sonic:/home/admin# journalctl -f -u system-health
Aug 28 19:56:20 r-bobcat-01 healthd[14113]: checked_unit_status for [ multi-user.target ]
Aug 28 19:56:20 r-bobcat-01 healthd[14113]: checked_unit_status for [ getty.target ]
Aug 28 19:56:20 r-bobcat-01 systemd[1]: Stopping system-health.service - SONiC system health monitor...
Aug 28 19:56:20 r-bobcat-01 healthd[14113]: Caught SIGTERM - exiting...
Aug 28 19:56:20 r-bobcat-01 healthd[13967]: Caught SIGTERM - exiting...
Aug 28 19:56:20 r-bobcat-01 healthd[14104]: Caught SIGTERM - exiting...
Aug 28 19:56:21 r-bobcat-01 healthd[14123]: Caught SIGTERM - exiting...
Aug 28 19:56:31 r-bobcat-01 healthd[13967]: Attempting to kill sysmon main process with pid 14113
Aug 28 19:56:31 r-bobcat-01 healthd[13967]: Sysmon main process with pid 14113 could not be killed
Aug 28 19:56:32 r-bobcat-01 healthd[14123]: Caught SIGTERM - exiting...
Aug 28 19:58:01 r-bobcat-01 systemd[1]: system-health.service: State 'final-sigterm' timed out. Killing.
Aug 28 19:58:01 r-bobcat-01 systemd[1]: system-health.service: Killing process 14123 (healthd) with signal SIGKILL.
Aug 28 19:58:01 r-bobcat-01 systemd[1]: system-health.service: Failed with result 'timeout'.
Aug 28 19:58:01 r-bobcat-01 systemd[1]: Stopped system-health.service - SONiC system health monitor.
I think, because of the shutdown call removal, the msg = self.myQ.get(timeout=QUEUE_TIMEOUT) call is blocked for 15 sec. Within 10 sec after the SIGTERM (https://github.com/sonic-net/sonic-buildimage/blob/a8c54ad1914f6a4b8f663c7b37909aa7e59a984e/src/system-health/health_checker/sysmonitor.py#L517), SIGKILL is sent and the sysmon process gets terminated without ever issuing task_stop for its child processes, causing them to block the stop flow for 90 sec until systemd finally issues SIGKILL.
Please note that this is not seen every time; when the queue has some items at that point, the get call returns and the flow continues without any problem. The probability of reproducing it is about 25%.
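A minimal, self-contained illustration of that race (QUEUE_TIMEOUT and the stopping event follow the description above; the script itself is illustrative, not sysmonitor.py code):

import multiprocessing
import queue
import time

QUEUE_TIMEOUT = 15      # how long the worker blocks in get()
STOP_JOIN_TIMEOUT = 10  # how long the parent waits before escalating

def worker(q, stop_event):
    while not stop_event.is_set():
        try:
            # With an empty queue this call holds the worker for the full
            # 15 sec, so a stop requested at t=1 is only noticed at t~15.
            q.get(timeout=QUEUE_TIMEOUT)
        except queue.Empty:
            pass

if __name__ == '__main__':
    mgr = multiprocessing.Manager()
    q = mgr.Queue()
    stop_event = multiprocessing.Event()
    p = multiprocessing.Process(target=worker, args=(q, stop_event))
    p.start()
    time.sleep(1)
    stop_event.set()
    p.join(STOP_JOIN_TIMEOUT)  # returns with the worker still blocked (15 > 10)
    # Like healthd, the parent now sees a live child and would escalate to
    # SIGKILL before the child ever runs its own stop logic.
    print('worker still alive after join timeout:', p.is_alive())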
/azp run Azure.sonic-buildimage
Azure Pipelines successfully started running 1 pipeline(s).
@stepanblyschak please address the above issue from Vivek
@stepanblyschak PR still in draft?
@stepanblyschak fix build errors?
@prgeor Please see the bug report by @vivekrnv. This PR re-introduces another problem, so the fix must handle both: the error that is logged during shutdown and the occasional hang of healthd during shutdown.
I investigated the issue a bit more and there are a couple of problems (a possible mitigation is sketched after the reproduction command below):
- Shutdown of the queue manager (mpmgr) will cause errors (see PR description) in other subprocesses:
#Clear the resources of mpmgr- Queue
self.mpmgr.shutdown()
- The shutdown logic will wait at most QUEUE_TIMEOUT = 15 sec before it is able to exit from the task loop.
while not self.task_stopping_event.is_set():
try:
msg = self.myQ.get(timeout=QUEUE_TIMEOUT)
event = msg["unit"]
The self.myQ.get() call blocks for up to QUEUE_TIMEOUT, during which self.task_stopping_event cannot be detected.
- Joining the process will exit due to a timeout, since QUEUE_TIMEOUT > self._stop_timeout_secs.
# Wait for the process to exit
self._task_process.join(self._stop_timeout_secs)
- When attempting to kill the process:
# If the process didn't exit, attempt to kill it
if self._task_process.is_alive():
logger.log_notice("Attempting to kill sysmon main process with pid {}".format(self._task_process.pid))
os.kill(self._task_process.pid, signal.SIGKILL)
if self._task_process.is_alive():
logger.log_error("Sysmon main process with pid {} could not be killed".format(self._task_process.pid))
return False
The os.kill() leaves a zombie process since the child has not been waited on yet, so is_alive() then returns True. The fix could be to use self._task_process.kill() instead.
- The monitor system bus task cannot be stopped:
monitor_system_bus.task_stop()
since MonitorSystemBusTask is blocked in loop.run():
loop = GLib.MainLoop()
loop.run()
Command line to reproduce:
sudo systemctl start system-health ; sleep 10; echo "$(date): Stopping..."; sudo systemctl stop system-health; echo "$(date): Stopped"
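For reference, a hedged sketch of one possible direction for addressing the points above (names mirror sysmonitor.py, POLL_TIMEOUT is a hypothetical value, and this is not the merged fix):

import multiprocessing
import queue

POLL_TIMEOUT = 1  # hypothetical short poll replacing the blocking 15 sec get()

class SysmonitorStopSketch:
    # Illustrative only; reuses a few sysmonitor.py names.

    def __init__(self):
        self.mpmgr = multiprocessing.Manager()
        self.myQ = self.mpmgr.Queue()
        self.task_stopping_event = multiprocessing.Event()
        self._stop_timeout_secs = 20
        self._task_process = None

    def system_service(self):
        while not self.task_stopping_event.is_set():
            try:
                msg = self.myQ.get(timeout=POLL_TIMEOUT)
            except queue.Empty:
                continue  # nothing queued; re-check the stopping event
            self.handle(msg)

    def handle(self, msg):
        pass

    def task_run(self):
        self._task_process = multiprocessing.Process(target=self.system_service)
        self._task_process.start()

    def task_stop(self):
        self.task_stopping_event.set()
        # With a 1 sec poll the loop exits well before this join times out.
        self._task_process.join(self._stop_timeout_secs)
        if self._task_process.is_alive():
            # Process.kill() followed by join() reaps the child, so
            # is_alive() no longer reports a zombie as running the way a
            # bare os.kill() without a wait can.
            self._task_process.kill()
            self._task_process.join()
        # The GLib main loop inside MonitorSystemBusTask would additionally
        # need loop.quit() called from its stop path; omitted here.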
@prgeor Please review
/azpw run
/AzurePipelines run
Azure Pipelines successfully started running 1 pipeline(s).
Seemingly unrelated tests failed, restarting:
FAIL: pceplib/test/pcep_msg_tests_valgrind.sh
FAIL: pceplib/test/pcep_pcc_api_tests_valgrind.sh
FAIL: pceplib/test/pcep_session_logic_tests_valgrind.sh
FAIL: pceplib/test/pcep_socket_comm_tests_valgrind.sh
FAIL: pceplib/test/pcep_timers_tests_valgrind.sh
FAIL: pceplib/test/pcep_utils_tests_valgrind.sh
/azpw run Azure.sonic-buildimage
/AzurePipelines run Azure.sonic-buildimage
Azure Pipelines successfully started running 1 pipeline(s).
Broadcom build failed, restarting:
Step 22/32 : COPY ["files/dsserve", "/usr/bin/"]
COPY failed: file not found in build context or excluded by .dockerignore: stat files/dsserve: file does not exist
[ FAIL LOG END ] [ target/docker-gbsyncd-broncos.gz ]
/azpw run Azure.sonic-buildimage
/AzurePipelines run Azure.sonic-buildimage
Azure Pipelines successfully started running 1 pipeline(s).
/azpw run Azure.sonic-buildimage
/AzurePipelines run Azure.sonic-buildimage
Azure Pipelines successfully started running 1 pipeline(s).
Seems like a network issue:
curl: (22) The requested URL returned error: 404
Try 5: /usr/bin/curl --retry-connrefused --retry 5 failed to get: -f
dget: curl ipmitool_1.8.19-4+deb12u1.debian.tar.xz https://deb.debian.org/debian/pool/main/i/ipmitool/ipmitool_1.8.19-4+deb12u1.debian.tar.xz failed
ipmitool_1.8.19-4+deb12u1.dsc:
Good signature found
validating ipmitool_1.8.19.orig.tar.gz
skipping ipmitool_1.8.19-4+deb12u1.debian.tar.xz (not present)
All files validated successfully.
dpkg-source: error: cannot fstat file ./ipmitool_1.8.19-4+deb12u1.debian.tar.xz: No such file or directory
make[1]: *** [Makefile:9: /sonic/target/debs/bookworm/ipmitool_1.8.19-4+deb12u1_amd64.deb] Error 25
/azpw run Azure.sonic-buildimage