sonic-buildimage icon indicating copy to clipboard operation
sonic-buildimage copied to clipboard

[healthd] fix healthd shutdown race

Open stepanblyschak opened this issue 1 year ago • 9 comments
trafficstars

Why I did it

To fix errors that happen when writing to the queue:

Jun  5 23:04:41.798613 r-leopard-56 NOTICE healthd: Caught SIGTERM - exiting...
Jun  5 23:04:41.798985 r-leopard-56 NOTICE healthd: Caught SIGTERM - exiting...
Jun  5 23:04:41.799535 r-leopard-56 NOTICE healthd: Caught SIGTERM - exiting...
Jun  5 23:04:41.806010 r-leopard-56 NOTICE healthd: Caught SIGTERM - exiting...
Jun  5 23:04:41.814075 r-leopard-56 ERR healthd: system_service[Errno 104] Connection reset by peer
Jun  5 23:04:41.824135 r-leopard-56 ERR healthd: Traceback (most recent call last):#012  File "/usr/local/lib/python3.9/dist-packages/health_checker/sysmonitor.py", line 484, in system_service#012    msg = self.myQ.get(timeout=QUEUE_TIMEOUT)#012  File "<string>", line 2, in get#012  File "/usr/lib/python3.9/multiprocessing/managers.py", line 809, in _callmethod#012    kind, result = conn.recv()#012  File "/usr/lib/python3.9/multiprocessing/connection.py", line 255, in recv#012    buf = self._recv_bytes()#012  File "/usr/lib/python3.9/multiprocessing/connection.py", line 419, in _recv_bytes#012    buf = self._recv(4)#012  File "/usr/lib/python3.9/multiprocessing/connection.py", line 384, in _recv#012    chunk = read(handle, remaining)#012ConnectionResetError: [Errno 104] Connection reset by peer
Jun  5 23:04:41.826489 r-leopard-56 INFO healthd[8494]: ERROR:dbus.connection:Exception in handler for D-Bus signal:
Jun  5 23:04:41.826591 r-leopard-56 INFO healthd[8494]: Traceback (most recent call last):
Jun  5 23:04:41.826640 r-leopard-56 INFO healthd[8494]:   File "/usr/lib/python3/dist-packages/dbus/connection.py", line 232, in maybe_handle_message
Jun  5 23:04:41.826686 r-leopard-56 INFO healthd[8494]:     self._handler(*args, **kwargs)
Jun  5 23:04:41.826738 r-leopard-56 INFO healthd[8494]:   File "/usr/local/lib/python3.9/dist-packages/health_checker/sysmonitor.py", line 82, in on_job_removed
Jun  5 23:04:41.826785 r-leopard-56 INFO healthd[8494]:     self.task_notify(msg)
Jun  5 23:04:41.826831 r-leopard-56 INFO healthd[8494]:   File "/usr/local/lib/python3.9/dist-packages/health_checker/sysmonitor.py", line 110, in task_notify
Jun  5 23:04:41.826877 r-leopard-56 INFO healthd[8494]:     self.task_queue.put(msg)
Jun  5 23:04:41.826923 r-leopard-56 INFO healthd[8494]:   File "<string>", line 2, in put
Jun  5 23:04:41.826973 r-leopard-56 INFO healthd[8494]:   File "/usr/lib/python3.9/multiprocessing/managers.py", line 808, in _callmethod
Jun  5 23:04:41.827018 r-leopard-56 INFO healthd[8494]:     conn.send((self._id, methodname, args, kwds))
Jun  5 23:04:41.827065 r-leopard-56 INFO healthd[8494]:   File "/usr/lib/python3.9/multiprocessing/connection.py", line 211, in send
Jun  5 23:04:41.827115 r-leopard-56 INFO healthd[8494]:     self._send_bytes(_ForkingPickler.dumps(obj))
Jun  5 23:04:41.827158 r-leopard-56 INFO healthd[8494]:   File "/usr/lib/python3.9/multiprocessing/connection.py", line 416, in _send_bytes
Jun  5 23:04:41.827199 r-leopard-56 INFO healthd[8494]:     self._send(header + buf)
Jun  5 23:04:41.827254 r-leopard-56 INFO healthd[8494]:   File "/usr/lib/python3.9/multiprocessing/connection.py", line 373, in _send
Jun  5 23:04:41.827322 r-leopard-56 INFO healthd[8494]:     n = write(self._handle, buf)
Jun  5 23:04:41.827368 r-leopard-56 INFO healthd[8494]: BrokenPipeError: [Errno 32] Broken pipe
Jun  5 23:04:42.800216 r-leopard-56 NOTICE healthd: Caught SIGTERM - exiting...

When the multiprocessing.Manager is shutdown the queue will raise the above errors. This happens during shutdown - fast-reboot, warm-reboot.

Work item tracking
  • Microsoft ADO (number only):

How I did it

Remove the call to shutdown, the cleanup will happen automatically when GC runs as per documentation - https://docs.python.org/3/library/multiprocessing.html

How to verify it

Run warm-reboot, fast-reboot multiple times and verify no errors in the log.

Which release branch to backport (provide reason below if selected)

  • [ ] 201811
  • [ ] 201911
  • [ ] 202006
  • [ ] 202012
  • [ ] 202106
  • [ ] 202111
  • [x] 202205
  • [x] 202311
  • [x] 202405

Tested branch (Please provide the tested image version)

  • [ ]
  • [ ]

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

stepanblyschak avatar Jul 08 '24 07:07 stepanblyschak

@Junchao-Mellanox could you please help to review? @stepanblyschak please followup on the pr checker failure

liat-grozovik avatar Jul 09 '24 17:07 liat-grozovik

@lguohan @yxieca please help to review and lets for sure take to 202311 and 202405

liat-grozovik avatar Jul 14 '24 12:07 liat-grozovik

/azpw run Azure.sonic-buildimage

stepanblyschak avatar Jul 21 '24 12:07 stepanblyschak

/AzurePipelines run Azure.sonic-buildimage

mssonicbld avatar Jul 21 '24 12:07 mssonicbld

Azure Pipelines successfully started running 1 pipeline(s).

azure-pipelines[bot] avatar Jul 21 '24 12:07 azure-pipelines[bot]

This change is causing some other side effects,

During the shutdown flow, system-health is taking > 90 sec for it to shutdown.

root@sonic:/home/admin# journalctl -f -u system-health
Aug 28 19:56:20 r-bobcat-01 healthd[14113]: checked_unit_status for [ multi-user.target ]
Aug 28 19:56:20 r-bobcat-01 healthd[14113]: checked_unit_status for [ getty.target ]
Aug 28 19:56:20 r-bobcat-01 systemd[1]: Stopping system-health.service - SONiC system health monitor...
Aug 28 19:56:20 r-bobcat-01 healthd[14113]: Caught SIGTERM - exiting...
Aug 28 19:56:20 r-bobcat-01 healthd[13967]: Caught SIGTERM - exiting...
Aug 28 19:56:20 r-bobcat-01 healthd[14104]: Caught SIGTERM - exiting...
Aug 28 19:56:21 r-bobcat-01 healthd[14123]: Caught SIGTERM - exiting...
Aug 28 19:56:31 r-bobcat-01 healthd[13967]: Attempting to kill sysmon main process with pid 14113
Aug 28 19:56:31 r-bobcat-01 healthd[13967]: Sysmon main process with pid 14113 could not be killed
Aug 28 19:56:32 r-bobcat-01 healthd[14123]: Caught SIGTERM - exiting...
Aug 28 19:58:01 r-bobcat-01 systemd[1]: system-health.service: State 'final-sigterm' timed out. Killing.
Aug 28 19:58:01 r-bobcat-01 systemd[1]: system-health.service: Killing process 14123 (healthd) with signal SIGKILL.
Aug 28 19:58:01 r-bobcat-01 systemd[1]: system-health.service: Failed with result 'timeout'.
Aug 28 19:58:01 r-bobcat-01 systemd[1]: Stopped system-health.service - SONiC system health monitor.

I think, because of the shutdown call removal, the msg = self.myQ.get(timeout=QUEUE_TIMEOUT) is blocked for 15 sec. Within 10 sec after the SIGTERM (https://github.com/sonic-net/sonic-buildimage/blob/a8c54ad1914f6a4b8f663c7b37909aa7e59a984e/src/system-health/health_checker/sysmonitor.py#L517), The SIGKIll is given and ihe sysmon process gets terminated without ever issuing the task_stop for it's child processes causing it them to block the stop flow for 90 sec until the systemd actually issues SIGKILL.

Please note that this is not seen everytime, when the Q has some items during this time, the get call gets returned and it continues without any problem. The probability of repro is 25%

vivekrnv avatar Aug 30 '24 20:08 vivekrnv

/azp run Azure.sonic-buildimage

liat-grozovik avatar Sep 03 '24 08:09 liat-grozovik

Azure Pipelines successfully started running 1 pipeline(s).

azure-pipelines[bot] avatar Sep 03 '24 08:09 azure-pipelines[bot]

@stepanblyschak please address the above issue from Vivek

prgeor avatar Sep 03 '24 16:09 prgeor

@stepanblyschak PR still in draft?

prgeor avatar Oct 15 '24 06:10 prgeor

@stepanblyschak fix build errors?

prgeor avatar Oct 15 '24 06:10 prgeor

@prgeor Please see the bug report by @vivekrnv. This PR re-introduces another problem, so the fix must handle both - the error that is logged during shutdown and the occasional hang of healthd during shutdown

stepanblyschak avatar Oct 16 '24 07:10 stepanblyschak

I investigate the issue a bit more and there are a couple of problems:

  1. Shutdown of the queue will cause errors (see PR description) in other sub processes:
        #Clear the resources of mpmgr- Queue
        self.mpmgr.shutdown()
  1. The shutdown logic will wait at most QUEUE_TIMEOUT = 15 sec before it is able to exit from the task loop.
        while not self.task_stopping_event.is_set():
            try:
                msg = self.myQ.get(timeout=QUEUE_TIMEOUT)
                event = msg["unit"]

The queue self.myQ is blocked for QUEUE_TIMEOUT and we cannot detect the self.task_stopping_event.

  1. Joining the process will exit due to timeout since QUEUE_TIMEOUT > self._stop_timeout_secs.
        # Wait for the process to exit
        self._task_process.join(self._stop_timeout_secs)
  1. When attempting to kill the process:
        # If the process didn't exit, attempt to kill it
        if self._task_process.is_alive():
            logger.log_notice("Attempting to kill sysmon main process with pid {}".format(self._task_process.pid))
            os.kill(self._task_process.pid, signal.SIGKILL)

        if self._task_process.is_alive():
            logger.log_error("Sysmon main process with pid {} could not be killed".format(self._task_process.pid))
            return False

The os.kill will make a zombie process since it hasn't been awaited yet. The is_alive() then returns True. The fix could be to use self._task_process.kill() instead.

  1. The task monitor system bus could not be stopped:
        monitor_system_bus.task_stop()

since it is blocked in MonitorSystemBusTask in loop.run():

        loop = GLib.MainLoop()
        loop.run()

Command line to reproduce:

sudo systemctl start system-health ; sleep 10; echo "$(date): Stopping..."; sudo systemctl stop system-health; echo "$(date): Stopped"

stepanblyschak avatar Oct 16 '24 14:10 stepanblyschak

@prgeor Please review

stepanblyschak avatar Oct 17 '24 10:10 stepanblyschak

/azpw run

stepanblyschak avatar Oct 30 '24 07:10 stepanblyschak

/AzurePipelines run

mssonicbld avatar Oct 30 '24 07:10 mssonicbld

Azure Pipelines successfully started running 1 pipeline(s).

azure-pipelines[bot] avatar Oct 30 '24 07:10 azure-pipelines[bot]

Seem unrelated tests failed, restarting:

FAIL: pceplib/test/pcep_msg_tests_valgrind.sh
FAIL: pceplib/test/pcep_pcc_api_tests_valgrind.sh
FAIL: pceplib/test/pcep_session_logic_tests_valgrind.sh
FAIL: pceplib/test/pcep_socket_comm_tests_valgrind.sh
FAIL: pceplib/test/pcep_timers_tests_valgrind.sh
FAIL: pceplib/test/pcep_utils_tests_valgrind.sh

stepanblyschak avatar Nov 05 '24 16:11 stepanblyschak

/azpw run Azure.sonic-buildimage

stepanblyschak avatar Nov 05 '24 16:11 stepanblyschak

/AzurePipelines run Azure.sonic-buildimage

mssonicbld avatar Nov 05 '24 16:11 mssonicbld

Azure Pipelines successfully started running 1 pipeline(s).

azure-pipelines[bot] avatar Nov 05 '24 16:11 azure-pipelines[bot]

Broadcom build failed, restarting:

Step 22/32 : COPY ["files/dsserve", "/usr/bin/"]
COPY failed: file not found in build context or excluded by .dockerignore: stat files/dsserve: file does not exist
[  FAIL LOG END  ] [ target/docker-gbsyncd-broncos.gz ]

stepanblyschak avatar Nov 10 '24 09:11 stepanblyschak

/azpw run Azure.sonic-buildimage

stepanblyschak avatar Nov 10 '24 09:11 stepanblyschak

/AzurePipelines run Azure.sonic-buildimage

mssonicbld avatar Nov 10 '24 09:11 mssonicbld

Azure Pipelines successfully started running 1 pipeline(s).

azure-pipelines[bot] avatar Nov 10 '24 09:11 azure-pipelines[bot]

/azpw run Azure.sonic-buildimage

stepanblyschak avatar Nov 18 '24 12:11 stepanblyschak

/AzurePipelines run Azure.sonic-buildimage

mssonicbld avatar Nov 18 '24 12:11 mssonicbld

Azure Pipelines successfully started running 1 pipeline(s).

azure-pipelines[bot] avatar Nov 18 '24 12:11 azure-pipelines[bot]

Seems like a network issue:

curl: (22) The requested URL returned error: 404
Try 5: /usr/bin/curl --retry-connrefused --retry 5 failed to get: -f
dget: curl ipmitool_1.8.19-4+deb12u1.debian.tar.xz https://deb.debian.org/debian/pool/main/i/ipmitool/ipmitool_1.8.19-4+deb12u1.debian.tar.xz failed
ipmitool_1.8.19-4+deb12u1.dsc:
      Good signature found
   validating ipmitool_1.8.19.orig.tar.gz
   skipping  ipmitool_1.8.19-4+deb12u1.debian.tar.xz (not present)
All files validated successfully.
dpkg-source: error: cannot fstat file ./ipmitool_1.8.19-4+deb12u1.debian.tar.xz: No such file or directory
make[1]: *** [Makefile:9: /sonic/target/debs/bookworm/ipmitool_1.8.19-4+deb12u1_amd64.deb] Error 25

stepanblyschak avatar Nov 25 '24 09:11 stepanblyschak

/azpw run Azure.sonic-buildimage

stepanblyschak avatar Nov 25 '24 09:11 stepanblyschak