[system monitor] ERR healthd: system_servicejoin() argument must be str, bytes, or os.PathLike object, not 'NoneType'
Description
When running config save followed by config reload, the following error is sometimes logged to syslog:
ERR healthd: system_servicejoin() argument must be str, bytes, or os.PathLike object, not 'NoneType'
Steps to reproduce the issue:
- config save
- config reload -y -f
Describe the results you received:
Error in syslog
Describe the results you expected:
No error in syslog
Output of show version:
SONiC Software Version: SONiC.202311_RC.39-c50d88168_Internal
SONiC OS Version: 11
Distribution: Debian 11.9
Kernel: 5.10.0-23-2-amd64
Build commit: c78ff9d63
Build date: Fri Apr 26 05:01:25 UTC 2024
Built by: sw-r2d2-bot@r-build-sonic-ci03-241
Platform: x86_64-nvidia_sn5600_simx-r0
HwSKU: ACS-SN5600
ASIC: mellanox
ASIC Count: 1
Serial Number: MT2315XZ04ZJ
Model Number: 920-9N42F-00RS-5NA
Hardware Revision: A1
Uptime: 03:36:42 up 1:24, 1 user, load average: 1.91, 3.44, 2.24
Date: Mon 29 Apr 2024 03:36:42
Docker images:
REPOSITORY TAG IMAGE ID SIZE
docker-dhcp-relay latest 1a4c76eda529 324MB
docker-platform-monitor 202311_RC.39-c50d88168_Internal 693addbace38 821MB
docker-platform-monitor latest 693addbace38 821MB
docker-macsec latest 07000709328f 344MB
docker-orchagent 202311_RC.39-c50d88168_Internal 278069786798 353MB
docker-orchagent latest 278069786798 353MB
docker-eventd 202311_RC.39-c50d88168_Internal af8d08dce832 315MB
docker-eventd latest af8d08dce832 315MB
docker-snmp 202311_RC.39-c50d88168_Internal 6a51b8d8f606 354MB
docker-snmp latest 6a51b8d8f606 354MB
docker-nat 202311_RC.39-c50d88168_Internal 739b3809fe31 345MB
docker-nat latest 739b3809fe31 345MB
docker-sflow 202311_RC.39-c50d88168_Internal 164f4326030d 343MB
docker-sflow latest 164f4326030d 343MB
docker-fpm-frr 202311_RC.39-c50d88168_Internal 5bd54c2d63e0 373MB
docker-fpm-frr latest 5bd54c2d63e0 373MB
docker-syncd-mlnx 202311_RC.39-c50d88168_Internal 5f8046eaefce 833MB
docker-syncd-mlnx latest 5f8046eaefce 833MB
docker-teamd 202311_RC.39-c50d88168_Internal f4416035b8f5 342MB
docker-teamd latest f4416035b8f5 342MB
docker-sonic-gnmi 202311_RC.39-c50d88168_Internal fe28d796529d 403MB
docker-sonic-gnmi latest fe28d796529d 403MB
docker-mux 202311_RC.39-c50d88168_Internal 8feaaeda5785 364MB
docker-mux latest 8feaaeda5785 364MB
docker-lldp 202311_RC.39-c50d88168_Internal ad04c3d79223 357MB
docker-lldp latest ad04c3d79223 357MB
docker-database 202311_RC.39-c50d88168_Internal fe6fa16c1643 315MB
docker-database latest fe6fa16c1643 315MB
docker-router-advertiser 202311_RC.39-c50d88168_Internal 2c52659a0d45 315MB
docker-router-advertiser latest 2c52659a0d45 315MB
docker-sonic-mgmt-framework 202311_RC.39-c50d88168_Internal a34baf831465 417MB
docker-sonic-mgmt-framework latest a34baf831465 417MB
Output of show techsupport:
(paste your output here or download and attach the file here )
Additional information you deem important (e.g. issue happens only occasionally):
@sg893052 @adyeung FYI
@dgsudharsan @adyeung Found the issue: it is caused by an EOFError raised from queue processing during queue shutdown.
The fix already exists in master: https://github.com/sonic-net/sonic-buildimage/blob/master/src/system-health/health_checker/sysmonitor.py#L485
Please backport it accordingly.
@sg893052 please share the PR in master so we can add the relevant label for the backport.
https://github.com/sonic-net/sonic-buildimage/pull/17459 is the PR in master.
@sg893052 Even with that PR applied, we still see the issue.
@dgsudharsan Please share the techsupport dump and image details.
@sg893052 I found the issue. It is in the underlying infrastructure: the device metadata table is accessed while config reload is in progress. I added a traceback to the log, and this is what it shows:
May 29 00:02:42.517915 r-spider-05 ERR healthd:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/health_checker/sysmonitor.py", line 490, in system_service
    self.check_unit_status(event)
  File "/usr/local/lib/python3.9/dist-packages/health_checker/sysmonitor.py", line 419, in check_unit_status
    full_srv_list = self.get_all_service_list()
  File "/usr/local/lib/python3.9/dist-packages/health_checker/sysmonitor.py", line 153, in get_all_service_list
    self.get_service_from_feature_table(dir_list)
  File "/usr/local/lib/python3.9/dist-packages/health_checker/sysmonitor.py", line 210, in get_service_from_feature_table
    device_config.update(device_info.get_device_runtime_metadata())
  File "/usr/local/lib/python3.9/dist-packages/sonic_py_common/device_info.py", line 618, in get_device_runtime_metadata
    port_metadata = {'ETHERNET_PORTS_PRESENT': True if get_path_to_port_config_file(hwsku=None, asic="0" if is_multi_npu() else None) else False}
  File "/usr/local/lib/python3.9/dist-packages/sonic_py_common/device_info.py", line 415, in get_path_to_port_config_file
    (platform_path, hwsku_path) = get_paths_to_platform_and_hwsku_dirs()
  File "/usr/local/lib/python3.9/dist-packages/sonic_py_common/device_info.py", line 381, in get_paths_to_platform_and_hwsku_dirs
    hwsku_path = os.path.join(platform_path, hwsku)
  File "/usr/lib/python3.9/posixpath.py", line 90, in join
    genericpath._check_arg_types('join', a, *p)
  File "/usr/lib/python3.9/genericpath.py", line 152, in _check_arg_types
    raise TypeError(f'{funcname}() argument must be str, bytes, or '
TypeError: join() argument must be str, bytes, or os.PathLike object, not 'NoneType'
@abdosi There is a race condition in get_device_runtime_metadata when it is called during config reload (https://github.com/sonic-net/sonic-buildimage/pull/11795). While config reload is rewriting config_db, the device_metadata table might not be available, so the hwsku comes back as None and os.path.join raises the traceback above. Can we cache the hwsku or handle this case gracefully?
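To make the two suggestions concrete, here is a rough sketch of what caching the hwsku and failing gracefully could look like. It is only an illustration under stated assumptions, not a proposed patch to device_info.py: `HwskuPathResolver` and the `read_hwsku` callable are hypothetical, with `read_hwsku` standing in for whatever lookup currently reads the hwsku from the device_metadata table and may return None mid-reload.

```python
import os


class HwskuPathResolver:
    """Hypothetical helper: remembers the last known-good hwsku."""

    def __init__(self, read_hwsku):
        # read_hwsku: callable returning the hwsku string, or None when
        # the device_metadata table is not available (assumption).
        self._read_hwsku = read_hwsku
        self._cached_hwsku = None

    def hwsku_path(self, platform_path):
        hwsku = self._read_hwsku()
        if hwsku:
            # Fresh value available: refresh the cache.
            self._cached_hwsku = hwsku
        else:
            # Lookup failed (e.g. mid config reload): fall back to cache.
            hwsku = self._cached_hwsku
        if platform_path is None or hwsku is None:
            # Nothing usable yet; let the caller decide instead of
            # passing None into os.path.join and raising TypeError.
            return None
        return os.path.join(platform_path, hwsku)
```

The trade-off is that a cached hwsku can be stale if the reload actually changes it, so the cache only papers over the window in which the table is empty; callers still need to handle a None return on the very first lookup.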
@abdosi Can you please check and comment on this issue? @qiluo-msft FYI