sonic-buildimage

[system monitor]ERR healthd: system_servicejoin() argument must be str, bytes, or os.PathLike object, not 'NoneType'

Open dgsudharsan opened this issue 1 year ago • 1 comments

Description

When running config save followed by config reload, we sometimes get the following log:

ERR healthd: system_servicejoin() argument must be str, bytes, or os.PathLike object, not 'NoneType'

Steps to reproduce the issue:

  1. config save
  2. config reload -y -f

Describe the results you received:

Error in syslog

Describe the results you expected:

No error in syslog

Output of show version:

SONiC Software Version: SONiC.202311_RC.39-c50d88168_Internal
SONiC OS Version: 11
Distribution: Debian 11.9
Kernel: 5.10.0-23-2-amd64
Build commit: c78ff9d63
Build date: Fri Apr 26 05:01:25 UTC 2024
Built by: sw-r2d2-bot@r-build-sonic-ci03-241

Platform: x86_64-nvidia_sn5600_simx-r0
HwSKU: ACS-SN5600
ASIC: mellanox
ASIC Count: 1
Serial Number: MT2315XZ04ZJ
Model Number: 920-9N42F-00RS-5NA
Hardware Revision: A1
Uptime: 03:36:42 up  1:24,  1 user,  load average: 1.91, 3.44, 2.24
Date: Mon 29 Apr 2024 03:36:42

Docker images:
REPOSITORY                                         TAG                               IMAGE ID       SIZE
docker-dhcp-relay                                  latest                            1a4c76eda529   324MB
docker-platform-monitor                            202311_RC.39-c50d88168_Internal   693addbace38   821MB
docker-platform-monitor                            latest                            693addbace38   821MB
docker-macsec                                      latest                            07000709328f   344MB
docker-orchagent                                   202311_RC.39-c50d88168_Internal   278069786798   353MB
docker-orchagent                                   latest                            278069786798   353MB
docker-eventd                                      202311_RC.39-c50d88168_Internal   af8d08dce832   315MB
docker-eventd                                      latest                            af8d08dce832   315MB
docker-snmp                                        202311_RC.39-c50d88168_Internal   6a51b8d8f606   354MB
docker-snmp                                        latest                            6a51b8d8f606   354MB
docker-nat                                         202311_RC.39-c50d88168_Internal   739b3809fe31   345MB
docker-nat                                         latest                            739b3809fe31   345MB
docker-sflow                                       202311_RC.39-c50d88168_Internal   164f4326030d   343MB
docker-sflow                                       latest                            164f4326030d   343MB
docker-fpm-frr                                     202311_RC.39-c50d88168_Internal   5bd54c2d63e0   373MB
docker-fpm-frr                                     latest                            5bd54c2d63e0   373MB
docker-syncd-mlnx                                  202311_RC.39-c50d88168_Internal   5f8046eaefce   833MB
docker-syncd-mlnx                                  latest                            5f8046eaefce   833MB
docker-teamd                                       202311_RC.39-c50d88168_Internal   f4416035b8f5   342MB
docker-teamd                                       latest                            f4416035b8f5   342MB
docker-sonic-gnmi                                  202311_RC.39-c50d88168_Internal   fe28d796529d   403MB
docker-sonic-gnmi                                  latest                            fe28d796529d   403MB
docker-mux                                         202311_RC.39-c50d88168_Internal   8feaaeda5785   364MB
docker-mux                                         latest                            8feaaeda5785   364MB
docker-lldp                                        202311_RC.39-c50d88168_Internal   ad04c3d79223   357MB
docker-lldp                                        latest                            ad04c3d79223   357MB
docker-database                                    202311_RC.39-c50d88168_Internal   fe6fa16c1643   315MB
docker-database                                    latest                            fe6fa16c1643   315MB
docker-router-advertiser                           202311_RC.39-c50d88168_Internal   2c52659a0d45   315MB
docker-router-advertiser                           latest                            2c52659a0d45   315MB
docker-sonic-mgmt-framework                        202311_RC.39-c50d88168_Internal   a34baf831465   417MB
docker-sonic-mgmt-framework                        latest                            a34baf831465   417MB

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

dgsudharsan avatar Apr 29 '24 16:04 dgsudharsan

@sg893052 @adyeung FYI

dgsudharsan avatar Apr 29 '24 16:04 dgsudharsan

@dgsudharsan @adyeung Found the issue, it is due to EOFError from the queue processing during queue shutdown.

The fix already exists in the master code --> https://github.com/sonic-net/sonic-buildimage/blob/master/src/system-health/health_checker/sysmonitor.py#L485

Please backport it accordingly.
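
For readers following the backport, here is a minimal sketch of the general pattern: a consumer loop that treats EOFError from a closed queue as a normal shutdown rather than an error. It is not the actual change in master, which is not reproduced in this thread. The names process_events, get_event, and handle are hypothetical and do not come from sysmonitor.py.

```python
import queue
from typing import Any, Callable


def process_events(get_event: Callable[[], Any],
                   handle: Callable[[Any], None]) -> str:
    """Consume events until the queue is closed; return why the loop ended.

    get_event is assumed to behave like Queue.get(timeout=...): it may raise
    queue.Empty on timeout, or EOFError once the other end has shut down.
    """
    while True:
        try:
            event = get_event()
        except queue.Empty:
            # Timeout with nothing to process; poll again.
            continue
        except EOFError:
            # The queue was torn down during shutdown; exit quietly
            # instead of logging an error.
            return "queue closed"
        if event is None:
            # Hypothetical sentinel meaning "stop requested".
            return "stop requested"
        handle(event)
```

Passing the getter as a callable keeps the sketch independent of how the real queue is created, so the shutdown path can be exercised without spawning processes.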

sg893052 avatar May 08 '24 13:05 sg893052

@sg893052 please share the PR in master so we can add the relevant label for the backport.

liat-grozovik avatar May 09 '24 05:05 liat-grozovik

https://github.com/sonic-net/sonic-buildimage/pull/17459 is the PR in master.

sg893052 avatar May 09 '24 06:05 sg893052

@sg893052 Even with the PR we see the issue.

dgsudharsan avatar May 13 '24 21:05 dgsudharsan

@dgsudharsan Please share the Techsupport and image details.

sg893052 avatar May 14 '24 04:05 sg893052

@sg893052 I found the issue. It is caused by the underlying infrastructure, which accesses the device metadata table while the config reload is in progress. I added a traceback to the log; the output is below:

May 29 00:02:42.517915 r-spider-05 ERR healthd:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/health_checker/sysmonitor.py", line 490, in system_service
    self.check_unit_status(event)
  File "/usr/local/lib/python3.9/dist-packages/health_checker/sysmonitor.py", line 419, in check_unit_status
    full_srv_list = self.get_all_service_list()
  File "/usr/local/lib/python3.9/dist-packages/health_checker/sysmonitor.py", line 153, in get_all_service_list
    self.get_service_from_feature_table(dir_list)
  File "/usr/local/lib/python3.9/dist-packages/health_checker/sysmonitor.py", line 210, in get_service_from_feature_table
    device_config.update(device_info.get_device_runtime_metadata())
  File "/usr/local/lib/python3.9/dist-packages/sonic_py_common/device_info.py", line 618, in get_device_runtime_metadata
    port_metadata = {'ETHERNET_PORTS_PRESENT': True if get_path_to_port_config_file(hwsku=None, asic="0" if is_multi_npu() else None) else False}
  File "/usr/local/lib/python3.9/dist-packages/sonic_py_common/device_info.py", line 415, in get_path_to_port_config_file
    (platform_path, hwsku_path) = get_paths_to_platform_and_hwsku_dirs()
  File "/usr/local/lib/python3.9/dist-packages/sonic_py_common/device_info.py", line 381, in get_paths_to_platform_and_hwsku_dirs
    hwsku_path = os.path.join(platform_path, hwsku)
  File "/usr/lib/python3.9/posixpath.py", line 90, in join
    genericpath._check_arg_types('join', a, *p)
  File "/usr/lib/python3.9/genericpath.py", line 152, in _check_arg_types
    raise TypeError(f'{funcname}() argument must be str, bytes, or '
TypeError: join() argument must be str, bytes, or os.PathLike object, not 'NoneType'

dgsudharsan avatar May 28 '24 22:05 dgsudharsan

@abdosi There is a race condition in get_device_runtime_metadata when it is called during config reload (see https://github.com/sonic-net/sonic-buildimage/pull/11795). While the config is being written to config_db during the reload, the device_metadata table might not be available, so the hwsku lookup returns None and os.path.join raises the TypeError in the traceback. Can we cache the hwsku or handle this gracefully?
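
To make the two options concrete, here is a rough sketch of what caching the hwsku and guarding the join could look like. It is an illustration only, not a proposed patch to device_info.py: HwskuCache, read_hwsku, and hwsku_path are hypothetical names, and read_hwsku stands in for whatever lookup currently returns None while the table is missing.

```python
import os
from typing import Callable, Optional


class HwskuCache:
    """Remember the last hwsku that was read successfully (hypothetical helper)."""

    def __init__(self, read_hwsku: Callable[[], Optional[str]]) -> None:
        # read_hwsku is assumed to return None while config_db is being reloaded.
        self._read_hwsku = read_hwsku
        self._last_known: Optional[str] = None

    def get(self) -> Optional[str]:
        hwsku = self._read_hwsku()
        if hwsku:
            self._last_known = hwsku
        # Fall back to the cached value when the live lookup yields nothing.
        return hwsku or self._last_known


def hwsku_path(platform_path: Optional[str], cache: HwskuCache) -> Optional[str]:
    """Build the hwsku directory path, or return None instead of raising."""
    hwsku = cache.get()
    if platform_path is None or hwsku is None:
        # os.path.join(platform_path, None) would raise TypeError here.
        return None
    return os.path.join(platform_path, hwsku)
```

With this shape, a caller such as the health checker would have to treat a None path as "metadata not ready yet" and retry later. Whether a stale cached hwsku is acceptable during a reload is a design question for the maintainers.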

dgsudharsan avatar May 28 '24 22:05 dgsudharsan

@abdosi Can you please check and comment on this issue? @qiluo-msft FYI

bingwang-ms avatar Jul 01 '24 20:07 bingwang-ms