[System ready][ZTP] If system health service starts after ZTP exits in disabled state, sysready status is shown as down.
Description
When ZTP is disabled and the system-health service starts after sonic-ztp exits, the system is shown as not ready, with ZTP reported as down.
If system-health starts before ZTP exits, this issue is not seen.
Good state
Apr 27 03:29:21.780881 r-lionfish-16 NOTICE healthd[8081]: Starting up...
Apr 27 03:29:29.479140 sonic INFO sonic-ztp[9050]: ZTP is administratively disabled.
Apr 27 03:30:53.115904 sonic NOTICE healthd: System is ready
redis-cli -n 6 hgetall "ALL_SERVICE_STATUS|ztp"
1) "app_ready_status"
2) "OK"
3) "fail_reason"
4) "-"
5) "service_status"
6) "OK"
7) "update_time"
8) "-"
redis-cli -n 6 hgetall "SYSTEM_READY|SYSTEM_STATE"
1) "Status"
2) "UP"
Issue state
Apr 26 01:25:34.208843 r-tigon-17 INFO sonic-ztp[8798]: ZTP is administratively disabled.
Apr 26 01:25:34.229295 r-tigon-17 NOTICE healthd[9964]: Starting up...
"ALL_SERVICE_STATUS|ztp": {
"expireat": 1714084454.4621081,
"ttl": -0.001,
"type": "hash",
"value": {
"app_ready_status": "Down",
"fail_reason": "Inactive",
"service_status": "Down",
"update_time": "-"
}
},
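The ordering difference between the two log excerpts above can be checked mechanically. The following is a minimal sketch, not part of healthd or sonic-ztp: the function names are hypothetical, the match strings are taken from the log lines quoted in this issue, and the year is an assumption because syslog lines in this format do not carry one.

```python
from datetime import datetime


def parse_ts(line, year=2024):
    """Parse the leading 'Mon DD HH:MM:SS.ffffff' stamp; the year is assumed."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S.%f")


def healthd_started_after_ztp_exit(lines, year=2024):
    """Return True/False for the ordering, or None if either event is missing."""
    ztp_exit = None
    healthd_start = None
    for line in lines:
        if "sonic-ztp" in line and "ZTP is administratively disabled" in line:
            ztp_exit = parse_ts(line, year)
        elif "healthd" in line and "Starting up" in line:
            healthd_start = parse_ts(line, year)
    if ztp_exit is None or healthd_start is None:
        return None
    return healthd_start > ztp_exit


good_state = [
    "Apr 27 03:29:21.780881 r-lionfish-16 NOTICE healthd[8081]: Starting up...",
    "Apr 27 03:29:29.479140 sonic INFO sonic-ztp[9050]: ZTP is administratively disabled.",
]
issue_state = [
    "Apr 26 01:25:34.208843 r-tigon-17 INFO sonic-ztp[8798]: ZTP is administratively disabled.",
    "Apr 26 01:25:34.229295 r-tigon-17 NOTICE healthd[9964]: Starting up...",
]
```

Calling `healthd_started_after_ztp_exit` on `issue_state` flags the ordering that leads to the `Down` entry, while `good_state` does not.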
In both scenarios ZTP is disabled
root@r-lionfish-16:~# show ztp status
ZTP Admin Mode : False
ZTP Service : Inactive
ZTP Status : Not Started
ZTP Service is not running
root@r-lionfish-16:~#
root@r-lionfish-16:~# service ztp status
● ztp.service - SONiC Zero Touch Provisioning service
Loaded: loaded (/lib/systemd/system/ztp.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Sat 2024-04-27 03:29:29 IDT; 31min ago
Main PID: 9049 (code=exited, status=0/SUCCESS)
Apr 27 03:30:47 r-lionfish-16 systemd[1]: /lib/systemd/system/ztp.service:9: Standard output type syslog is obsolete, automatically updating to journal. Ple>
Apr 27 03:30:47 r-lionfish-16 systemd[1]: /lib/systemd/system/ztp.service:10: Standard output type syslog+console is obsolete, automatically updating to jou>
Apr 27 03:30:48 r-lionfish-16 systemd[1]: /lib/systemd/system/ztp.service:9: Standard output type syslog is obsolete, automatically updating to journal. Ple>
Apr 27 03:30:48 r-lionfish-16 systemd[1]: /lib/systemd/system/ztp.service:10: Standard output type syslog+console is obsolete, automatically updating to jou>
Apr 27 03:30:48 r-lionfish-16 systemd[1]: /lib/systemd/system/ztp.service:9: Standard output type syslog is obsolete, automatically updating to journal. Ple>
Apr 27 03:30:48 r-lionfish-16 systemd[1]: /lib/systemd/system/ztp.service:10: Standard output type syslog+console is obsolete, automatically updating to jou>
Apr 27 03:30:51 r-lionfish-16 systemd[1]: /lib/systemd/system/ztp.service:9: Standard output type syslog is obsolete, automatically updating to journal. Ple>
Apr 27 03:30:51 r-lionfish-16 systemd[1]: /lib/systemd/system/ztp.service:10: Standard output type syslog+console is obsolete, automatically updating to jou>
Apr 27 03:30:51 r-lionfish-16 systemd[1]: /lib/systemd/system/ztp.service:9: Standard output type syslog is obsolete, automatically updating to journal. Ple>
Apr 27 03:30:51 r-lionfish-16 systemd[1]: /lib/systemd/system/ztp.service:10: Standard output type syslog+console is obsolete, automatically updating to jou>
The issue is easy to reproduce even when ZTP starts after healthd: restarting the system-health service puts the system into the problem state.
root@r-lionfish-16:~# show system-health sysready-status
System is ready
Service-Name Service-Status App-Ready-Status Down-Reason
---------------------- ---------------- ------------------ -------------
auditd OK OK -
bgp OK OK -
caclmgrd OK OK -
config-chassisdb OK OK -
config-setup OK OK -
containerd OK OK -
cron OK OK -
database OK OK -
determine-reboot-cause OK OK -
docker OK OK -
eventd OK OK -
gnmi OK OK -
hw-management OK OK -
hw-management-tc OK OK -
kdump-tools OK OK -
lldp OK OK -
lm-sensors OK OK -
mgmt-framework OK OK -
netfilter-persistent OK OK -
ntp OK OK -
nv-syncd-shared OK OK -
pmon OK OK -
procdockerstatsd OK OK -
radv OK OK -
ras-mc-ctl OK OK -
rsyslog OK OK -
smartmontools OK OK -
snmp OK OK -
ssh OK OK -
swss OK OK -
syncd OK OK -
sysstat OK OK -
teamd OK OK -
what-just-happened OK OK -
ztp OK OK -
root@r-lionfish-16:~#
root@r-lionfish-16:~#
root@r-lionfish-16:~# service system-health restart
root@r-lionfish-16:~#
root@r-lionfish-16:~# show system-health sysready-status
System is not ready - one or more services are not up
Service-Name Service-Status App-Ready-Status Down-Reason
---------------------- ---------------- ------------------ -------------
auditd OK OK -
bgp OK OK -
caclmgrd OK OK -
config-chassisdb OK OK -
config-setup OK OK -
containerd OK OK -
cron OK OK -
database OK OK -
determine-reboot-cause OK OK -
docker OK OK -
eventd OK OK -
gnmi OK OK -
hw-management OK OK -
hw-management-tc OK OK -
kdump-tools OK OK -
lldp OK OK -
lm-sensors OK OK -
mgmt-framework OK OK -
netfilter-persistent OK OK -
ntp OK OK -
nv-syncd-shared OK OK -
pmon OK OK -
procdockerstatsd OK OK -
radv OK OK -
ras-mc-ctl OK OK -
rsyslog OK OK -
smartmontools OK OK -
snmp OK OK -
ssh OK OK -
swss OK OK -
syncd OK OK -
sysstat OK OK -
teamd OK OK -
what-just-happened OK OK -
ztp Down Down Inactive
Steps to reproduce the issue:
- Disable ZTP
- Reboot system
- Restart system health service
Describe the results you received:
System is shown as not ready
Describe the results you expected:
System should be in the ready state, since ZTP is administratively disabled.
Output of show version:
show version
SONiC Software Version: SONiC.202311_RC.39-c50d88168_Internal_ASAN
SONiC OS Version: 11
Distribution: Debian 11.9
Kernel: 5.10.0-23-2-amd64
Build commit: 1c7a9fb01
Build date: Fri Apr 26 05:36:05 UTC 2024
Built by: sw-r2d2-bot@r-build-sonic-ci03-244
Platform: x86_64-mlnx_msn3420-r0
HwSKU: ACS-MSN3420
ASIC: mellanox
ASIC Count: 1
Serial Number: MT2019X13878
Model Number: MSN3420-CB2FO
Hardware Revision: A1
Uptime: 04:03:48 up 34 min, 1 user, load average: 0.41, 0.47, 0.46
Date: Sat 27 Apr 2024 04:03:48
Docker images:
REPOSITORY TAG IMAGE ID SIZE
docker-orchagent 202311_RC.39-c50d88168_Internal_ASAN ea7c8629834a 552MB
docker-orchagent latest ea7c8629834a 552MB
docker-syncd-mlnx 202311_RC.39-c50d88168_Internal_ASAN f662d69a28a0 867MB
docker-syncd-mlnx latest f662d69a28a0 867MB
docker-teamd 202311_RC.39-c50d88168_Internal_ASAN 3ede630da1bb 389MB
docker-teamd latest 3ede630da1bb 389MB
docker-sflow 202311_RC.39-c50d88168_Internal_ASAN 53d7637749d2 390MB
docker-sflow latest 53d7637749d2 390MB
docker-platform-monitor 202311_RC.39-c50d88168_Internal_ASAN 3a97dbe9c972 821MB
docker-platform-monitor latest 3a97dbe9c972 821MB
docker-fpm-frr 202311_RC.39-c50d88168_Internal_ASAN 7757dd696268 420MB
docker-fpm-frr latest 7757dd696268 420MB
docker-dhcp-relay latest ef76a9aad7cc 324MB
docker-nat 202311_RC.39-c50d88168_Internal_ASAN 250559162cc8 392MB
docker-nat latest 250559162cc8 392MB
docker-snmp 202311_RC.39-c50d88168_Internal_ASAN a279906b3fcb 354MB
docker-snmp latest a279906b3fcb 354MB
docker-macsec latest 5643e32d9756 391MB
docker-eventd 202311_RC.39-c50d88168_Internal_ASAN ee088c601422 315MB
docker-eventd latest ee088c601422 315MB
docker-lldp 202311_RC.39-c50d88168_Internal_ASAN 5a9a70bc2b26 357MB
docker-lldp latest 5a9a70bc2b26 357MB
docker-sonic-gnmi 202311_RC.39-c50d88168_Internal_ASAN 7686e896871c 403MB
docker-sonic-gnmi latest 7686e896871c 403MB
docker-database 202311_RC.39-c50d88168_Internal_ASAN 0a98bb5bc3aa 315MB
docker-database latest 0a98bb5bc3aa 315MB
docker-mux 202311_RC.39-c50d88168_Internal_ASAN 66df1fc03c88 364MB
docker-mux latest 66df1fc03c88 364MB
docker-router-advertiser 202311_RC.39-c50d88168_Internal_ASAN 5700737fe03f 315MB
docker-router-advertiser latest 5700737fe03f 315MB
docker-sonic-mgmt-framework 202311_RC.39-c50d88168_Internal_ASAN 0173e0ad3c90 417MB
docker-sonic-mgmt-framework latest 0173e0ad3c90 417MB
#### Additional information you deem important (e.g. issue happens only occasionally):
@adyeung @sg893052 @rajendra-dendukuri FYI. This issue is blocking in some scenarios because sflow depends on system ready and otherwise waits for 3 minutes (see https://github.com/sonic-net/SONiC/pull/1627). This results in sflow test failures.
@sflow FYI
@dgsudharsan @Junchao-Mellanox We could consider ignoring the ztp service for system ready. Sysmonitor already has logic to skip the services listed in the "services_to_ignore" field of the platform-specific system_health configuration file:
/usr/share/sonic/device/{platform_name}/system_health_monitoring_config.json
{
    "services_to_ignore": ["ztp.service"],
    "devices_to_ignore": [],
    "user_defined_checkers": [],
    "polling_interval": 60,
    "led_color": {
        "fault": "amber",
        "normal": "green",
        "booting": "orange_blink"
    }
}
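As a rough illustration of what a "services_to_ignore" list does, the sketch below filters a list of unit names against the configuration shown above. It is not Sysmonitor's code: the function name `services_to_check` is hypothetical, and matching on the full unit name (with the `.service` suffix) is an assumption based only on the config entry.

```python
import json

# Same content as the configuration file quoted above.
CONFIG_TEXT = """
{
    "services_to_ignore": ["ztp.service"],
    "devices_to_ignore": [],
    "user_defined_checkers": [],
    "polling_interval": 60,
    "led_color": {
        "fault": "amber",
        "normal": "green",
        "booting": "orange_blink"
    }
}
"""


def services_to_check(all_services, config_text):
    """Drop every unit listed under 'services_to_ignore' (hypothetical helper)."""
    config = json.loads(config_text)
    ignored = set(config.get("services_to_ignore", []))
    return [name for name in all_services if name not in ignored]


units = ["swss.service", "ztp.service", "bgp.service"]
remaining = services_to_check(units, CONFIG_TEXT)
```

With this config, `ztp.service` is dropped and the other units are still evaluated. Because the file lives under the platform directory, the entry would have to be repeated for every platform.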
@sg893052 This is not a platform-specific issue; it would occur on any platform, since ZTP is a common service. I prefer not to add this to the platform directory. It needs to be handled in the health monitor. For the feature table, we check whether a feature is enabled or disabled and only consider enabled features for system monitoring. The same should be done for ZTP through special handling.
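To make the suggested special handling concrete, here is a minimal sketch of the decision only. It is not the health monitor's code and not taken from any PR: `is_blocking` and `system_ready` are hypothetical names, and how the monitor would learn the ZTP admin mode is left open (it is passed in as a plain boolean here).

```python
def is_blocking(service, status, ztp_admin_enabled):
    """Return True if this service should keep the system in 'not ready'."""
    if status == "OK":
        return False
    # Assumed special case: an administratively disabled ZTP is expected
    # to be inactive, so its Down status is not counted against readiness.
    if service == "ztp" and not ztp_admin_enabled:
        return False
    return True


def system_ready(statuses, ztp_admin_enabled):
    """statuses maps service name to its Service-Status column value."""
    return not any(
        is_blocking(name, status, ztp_admin_enabled)
        for name, status in statuses.items()
    )


after_restart = {"swss": "OK", "syncd": "OK", "ztp": "Down"}
```

With `ztp_admin_enabled=False`, `system_ready(after_restart, False)` treats the system as ready, matching the expected result in this report. A `Down` status on any other service, or on ZTP while it is enabled, still blocks readiness.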
@dgsudharsan @adyeung https://github.com/sonic-net/sonic-buildimage/pull/18911 is the PR raised to address this issue.