sonic-buildimage
hostcfgd race condition with config reload
Description
The issue happens when a feature's docker container is started by systemd while hostcfgd is
in the middle of configuring its desired state.
Steps to reproduce the issue:
- config feature disabled
- config save -y # feature is disabled in config_db.json
- config feature enabled
- config reload -y # load from config_db.json (feature is disabled)
- Observe feature docker is UP but the desired state is disabled according to config DB
Describe the results you received:
Example log for arbitrary service X:
Nov 21 22:24:36.665181 sonic INFO hostcfgd: Running cmd: '['sudo', 'systemctl', 'stop', 'X.service']'
Nov 21 22:24:36.688430 sonic INFO systemd[1]: Stopped X container.
Nov 21 22:24:36.699137 sonic INFO hostcfgd: Running cmd: '['sudo', 'systemctl', 'disable', 'X.service']'
Nov 21 22:24:36.691220 sonic INFO systemd[1]: Starting X service... <===== Start triggered by WantedBy=syncd.service
Nov 21 22:24:36.926058 sonic INFO hostcfgd: Running cmd: '['sudo', 'systemctl', 'mask', 'X.service']'
As a result, container X remains running: it was started by syncd.service, and hostcfgd masked it only after that.
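The interleaving in the log above can be checked mechanically. The following is a minimal Python sketch and is not part of hostcfgd: the helper names (`parse_line`, `starts_during_disable`), the assumed year, and the assumption that the systemd message begins with "Starting <feature> " are all illustrative. It sorts the entries by timestamp, since the syslog lines are not strictly ordered, and reports any systemd start that falls between hostcfgd's stop and mask commands.

```python
import re
from datetime import datetime

# Log lines copied from the example above (arbitrary service X).
SAMPLE_LOG = [
    "Nov 21 22:24:36.665181 sonic INFO hostcfgd: Running cmd: '['sudo', 'systemctl', 'stop', 'X.service']'",
    "Nov 21 22:24:36.688430 sonic INFO systemd[1]: Stopped X container.",
    "Nov 21 22:24:36.699137 sonic INFO hostcfgd: Running cmd: '['sudo', 'systemctl', 'disable', 'X.service']'",
    "Nov 21 22:24:36.691220 sonic INFO systemd[1]: Starting X service...",
    "Nov 21 22:24:36.926058 sonic INFO hostcfgd: Running cmd: '['sudo', 'systemctl', 'mask', 'X.service']'",
]

# "<timestamp> <host> <level> <process>[pid]: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d{2}:\d{2}:\d{2}\.\d+) \S+ \w+ "
    r"(?P<proc>[\w\-]+)(?:\[\d+\])?: (?P<msg>.*)$"
)
# Matches the systemctl action and unit inside hostcfgd's "Running cmd" message.
CMD_RE = re.compile(r"'systemctl', '(?P<action>\w+)', '(?P<unit>[\w@.\-]+)'")


def parse_line(line, year):
    """Return (timestamp, process, message), or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    # Syslog timestamps carry no year, so one is supplied by the caller (assumption).
    ts = datetime.strptime(f"{year} {m.group('ts')}", "%Y %b %d %H:%M:%S.%f")
    return ts, m.group("proc"), m.group("msg")


def starts_during_disable(lines, feature, year=2023):
    """Timestamps of systemd starts seen between hostcfgd's stop and mask."""
    events = sorted(e for e in (parse_line(l, year) for l in lines) if e)
    stop_ts = mask_ts = None
    starts = []
    for ts, proc, msg in events:
        cmd = CMD_RE.search(msg) if proc == "hostcfgd" else None
        if cmd and cmd.group("unit") == f"{feature}.service":
            if cmd.group("action") == "stop" and stop_ts is None:
                stop_ts = ts
            elif cmd.group("action") == "mask":
                mask_ts = ts
        elif proc == "systemd" and msg.startswith(f"Starting {feature} "):
            starts.append(ts)
    if stop_ts is None or mask_ts is None:
        return []
    return [ts for ts in starts if stop_ts < ts < mask_ts]


if __name__ == "__main__":
    for ts in starts_during_disable(SAMPLE_LOG, "X"):
        print("start raced with disable at", ts.time())
```

A non-empty result means the unit was started after hostcfgd stopped it but before the mask took effect, which is the window described in this issue.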
Describe the results you expected:
Feature container does not start.
Ideally, we'd like to see the following boot/config reload flow:
- Configure desired states of services
- Start sonic.target
Therefore, we could eliminate the need for the systemd-sonic-generator and mask_disabled_services.py scripts that configure initial service states.
We need to consider all flows, including upgrade and first boot. Ideally, with this approach, service state is synced at a very early stage in the boot.
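To make the proposed ordering concrete, here is a minimal Python sketch of it. It is only an illustration of "configure desired states first, then start sonic.target", not an existing SONiC component: the function name `plan_boot_commands`, the shape of the feature table passed in, and the set of state values treated as disabled are assumptions. The function only builds the command list and does not execute anything.

```python
# State values treated as "do not start" (assumed for illustration).
DISABLED_STATES = {"disabled", "always_disabled"}


def plan_boot_commands(feature_table, target="sonic.target"):
    """Build systemctl commands that apply desired states before starting the target.

    feature_table maps a feature name to its attributes, for example
    {"X": {"state": "disabled"}} (hypothetical shape for this sketch).
    """
    commands = []
    for name, attrs in sorted(feature_table.items()):
        unit = f"{name}.service"
        if attrs.get("state") in DISABLED_STATES:
            # Mask first, so no WantedBy= dependency can pull the unit in later.
            commands.append(["systemctl", "mask", unit])
        else:
            commands.append(["systemctl", "unmask", unit])
    # The target is started only after every unit is in its desired state.
    commands.append(["systemctl", "start", target])
    return commands


if __name__ == "__main__":
    table = {"X": {"state": "disabled"}, "swss": {"state": "enabled"}}
    for cmd in plan_boot_commands(table):
        print(" ".join(cmd))
```

Because every mask is issued before the target is started, a disabled unit cannot be started through a dependency in the window that the current flow leaves open.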
Output of show version:
(paste your output here)
Output of show techsupport:
SONiC Software Version: SONiC.202305_RC.36-4e4396e96_Internal
SONiC OS Version: 11
Distribution: Debian 11.8
Kernel: 5.10.0-23-2-amd64
Build commit: 4e4396e96
Build date: Sun Nov 26 09:28:13 UTC 2023
Built by: sw-r2d2-bot@r-build-sonic-ci03-244
Platform: x86_64-mlnx_msn2700-r0
HwSKU: Mellanox-SN2700-D48C8
ASIC: mellanox
ASIC Count: 1
Serial Number: MT1822K07815
Model Number: MSN2700-CS2FO
Hardware Revision: A1
Uptime: 16:52:44 up 1:49, 1 user, load average: 0.57, 0.70, 0.87
Date: Mon 27 Nov 2023 16:52:44
Docker images:
REPOSITORY TAG IMAGE ID SIZE
docker-syncd-mlnx 202305_RC.36-4e4396e96_Internal 5fa17071be2a 836MB
docker-syncd-mlnx latest 5fa17071be2a 836MB
docker-platform-monitor 202305_RC.36-4e4396e96_Internal 6bd3faaaaf54 827MB
docker-platform-monitor latest 6bd3faaaaf54 827MB
urm.nvidia.com/sw-nbu-sws-sonic-docker/sonic-wjh 1.6.0-202305-12 3b67dd4aebad 433MB
docker-orchagent 202305_RC.36-4e4396e96_Internal dc8a72449afd 328MB
docker-orchagent latest dc8a72449afd 328MB
docker-fpm-frr 202305_RC.36-4e4396e96_Internal e028f1635caa 348MB
docker-fpm-frr latest e028f1635caa 348MB
docker-nat 202305_RC.36-4e4396e96_Internal d26fe14af4fb 320MB
docker-nat latest d26fe14af4fb 320MB
docker-sflow 202305_RC.36-4e4396e96_Internal 469d8a988bab 318MB
docker-sflow latest 469d8a988bab 318MB
docker-teamd 202305_RC.36-4e4396e96_Internal cd8e61bdb85f 317MB
docker-teamd latest cd8e61bdb85f 317MB
docker-macsec 202305_RC.35-4e4396e96_Internal 4c3075927439 319MB
docker-dhcp-relay 202305_RC.35-4e4396e96_Internal 2a276664f14d 307MB
docker-eventd 202305_RC.36-4e4396e96_Internal 1a925ba903eb 299MB
docker-eventd latest 1a925ba903eb 299MB
docker-sonic-telemetry 202305_RC.36-4e4396e96_Internal b9abaa617279 386MB
docker-sonic-telemetry latest b9abaa617279 386MB
docker-snmp 202305_RC.36-4e4396e96_Internal db8e6dcbb985 338MB
docker-snmp latest db8e6dcbb985 338MB
docker-lldp 202305_RC.36-4e4396e96_Internal 7147b2ceb97f 341MB
docker-lldp latest 7147b2ceb97f 341MB
docker-mux 202305_RC.36-4e4396e96_Internal a64edb0e0ecf 348MB
docker-mux latest a64edb0e0ecf 348MB
docker-router-advertiser 202305_RC.36-4e4396e96_Internal 01f823df9295 299MB
docker-router-advertiser latest 01f823df9295 299MB
docker-database 202305_RC.36-4e4396e96_Internal e7ab4d434eff 299MB
docker-database latest e7ab4d434eff 299MB
docker-sonic-mgmt-framework 202305_RC.36-4e4396e96_Internal 9f630d481095 415MB
docker-sonic-mgmt-framework latest 9f630d481095 415MB
Additional information you deem important (e.g. issue happens only occasionally):
@qiluo-msft Can you please help take a look? Thanks!
@stepanblyschak is this kind of change in the sonic design in 202305?
@liat-grozovik
Are you talking about this idea?
Ideally, we'd like to see the following boot/config reload flow:
- Configure desired states of services
- Start sonic.target
I'd say it is a rather big change that requires a small design, rather than a simple bug fix. However, per my understanding, it would let us solve a couple of issues at once.
Could you give the detailed command lines used in the steps "config feature disabled" and "config feature enabled"? Is this issue a regression or a day one issue?
@qiluo-msft I think it is a day one issue. The commands are regular sonic commands "config feature state disabled" and "config feature state enabled". Are you asking which feature is affected?
@dgsudharsan @vivekrnv Are you able to help resolve this issue?
Hi @qiluo-msft, I don't think it is trivial. It needs a discussion in a subgroup to understand how we can address this.
@prsunny will check on what subgroup meeting we can raise this issue
@prsunny Any update on which subgroup to discuss this issue?
The group name is sonic-common-infra https://lists.sonicfoundation.dev/g/sonic-common-infra . @arlakshm FYI.
Following the workgroup discussion, @arlakshm, is there a community owner who is taking it?