sonic-buildimage
Mid-December changes to SWSS made SONiC on Dell N3248TE-ON unusable
I've been running a custom build (with some of my own changes) of the master branch from December 11 on my Dell N3248TE-ON for months, because any attempt to build from a commit later than around that date results in the docker-orchagent container periodically dying and taking everything else down with it.
I had assumed it was just because I was trying to be on the bleeding edge and figured it would be resolved eventually. Today I decided to roll back to the "current" release of 202311 from https://sonic.software with a clean configuration so that I could focus on some IPv6 work and not worry about my platform. But the problematic changes to swss seem to have been merged into that branch, and I'm seeing the same behavior as when I build on master.
Here is a log of how the swss container fails:
...
2024-03-21 16:21:12,727 INFO success: gearsyncd entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2024-03-21 16:21:12,837 INFO exited: gearsyncd (exit status 0; expected)
2024-03-21 16:21:13,462 INFO success: portsyncd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-03-21 16:21:13,604 INFO spawned: 'orchagent' with pid 53
2024-03-21 16:21:14,817 INFO success: orchagent entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-03-21 16:21:16,140 INFO spawned: 'swssconfig' with pid 78
2024-03-21 16:21:16,149 INFO success: swssconfig entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2024-03-21 16:21:16,195 INFO spawned: 'coppmgrd' with pid 79
2024-03-21 16:21:16,233 INFO success: coppmgrd entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2024-03-21 16:21:21,020 INFO exited: swssconfig (exit status 0; expected)
2024-03-21 16:21:22,458 INFO spawned: 'restore_neighbors' with pid 110
2024-03-21 16:21:22,476 INFO success: restore_neighbors entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2024-03-21 16:21:22,544 INFO spawned: 'arp_update' with pid 111
2024-03-21 16:21:22,662 INFO spawned: 'neighsyncd' with pid 112
2024-03-21 16:21:22,841 INFO spawned: 'wait_for_link' with pid 115
2024-03-21 16:21:22,866 INFO success: wait_for_link entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2024-03-21 16:21:23,012 INFO spawned: 'vlanmgrd' with pid 116
2024-03-21 16:21:23,297 INFO spawned: 'intfmgrd' with pid 120
2024-03-21 16:21:23,497 INFO spawned: 'portmgrd' with pid 121
2024-03-21 16:21:23,565 INFO success: arp_update entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-03-21 16:21:23,626 INFO spawned: 'buffermgrd' with pid 124
2024-03-21 16:21:23,662 INFO success: neighsyncd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-03-21 16:21:23,774 INFO spawned: 'enable_counters' with pid 127
2024-03-21 16:21:23,870 INFO spawned: 'vrfmgrd' with pid 137
2024-03-21 16:21:23,920 INFO spawned: 'nbrmgrd' with pid 139
2024-03-21 16:21:24,033 INFO success: vlanmgrd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-03-21 16:21:24,135 INFO spawned: 'vxlanmgrd' with pid 146
2024-03-21 16:21:24,243 INFO spawned: 'fdbsyncd' with pid 149
2024-03-21 16:21:24,272 INFO success: intfmgrd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-03-21 16:21:24,410 INFO spawned: 'tunnelmgrd' with pid 155
2024-03-21 16:21:24,474 INFO success: portmgrd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-03-21 16:21:24,601 INFO success: buffermgrd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-03-21 16:21:24,763 INFO success: enable_counters entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-03-21 16:21:24,845 INFO success: vrfmgrd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-03-21 16:21:24,919 INFO success: nbrmgrd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-03-21 16:21:25,077 INFO success: vxlanmgrd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-03-21 16:21:25,221 INFO success: fdbsyncd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-03-21 16:21:25,350 INFO success: tunnelmgrd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-03-21 16:21:25,593 INFO exited: wait_for_link (exit status 0; expected)
2024-03-21 16:21:26,517 INFO spawned: 'wait_for_link' with pid 264
2024-03-21 16:21:26,536 INFO success: wait_for_link entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2024-03-21 16:21:28,007 INFO exited: wait_for_link (exit status 0; expected)
2024-03-21 16:21:28,727 INFO spawned: 'wait_for_link' with pid 276
2024-03-21 16:21:28,734 INFO success: wait_for_link entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2024-03-21 16:21:29,667 INFO exited: wait_for_link (exit status 0; expected)
2024-03-21 16:21:36,909 INFO exited: restore_neighbors (exit status 0; expected)
2024-03-21 16:21:41,009 INFO spawned: 'ndppd' with pid 390
2024-03-21 16:21:42,013 INFO success: ndppd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-03-21 16:21:43,529 INFO exited: dependent-startup (exit status 0; expected)
2024-03-21 16:24:24,597 INFO exited: enable_counters (exit status 0; expected)
2024-03-21 16:25:18,814 INFO exited: orchagent (terminated by SIGABRT (core dumped); not expected)
2024-03-21 16:25:19,825 WARN received SIGTERM indicating exit request
2024-03-21 16:25:19,826 INFO waiting for supervisor-proc-exit-listener, rsyslogd, portsyncd, coppmgrd, arp_update, ndppd, neighsyncd, vlanmgrd, intfmgrd, portmgrd, buffermgrd, vrfmgrd, nbrmgrd, vxlanmgrd, fdbsyncd, tunnelmgrd to die
2024-03-21 16:25:19,831 INFO stopped: tunnelmgrd (terminated by SIGTERM)
2024-03-21 16:25:19,839 INFO stopped: fdbsyncd (terminated by SIGTERM)
2024-03-21 16:25:19,845 INFO stopped: vxlanmgrd (terminated by SIGTERM)
2024-03-21 16:25:19,850 INFO stopped: nbrmgrd (terminated by SIGTERM)
2024-03-21 16:25:19,854 INFO stopped: vrfmgrd (terminated by SIGTERM)
2024-03-21 16:25:20,859 INFO stopped: buffermgrd (terminated by SIGTERM)
2024-03-21 16:25:20,862 INFO stopped: portmgrd (terminated by SIGTERM)
2024-03-21 16:25:20,866 INFO stopped: intfmgrd (terminated by SIGTERM)
2024-03-21 16:25:20,868 INFO stopped: vlanmgrd (terminated by SIGTERM)
2024-03-21 16:25:21,872 INFO stopped: neighsyncd (terminated by SIGTERM)
2024-03-21 16:25:21,876 INFO stopped: ndppd (exit status 0)
2024-03-21 16:25:21,878 INFO stopped: arp_update (terminated by SIGTERM)
2024-03-21 16:25:21,882 INFO stopped: coppmgrd (terminated by SIGTERM)
2024-03-21 16:25:22,884 INFO waiting for supervisor-proc-exit-listener, rsyslogd, portsyncd to die
2024-03-21 16:25:23,888 INFO stopped: portsyncd (terminated by SIGTERM)
2024-03-21 16:25:25,896 INFO waiting for supervisor-proc-exit-listener, rsyslogd to die
2024-03-21 16:25:29,455 INFO waiting for supervisor-proc-exit-listener, rsyslogd to die
2024-03-21 16:25:33,046 INFO waiting for supervisor-proc-exit-listener, rsyslogd to die
2024-03-21 16:25:35,044 WARN killing 'rsyslogd' (41) with SIGKILL
2024-03-21 16:25:35,046 INFO stopped: rsyslogd (terminated by SIGKILL)
2024-03-21 16:25:35,051 INFO stopped: supervisor-proc-exit-listener (terminated by SIGTERM)
admin@sonic:~$ Shared connection to 10.100.0.2 closed.
From there, the system tries to restart everything, but the whole thing just cycles from failure to failure. Note that this starts a few minutes after the system has come up and is successfully passing traffic.
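In case it helps anyone reproducing this, here is roughly how I watch the restart loop and collect evidence after the crash. This is only a sketch and assumes a default SONiC layout where core files are written under /var/core; adjust paths if yours differ.
# Watch the swss container cycle; the restart count keeps climbing
docker ps -a --format '{{.Names}}\t{{.Status}}'
# Pull the orchagent abort messages out of syslog
sudo grep -i orchagent /var/log/syslog | tail -n 50
# Core files from the SIGABRT, if core collection is enabled
ls -lh /var/core/
# Full support bundle to attach to this issue
show techsupport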
This is the version I'm running:
admin@sonic:~$ sudo show ver
SONiC Software Version: SONiC.202311.503318-8e0ce727a
SONiC OS Version: 11
Distribution: Debian 11.9
Kernel: 5.10.0-23-2-amd64
Build commit: 8e0ce727a
Build date: Wed Mar 20 12:43:22 UTC 2024
Built by: AzDevOps@vmss-soni003B4G
Platform: x86_64-dellemc_n3248te_c3338-r0
HwSKU: DellEMC-N3248TE
ASIC: broadcom
ASIC Count: 1
Serial Number: 4GNXV43
Model Number: 0WNWT9
Hardware Revision:
Uptime: 16:31:55 up 11 min, 2 users, load average: 2.48, 2.54, 1.62
Date: Thu 21 Mar 2024 16:31:55
Docker images:
REPOSITORY TAG IMAGE ID SIZE
docker-gbsyncd-broncos 202311.503318-8e0ce727a 6e24b2fbe0aa 351MB
docker-gbsyncd-broncos latest 6e24b2fbe0aa 351MB
docker-gbsyncd-credo 202311.503318-8e0ce727a 509554408761 324MB
docker-gbsyncd-credo latest 509554408761 324MB
docker-syncd-brcm 202311.503318-8e0ce727a 4d41bc2ec83a 715MB
docker-syncd-brcm latest 4d41bc2ec83a 715MB
docker-orchagent 202311.503318-8e0ce727a 10581fe64884 339MB
docker-orchagent latest 10581fe64884 339MB
docker-fpm-frr 202311.503318-8e0ce727a 5dec19056997 359MB
docker-fpm-frr latest 5dec19056997 359MB
docker-nat 202311.503318-8e0ce727a 121fbe0018fc 330MB
docker-nat latest 121fbe0018fc 330MB
docker-sflow 202311.503318-8e0ce727a 81ed6c583e1c 329MB
docker-sflow latest 81ed6c583e1c 329MB
docker-teamd 202311.503318-8e0ce727a b4c29deb0605 327MB
docker-teamd latest b4c29deb0605 327MB
docker-macsec latest a901182c73ab 329MB
docker-platform-monitor 202311.503318-8e0ce727a af9df86136ea 421MB
docker-platform-monitor latest af9df86136ea 421MB
docker-snmp 202311.503318-8e0ce727a 83fc30be02a0 340MB
docker-snmp latest 83fc30be02a0 340MB
docker-dhcp-relay latest 9ef4cb1ab6d8 310MB
docker-eventd 202311.503318-8e0ce727a ba53a6c2a513 301MB
docker-eventd latest ba53a6c2a513 301MB
docker-mux 202311.503318-8e0ce727a 5188a2c9e521 349MB
docker-mux latest 5188a2c9e521 349MB
docker-lldp 202311.503318-8e0ce727a dfd9b9b2bfd2 343MB
docker-lldp latest dfd9b9b2bfd2 343MB
docker-sonic-gnmi 202311.503318-8e0ce727a b1df84b4cefb 389MB
docker-sonic-gnmi latest b1df84b4cefb 389MB
docker-database 202311.503318-8e0ce727a 7055e54e5f0d 301MB
docker-database latest 7055e54e5f0d 301MB
docker-router-advertiser 202311.503318-8e0ce727a ee1417459cab 301MB
docker-router-advertiser latest ee1417459cab 301MB
docker-sonic-mgmt-framework 202311.503318-8e0ce727a 258548849abd 416MB
docker-sonic-mgmt-framework latest 258548849abd 416MB
@jeff-yin This might need attention from Dell. I saw a comment on another issue where the symptom is similar (multi-container failure on Dell Broadcom (Trident) units), although the direct cause may be different (I'm not using subinterfaces).
https://github.com/sonic-net/sonic-buildimage/issues/18237#issuecomment-1998700675
@vpsubramaniam please take a look and self-assign this issue.
@dgsudharsan would you be able to work with @prsunny to ensure swss does not crash on a supported SAI call?
I still have this image installed as a secondary on my switch and can gather more logs if you need. This was installed from the 202311 base image with the following configuration entered manually:
- Creation of a dozen or so VLANs
- Creation and assignment of a handful of IPv4 and IPv6 addresses
- Assignment of VLANs to physical interfaces
- Enabling the dhcp-relay feature and assignment of that to a couple of VLAN interfaces
- Setting the docker_routing_config_mode parameter to "split" in config_db.json to allow persistence of configuration edited via vtysh
- Implementing basic BGP configuration to share routes with my border router (a rough sketch of the equivalent commands is below)
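For reference, the steps above map onto roughly the following commands. This is only a sketch: the VLAN IDs, interface names, addresses, and AS numbers here are made up for illustration and are not my exact configuration.
# VLANs and membership (illustrative IDs and ports)
sudo config vlan add 100
sudo config vlan member add 100 Ethernet4
# IPv4 and IPv6 addresses on the VLAN interface
sudo config interface ip add Vlan100 192.0.2.1/24
sudo config interface ip add Vlan100 2001:db8:100::1/64
# Enable dhcp-relay and point a VLAN at an (illustrative) DHCP server
sudo config feature state dhcp_relay enabled
sudo config vlan dhcp_relay add 100 192.0.2.254
# Set DEVICE_METADATA.localhost.docker_routing_config_mode to "split"
# by editing /etc/sonic/config_db.json, then reload
sudo config reload -y
# Basic BGP toward the border router, entered via vtysh (placeholder ASNs)
sudo vtysh -c 'configure terminal' \
           -c 'router bgp 65001' \
           -c 'neighbor 192.0.2.254 remote-as 65000' \
           -c 'address-family ipv4 unicast' \
           -c 'redistribute connected' \
           -c 'exit-address-family' \
           -c 'end' -c 'write'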
But I see the same behavior on this platform regardless of configuration (even with a build that layers on some significant changes of my own related to OSPF management and DB-integrated routing configuration). Ever since that point in mid-December, every build has shown this cascading container failure, and the set of functionality I use has stayed consistent throughout (VLANs, dhcp-relay, BGP, etc.).
Due to merges like https://github.com/sonic-net/sonic-buildimage/pull/18038, I can't build from commits as far back as 12/2023 anymore (the files referenced in the older commits are no longer available). I've tried cherry-picking some commits to see if I can get the updated URLs merged without whatever SWSS changes (presumably) are causing the failures, but I haven't been successful yet.
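For what it's worth, the cherry-pick attempts look roughly like this; the commit hashes below are placeholders, and the assumption is that only the download-URL fixes are needed on top of the old tree:
# Pin to the last-known-good tree (placeholder reference)
git checkout <known-good-commit-from-2023-12-11>
git checkout -b n3248te-pinned
# Bring over just the build-infrastructure fixes (e.g. updated download URLs),
# leaving the later swss-related changes out; hashes are placeholders
git cherry-pick <url-fix-commit-1> <url-fix-commit-2>
# Re-sync submodules and rebuild for Broadcom
make init
make configure PLATFORM=broadcom
make target/sonic-broadcom.bin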
I'm going to try going back further to the 202305 branch and see if that's currently stable on this platform. It looks like the 202305 branch lags master more than 202311 does (i.e., less has been back-ported). Hopefully that's accurate.
202305 seems to be stable on the N3248TE platform, so the changes that are causing problems in 202311 and master were not backported to 202305.
@justindthomas, the image below seems fine; all docker services come up without any issues. https://dev.azure.com/mssonic/be1b070f-be15-4154-aade-b1d3bfb17054/_build/results?buildId=508766
Something may have been fixed in the latest 202311 branch. Please check this image, and if you still see any issues, kindly share the configuration details.
@vpsubramaniam Okay, I'll try loading up the current 202311 image tomorrow. My suspicion is that the failure is in something triggered by the configuration (maybe enabling BGP). Hopefully it's fixed, though. I'll report back.
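Once I have that build downloaded, the plan is just the usual secondary-image install; a minimal sketch (the image filename and version string below are placeholders):
# Install the pipeline image alongside the current one
sudo sonic-installer install sonic-broadcom.bin
# Confirm what is installed and what boots next
sudo sonic-installer list
# Boot the new image once for testing (placeholder version string)
sudo sonic-installer set-next-boot SONiC-OS-202311.XXXXX
sudo reboot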