sonic-buildimage

Mid-December changes to SWSS made SONiC on Dell N3248TE-ON unusable

Open justindthomas opened this issue 11 months ago • 8 comments

For months I've been running a custom build of the master branch from December 11 (with some of my own changes) on my Dell N3248TE-ON. Any attempt to build from a commit later than about that date results in the docker-orchagent container periodically dying and taking everything else down with it.

I had assumed it was just because I was trying to be on the bleeding edge and figured it would be resolved eventually. Today I decided to roll back to the "current" release of 202311 from https://sonic.software with a clean configuration so that I could focus on some IPv6 work and not worry about my platform. But the problematic changes to swss seem to have been merged into that branch and I'm seeing the same behavior as when I build on master.

Here is a log of how the swss container fails:

...
2024-03-21 16:21:12,727 INFO success: gearsyncd entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2024-03-21 16:21:12,837 INFO exited: gearsyncd (exit status 0; expected)
2024-03-21 16:21:13,462 INFO success: portsyncd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-03-21 16:21:13,604 INFO spawned: 'orchagent' with pid 53
2024-03-21 16:21:14,817 INFO success: orchagent entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-03-21 16:21:16,140 INFO spawned: 'swssconfig' with pid 78
2024-03-21 16:21:16,149 INFO success: swssconfig entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2024-03-21 16:21:16,195 INFO spawned: 'coppmgrd' with pid 79
2024-03-21 16:21:16,233 INFO success: coppmgrd entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2024-03-21 16:21:21,020 INFO exited: swssconfig (exit status 0; expected)
2024-03-21 16:21:22,458 INFO spawned: 'restore_neighbors' with pid 110
2024-03-21 16:21:22,476 INFO success: restore_neighbors entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2024-03-21 16:21:22,544 INFO spawned: 'arp_update' with pid 111
2024-03-21 16:21:22,662 INFO spawned: 'neighsyncd' with pid 112
2024-03-21 16:21:22,841 INFO spawned: 'wait_for_link' with pid 115
2024-03-21 16:21:22,866 INFO success: wait_for_link entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2024-03-21 16:21:23,012 INFO spawned: 'vlanmgrd' with pid 116
2024-03-21 16:21:23,297 INFO spawned: 'intfmgrd' with pid 120
2024-03-21 16:21:23,497 INFO spawned: 'portmgrd' with pid 121
2024-03-21 16:21:23,565 INFO success: arp_update entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-03-21 16:21:23,626 INFO spawned: 'buffermgrd' with pid 124
2024-03-21 16:21:23,662 INFO success: neighsyncd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-03-21 16:21:23,774 INFO spawned: 'enable_counters' with pid 127
2024-03-21 16:21:23,870 INFO spawned: 'vrfmgrd' with pid 137
2024-03-21 16:21:23,920 INFO spawned: 'nbrmgrd' with pid 139
2024-03-21 16:21:24,033 INFO success: vlanmgrd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-03-21 16:21:24,135 INFO spawned: 'vxlanmgrd' with pid 146
2024-03-21 16:21:24,243 INFO spawned: 'fdbsyncd' with pid 149
2024-03-21 16:21:24,272 INFO success: intfmgrd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-03-21 16:21:24,410 INFO spawned: 'tunnelmgrd' with pid 155
2024-03-21 16:21:24,474 INFO success: portmgrd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-03-21 16:21:24,601 INFO success: buffermgrd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-03-21 16:21:24,763 INFO success: enable_counters entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-03-21 16:21:24,845 INFO success: vrfmgrd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-03-21 16:21:24,919 INFO success: nbrmgrd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-03-21 16:21:25,077 INFO success: vxlanmgrd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-03-21 16:21:25,221 INFO success: fdbsyncd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-03-21 16:21:25,350 INFO success: tunnelmgrd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-03-21 16:21:25,593 INFO exited: wait_for_link (exit status 0; expected)
2024-03-21 16:21:26,517 INFO spawned: 'wait_for_link' with pid 264
2024-03-21 16:21:26,536 INFO success: wait_for_link entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2024-03-21 16:21:28,007 INFO exited: wait_for_link (exit status 0; expected)
2024-03-21 16:21:28,727 INFO spawned: 'wait_for_link' with pid 276
2024-03-21 16:21:28,734 INFO success: wait_for_link entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2024-03-21 16:21:29,667 INFO exited: wait_for_link (exit status 0; expected)
2024-03-21 16:21:36,909 INFO exited: restore_neighbors (exit status 0; expected)
2024-03-21 16:21:41,009 INFO spawned: 'ndppd' with pid 390
2024-03-21 16:21:42,013 INFO success: ndppd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-03-21 16:21:43,529 INFO exited: dependent-startup (exit status 0; expected)
2024-03-21 16:24:24,597 INFO exited: enable_counters (exit status 0; expected)
2024-03-21 16:25:18,814 INFO exited: orchagent (terminated by SIGABRT (core dumped); not expected)
2024-03-21 16:25:19,825 WARN received SIGTERM indicating exit request
2024-03-21 16:25:19,826 INFO waiting for supervisor-proc-exit-listener, rsyslogd, portsyncd, coppmgrd, arp_update, ndppd, neighsyncd, vlanmgrd, intfmgrd, portmgrd, buffermgrd, vrfmgrd, nbrmgrd, vxlanmgrd, fdbsyncd, tunnelmgrd to die
2024-03-21 16:25:19,831 INFO stopped: tunnelmgrd (terminated by SIGTERM)
2024-03-21 16:25:19,839 INFO stopped: fdbsyncd (terminated by SIGTERM)
2024-03-21 16:25:19,845 INFO stopped: vxlanmgrd (terminated by SIGTERM)
2024-03-21 16:25:19,850 INFO stopped: nbrmgrd (terminated by SIGTERM)
2024-03-21 16:25:19,854 INFO stopped: vrfmgrd (terminated by SIGTERM)
2024-03-21 16:25:20,859 INFO stopped: buffermgrd (terminated by SIGTERM)
2024-03-21 16:25:20,862 INFO stopped: portmgrd (terminated by SIGTERM)
2024-03-21 16:25:20,866 INFO stopped: intfmgrd (terminated by SIGTERM)
2024-03-21 16:25:20,868 INFO stopped: vlanmgrd (terminated by SIGTERM)
2024-03-21 16:25:21,872 INFO stopped: neighsyncd (terminated by SIGTERM)
2024-03-21 16:25:21,876 INFO stopped: ndppd (exit status 0)
2024-03-21 16:25:21,878 INFO stopped: arp_update (terminated by SIGTERM)
2024-03-21 16:25:21,882 INFO stopped: coppmgrd (terminated by SIGTERM)
2024-03-21 16:25:22,884 INFO waiting for supervisor-proc-exit-listener, rsyslogd, portsyncd to die
2024-03-21 16:25:23,888 INFO stopped: portsyncd (terminated by SIGTERM)
2024-03-21 16:25:25,896 INFO waiting for supervisor-proc-exit-listener, rsyslogd to die
2024-03-21 16:25:29,455 INFO waiting for supervisor-proc-exit-listener, rsyslogd to die
2024-03-21 16:25:33,046 INFO waiting for supervisor-proc-exit-listener, rsyslogd to die
2024-03-21 16:25:35,044 WARN killing 'rsyslogd' (41) with SIGKILL
2024-03-21 16:25:35,046 INFO stopped: rsyslogd (terminated by SIGKILL)
2024-03-21 16:25:35,051 INFO stopped: supervisor-proc-exit-listener (terminated by SIGTERM)
admin@sonic:~$ Shared connection to 10.100.0.2 closed.

From there, the system tries to restart everything, but the whole thing just cycles from failure to failure. Note that this starts a few minutes after the system has come up and is successfully passing traffic.
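For anyone triaging a similar log, here is a minimal sketch of how the unexpected exit could be pulled out of supervisord output along with how long the process had been up. The regular expression and the function names are my own assumptions based only on the line format shown above; this is not a SONiC tool.

```python
import re
from datetime import datetime

# Assumed line format, taken from the supervisord log above, e.g.:
# 2024-03-21 16:25:18,814 INFO exited: orchagent (terminated by SIGABRT (core dumped); not expected)
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) \w+ "
    r"(?P<event>spawned|exited): '?(?P<proc>[\w-]+)'?"
    r"(?P<rest>.*)$"
)


def parse_ts(text):
    """Parse a supervisord timestamp such as '2024-03-21 16:25:18,814'."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S,%f")


def unexpected_exits(lines):
    """Return (process, seconds_alive, detail) for each 'not expected' exit."""
    spawned = {}
    results = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # 'success', 'stopped', and other lines are ignored
        ts = parse_ts(m.group("ts"))
        proc = m.group("proc")
        if m.group("event") == "spawned":
            spawned[proc] = ts  # remember the most recent spawn time
        elif "not expected" in m.group("rest"):
            started = spawned.get(proc)
            alive = (ts - started).total_seconds() if started else None
            results.append((proc, alive, m.group("rest").strip()))
    return results


# Three lines copied from the log above.
SAMPLE = [
    "2024-03-21 16:21:13,604 INFO spawned: 'orchagent' with pid 53",
    "2024-03-21 16:24:24,597 INFO exited: enable_counters (exit status 0; expected)",
    "2024-03-21 16:25:18,814 INFO exited: orchagent (terminated by SIGABRT (core dumped); not expected)",
]

for proc, alive, detail in unexpected_exits(SAMPLE):
    print(proc, alive, detail)
```

On the sample lines this flags only orchagent, which matches the observation that the crash comes a few minutes after startup rather than immediately.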

This is the version I'm running:

admin@sonic:~$ sudo show ver

SONiC Software Version: SONiC.202311.503318-8e0ce727a
SONiC OS Version: 11
Distribution: Debian 11.9
Kernel: 5.10.0-23-2-amd64
Build commit: 8e0ce727a
Build date: Wed Mar 20 12:43:22 UTC 2024
Built by: AzDevOps@vmss-soni003B4G

Platform: x86_64-dellemc_n3248te_c3338-r0
HwSKU: DellEMC-N3248TE
ASIC: broadcom
ASIC Count: 1
Serial Number: 4GNXV43
Model Number: 0WNWT9
Hardware Revision:
Uptime: 16:31:55 up 11 min,  2 users,  load average: 2.48, 2.54, 1.62
Date: Thu 21 Mar 2024 16:31:55

Docker images:
REPOSITORY                    TAG                       IMAGE ID       SIZE
docker-gbsyncd-broncos        202311.503318-8e0ce727a   6e24b2fbe0aa   351MB
docker-gbsyncd-broncos        latest                    6e24b2fbe0aa   351MB
docker-gbsyncd-credo          202311.503318-8e0ce727a   509554408761   324MB
docker-gbsyncd-credo          latest                    509554408761   324MB
docker-syncd-brcm             202311.503318-8e0ce727a   4d41bc2ec83a   715MB
docker-syncd-brcm             latest                    4d41bc2ec83a   715MB
docker-orchagent              202311.503318-8e0ce727a   10581fe64884   339MB
docker-orchagent              latest                    10581fe64884   339MB
docker-fpm-frr                202311.503318-8e0ce727a   5dec19056997   359MB
docker-fpm-frr                latest                    5dec19056997   359MB
docker-nat                    202311.503318-8e0ce727a   121fbe0018fc   330MB
docker-nat                    latest                    121fbe0018fc   330MB
docker-sflow                  202311.503318-8e0ce727a   81ed6c583e1c   329MB
docker-sflow                  latest                    81ed6c583e1c   329MB
docker-teamd                  202311.503318-8e0ce727a   b4c29deb0605   327MB
docker-teamd                  latest                    b4c29deb0605   327MB
docker-macsec                 latest                    a901182c73ab   329MB
docker-platform-monitor       202311.503318-8e0ce727a   af9df86136ea   421MB
docker-platform-monitor       latest                    af9df86136ea   421MB
docker-snmp                   202311.503318-8e0ce727a   83fc30be02a0   340MB
docker-snmp                   latest                    83fc30be02a0   340MB
docker-dhcp-relay             latest                    9ef4cb1ab6d8   310MB
docker-eventd                 202311.503318-8e0ce727a   ba53a6c2a513   301MB
docker-eventd                 latest                    ba53a6c2a513   301MB
docker-mux                    202311.503318-8e0ce727a   5188a2c9e521   349MB
docker-mux                    latest                    5188a2c9e521   349MB
docker-lldp                   202311.503318-8e0ce727a   dfd9b9b2bfd2   343MB
docker-lldp                   latest                    dfd9b9b2bfd2   343MB
docker-sonic-gnmi             202311.503318-8e0ce727a   b1df84b4cefb   389MB
docker-sonic-gnmi             latest                    b1df84b4cefb   389MB
docker-database               202311.503318-8e0ce727a   7055e54e5f0d   301MB
docker-database               latest                    7055e54e5f0d   301MB
docker-router-advertiser      202311.503318-8e0ce727a   ee1417459cab   301MB
docker-router-advertiser      latest                    ee1417459cab   301MB
docker-sonic-mgmt-framework   202311.503318-8e0ce727a   258548849abd   416MB
docker-sonic-mgmt-framework   latest                    258548849abd   416MB

justindthomas, Mar 21 '24 16:03

@jeff-yin This might need attention from Dell. I saw a comment on another issue where the symptom is similar (multi-container failure on Dell Broadcom (Trident) units), although the direct cause may be different (I'm not using subinterfaces).

https://github.com/sonic-net/sonic-buildimage/issues/18237#issuecomment-1998700675

justindthomas, Mar 26 '24 01:03

@vpsubramaniam please take a look and assign this issue to yourself.

jeff-yin, Mar 26 '24 22:03

@dgsudharsan would you be able to work with @prsunny to ensure swss does not crash on supported SAI call?

prgeor, Mar 27 '24 15:03

I still have this image installed as a secondary on my switch and can gather more logs if you need. This was installed from the 202311 base image with the following configuration entered manually:

  • Creation of a dozen or so VLANs
  • Creation and assignment of a handful of IPv4 and IPv6 addresses
  • Assignment of VLANs to physical interfaces
  • Enabling the dhcp-relay feature and assignment of that to a couple of VLAN interfaces
  • Adjusting the docker routing config mode split parameter in config_db.json to allow persistence of configuration edited via vtysh
  • Implementing basic BGP configuration to share routes with my border router

But this is the same behavior I see on this platform regardless of configuration (even using a build with some significant changes related to OSPF management and DB-integrated routing configuration layered on top). Ever since that point in mid-December, all builds display this cascading container failure. The set of functionality I use is consistent across those builds (VLANs, dhcp-relay, BGP, etc.).

justindthomas, Mar 27 '24 16:03

Due to merges like https://github.com/sonic-net/sonic-buildimage/pull/18038, I can't build from commits as far back as 12/2023 anymore (the files referenced in the older commits are no longer available). I've tried cherry-picking some commits to see if I can get the updated URLs merged without whatever SWSS changes (presumably) are causing the failures, but I haven't been successful yet.

I'm going to try going back further to the 202305 branch and see if that's currently stable on this platform. It looks to me like maybe the 202305 branch lags master more than 202311 (i.e., not as much stuff is back-ported). Hopefully that's accurate.
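One way to check how far a release branch lags master for swss is to compare the commit each branch pins for the submodule. The sketch below wraps `git ls-tree` for that purpose; the submodule path `src/sonic-swss` is an assumption about the sonic-buildimage layout, and the function names are hypothetical.

```python
import subprocess


def parse_ls_tree(line):
    """Parse one `git ls-tree` line: '<mode> <type> <sha>\\t<path>'."""
    meta, path = line.rstrip("\n").split("\t", 1)
    mode, obj_type, sha = meta.split()
    return {"mode": mode, "type": obj_type, "sha": sha, "path": path}


def submodule_commit(ref, path="src/sonic-swss", repo="."):
    """Return the commit that `ref` pins for the submodule at `path`.

    Assumption: `path` is where sonic-buildimage keeps the swss submodule.
    """
    out = subprocess.run(
        ["git", "-C", repo, "ls-tree", ref, path],
        check=True, capture_output=True, text=True,
    ).stdout
    if not out.strip():
        raise ValueError(f"{path} not found at {ref}")
    entry = parse_ls_tree(out.splitlines()[0])
    if entry["type"] != "commit":  # submodules appear as 'commit' entries
        raise ValueError(f"{path} is not a submodule at {ref}")
    return entry["sha"]
```

Run inside a sonic-buildimage checkout, something like `submodule_commit("origin/202305")` and `submodule_commit("origin/202311")` (the remote name is an assumption) would give two hashes whose dates can then be compared in the swss repository.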

justindthomas, Mar 28 '24 17:03

202305 seems to be stable on the N3248TE platform, so the changes that are causing problems in 202311 and master were not backported to 202305.

justindthomas, Apr 01 '24 04:04

@justindthomas, the image below seems fine; all Docker services come up without any issues. https://dev.azure.com/mssonic/be1b070f-be15-4154-aade-b1d3bfb17054/_build/results?buildId=508766

Something was probably fixed in the latest 202311 branch. Please check this image, and if you still see any issues, kindly share the configuration details.

vpsubramaniam, Apr 09 '24 06:04

@vpsubramaniam Okay, I'll try loading up the current 202311 image tomorrow. My suspicion is that the failure is in something that's activated by the configuration (e.g., maybe the activation of BGP). Hopefully it's fixed, though. I'll report back.

justindthomas, Apr 12 '24 00:04