sonic-swss
On teammgrd/teamsyncd exits, return EXIT_FAILURE
What I did: When teammgrd/teamsyncd exits, return EXIT_FAILURE so that supervisord catches it and the teamd docker is restarted.
Why I did it: Fixes https://github.com/Azure/sonic-buildimage/issues/10534
I have seen this in builds from 201911 to master.
How I verified it: Checked by sending SIGTERM to the teamsyncd/teammgrd processes. Without the fix:
Apr 15 21:57:38.156111 str-a7280cr3-2 INFO teamd#supervisord 2022-04-15 21:57:38,155 INFO exited: teamsyncd (exit status 0; expected)
Apr 15 22:20:09.530223 str-a7280cr3-2 INFO teamd#supervisord 2022-04-15 22:20:09,529 INFO exited: teammgrd (exit status 0; expected)
-- with fix
Apr 15 22:24:39.752008 str-a7280cr3-2 INFO teamd#supervisord 2022-04-15 22:24:39,751 INFO exited: teamsyncd (exit status 1; not expected)
and the teamd docker restarts.
Details if related
@judyjoseph IMHO, this is confusing. SIGTERM is a regular way to stop a process in Linux and the return code should be 0 if no errors observed
@nazariig this does not pertain to SIGTERM alone; I only used SIGTERM to validate this fix. Whenever teamsyncd/teammgrd comes out of the SELECT loop and exits, for any reason, it is good for the teamd container to restart. For example, if teamsyncd exits silently, some of the interface events will be missed.
I see a similar approach of using "exit 1" in other orchagent daemons such as portsyncd and fpmsyncd, so that supervisor sees a not-expected exit and restarts the container.
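To make the idea concrete, here is a minimal sketch of the pattern described above. It is not the actual teamsyncd/teammgrd code: `runDaemon` and `LoopStep` are hypothetical names used only for illustration, and the real daemons drive their loop through the swss select machinery rather than a callback.

```cpp
#include <cstdlib>
#include <functional>

// Hypothetical stand-in for one iteration of the daemon's select loop.
// Returns true to keep running, false when the loop has to stop
// (for example a signal was received or select reported an error).
using LoopStep = std::function<bool()>;

// Runs the loop until a step asks to stop, then reports failure.
// A long-running daemon is never expected to leave its loop, so every
// exit path returns EXIT_FAILURE and the supervisor sees an exit status
// that is not in its list of expected codes.
int runDaemon(const LoopStep& step)
{
    while (step())
    {
        // Event processing happens inside step() in this sketch.
    }
    return EXIT_FAILURE;
}
```

In a real daemon, `main` would simply end with `return runDaemon(...)`, so the process exit status is 1 however the loop was left. The trade-off raised in this thread still applies: a deliberate, graceful stop is reported the same way as a silent failure.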
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
@prsunny Can you please help review this PR? It is related to an ADO: https://msazure.visualstudio.com/One/_workitems/edit/13799016.
@judyjoseph what is considered to be expected exit here? How are we going to handle graceful shutdown?