
On teammgrd/teamsyncd exits, return EXIT_FAILURE

Open judyjoseph opened this issue 2 years ago • 6 comments

What I did: When teammgrd or teamsyncd exits, return EXIT_FAILURE so that supervisord catches the exit and the teamd docker is restarted.

Why I did it: Fixes https://github.com/Azure/sonic-buildimage/issues/10534

I have seen this in builds from 201911 to master.

How I verified it: Checked by sending SIGTERM to the teamsyncd/teammgrd processes.


Apr 15 21:57:38.156111 str-a7280cr3-2 INFO teamd#supervisord 2022-04-15 21:57:38,155 INFO exited: teamsyncd (exit status 0; expected)

Apr 15 22:20:09.530223 str-a7280cr3-2 INFO teamd#supervisord 2022-04-15 22:20:09,529 INFO exited: teammgrd (exit status 0; expected)

-- with fix

Apr 15 22:24:39.752008 str-a7280cr3-2 INFO teamd#supervisord 2022-04-15 22:24:39,751 INFO exited: teamsyncd (exit status 1; not expected)

and the teamd docker restarts.


Details if related

judyjoseph avatar Apr 15 '22 22:04 judyjoseph

@judyjoseph IMHO, this is confusing. SIGTERM is a regular way to stop a process in Linux and the return code should be 0 if no errors observed

nazariig avatar Apr 18 '22 16:04 nazariig

@nazariig this does not pertain to SIGTERM alone; I only used SIGTERM to validate the fix. Whatever the reason teamsyncd/teammgrd comes out of the SELECT loop and exits, it is good for the teamd container to restart. For example, if teamsyncd exits silently, some of the interface events will be missed.

I see a similar approach of using "exit 1" in other orchagent daemons like portsyncd, fpmsyncd etc., so that supervisord sees a not-expected exit and restarts the container.
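
To make the pattern concrete, here is a minimal sketch of the idea, not the actual teamsyncd/teammgrd code. `EventLoop` and `run_daemon` are hypothetical names used only for illustration, and the loop is passed in as a callable so the behavior can be exercised without a real select loop. The assumption is that the loop is never supposed to return, so any path that leaves it is reported as a failure.

```cpp
#include <cstdlib>
#include <exception>
#include <functional>
#include <iostream>

// Hypothetical stand-in for the daemon's select loop; this is not a
// sonic-swss type, just a callable that is expected to run forever.
using EventLoop = std::function<void()>;

// Runs the loop and maps every way of leaving it to a non-zero status,
// so that a process supervisor treats the exit as not expected.
int run_daemon(const EventLoop& loop)
{
    try
    {
        loop();  // under normal operation this call does not return
    }
    catch (const std::exception& e)
    {
        std::cerr << "daemon: exception: " << e.what() << std::endl;
        return EXIT_FAILURE;
    }

    // Reaching this point means the loop ended on its own.
    std::cerr << "daemon: event loop ended" << std::endl;
    return EXIT_FAILURE;
}
```

A real `main()` would simply `return run_daemon(...)`, so a loop that ends quietly and a loop that throws both show up to the supervisor as exit status 1.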

judyjoseph avatar Apr 27 '22 20:04 judyjoseph

/azp run

judyjoseph avatar Apr 27 '22 20:04 judyjoseph

Azure Pipelines successfully started running 1 pipeline(s).

azure-pipelines[bot] avatar Apr 27 '22 20:04 azure-pipelines[bot]

@prsunny Could you please help review this PR? It is related to an ADO: https://msazure.visualstudio.com/One/_workitems/edit/13799016.

yozhao101 avatar May 25 '22 07:05 yozhao101

@judyjoseph what is considered to be an expected exit here? How are we going to handle graceful shutdown?

nazariig avatar Jul 06 '22 09:07 nazariig