[BUG] flytescheduler Pod crashloops temporarily

Open gdabisias opened this issue 2 years ago • 4 comments

Describe the bug

In the flyte-core deployment, the flytescheduler pod crashloops a few times right after deployment because it tries to connect to flyteadmin, which may not yet be ready to accept connections. After a few seconds it stabilises.

error:

panic: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp xxx.xxx.xxx.xxx:81: connect: connection refused"

The scheduler should wait until flyteadmin is up and running before trying to connect to it.

Expected behavior

The flytescheduler should not crash loop at startup.

Additional context to reproduce

Logs from the flytescheduler pod show:

panic: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp xxx.xxx.xxx.xxx:81: connect: connection refused"

The scheduler apparently connects to flyteadmin before flyteadmin is up and healthy. After a few retries it succeeds, so the scheduler should wait for flyteadmin to be up before starting.

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • [X] Yes

Have you read the Code of Conduct?

  • [X] Yes

gdabisias avatar Aug 29 '23 16:08 gdabisias

Thank you for opening your first issue here! 🛠

welcome[bot] avatar Aug 29 '23 16:08 welcome[bot]

https://flyte-org.slack.com/archives/C05L83TBEB1/p1692808445913509 (private link)

wild-endeavor avatar Aug 29 '23 16:08 wild-endeavor

I think this is the correct behavior; it's just not a friendly UX. Instead of crashing, it should wait. Alternatively, we could remove the check entirely and use a healthcheck endpoint and a small container image.

wild-endeavor avatar Sep 01 '23 18:09 wild-endeavor

Hello 👋, this issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will engage on it to decide if it is still applicable. Thank you for your contribution and understanding! 🙏

github-actions[bot] avatar May 29 '24 00:05 github-actions[bot]

Hello 👋, this issue has been inactive for over 90 days and hasn't received any updates since it was marked as stale. We'll be closing this issue for now, but if you believe this issue is still relevant, please feel free to reopen it. Thank you for your contribution and understanding! 🙏

github-actions[bot] avatar May 16 '25 00:05 github-actions[bot]