[BUG] flytescheduler Pod crashloops temporarily
Describe the bug
In the flyte-core deployment, the flytescheduler pod crashloops a few times right after deployment because it tries to connect to flyteadmin, which may not yet be ready to accept connections. After a few seconds, things stabilise.
error:
panic: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp xxx.xxx.xxx.xxx:81: connect: connection refused"
We should wait until flyteadmin is up and running before trying to connect to it.
Expected behavior
No crash loop of the flytescheduler at start
Additional context to reproduce
Logs from the flytescheduler pod show:
panic: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp xxx.xxx.xxx.xxx:81: connect: connection refused"
Apparently, the scheduler tries to connect to flyteadmin before flyteadmin is up and healthy. After a few retries it succeeds, so we should wait for flyteadmin to be up before starting the scheduler.
Are you sure this issue hasn't been raised already?
- [X] Yes
Have you read the Code of Conduct?
- [X] Yes
https://flyte-org.slack.com/archives/C05L83TBEB1/p1692808445913509 (private link)
I think this is the correct behavior; it's just not a friendly UX. Instead of crashing, it should wait. Or we should remove the check entirely and use a healthcheck endpoint and a small container image.
Hello 👋, this issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will engage on it to decide if it is still applicable. Thank you for your contribution and understanding! 🙏
Hello 👋, this issue has been inactive for over 90 days and hasn't received any updates since it was marked as stale. We'll be closing this issue for now, but if you believe this issue is still relevant, please feel free to reopen it. Thank you for your contribution and understanding! 🙏