flyte
[Core Feature] Allow tasks/config to specify max queue/wait time
Motivation: Why do you think this is important? When the underlying execution engine (AWS Batch, K8s, Spark, Hive, AWS EMR, GCP BigQuery, etc.) has trouble scheduling Flyte workloads, a workload can get stuck. Flyte has a concept of timeout, but it only measures the overall execution timeout, so users cannot express how long they are willing to wait in a queue before a task starts executing.
Goal: What should the final outcome look like, ideally?
Expose an additional queue_timeout flag that can be set at global scope through configs or at task scope (ideally also at the project/domain/WF levels). When flytepropeller detects that a task hasn't started executing within that period, it should abort the task.
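To make the proposed behavior concrete, here is a minimal sketch of the check in plain Python. It is not flytepropeller code: the function names, the `queued_at`/`started_at` timestamps, and the rule that a task-level value overrides the global one are all assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional


def resolve_queue_timeout(
    task_level: Optional[timedelta],
    global_level: Optional[timedelta],
) -> Optional[timedelta]:
    """Pick the effective limit; the task-scope value wins (assumed precedence)."""
    return task_level if task_level is not None else global_level


def queue_timeout_exceeded(
    queued_at: datetime,
    started_at: Optional[datetime],
    now: datetime,
    queue_timeout: Optional[timedelta],
) -> bool:
    """Return True if a task that has not started has waited longer than queue_timeout."""
    if queue_timeout is None:
        # No limit configured at any scope, so never abort for queueing.
        return False
    if started_at is not None:
        # The task is already executing; the queue timeout no longer applies.
        return False
    return now - queued_at > queue_timeout
```

A caller would evaluate `queue_timeout_exceeded(...)` on each pass over a task that is still waiting and abort the task when it returns `True`.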
Hello 👋, This issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will close the issue if we detect no activity in the next 7 days. Thank you for your contribution and understanding! 🙏
Hello 👋, This issue has been inactive for over 9 months and hasn't received any updates since it was marked as stale. We'll be closing this issue for now, but if you believe this issue is still relevant, please feel free to reopen it. Thank you for your contribution and understanding! 🙏
While it's nice that we can now configure a pod-pending-timeout, we have pretty much the opposite problem:
We thought that the timeout really refers to only the actual execution time (without pending time) as the docs suggest:
https://docs.flyte.org/en/latest/api/flytekit/generated/flytekit.TaskMetadata.html?highlight=timeout#flytekit.TaskMetadata
We can only set a meaningful timeout for the execution itself, and tasks are sometimes queued for quite a long time until a GPU node becomes available (which is what we want).
So IMHO we should change the current timeout to mean only the execution timeout (as per the current docs), or add a new param that sets only the execution timeout.
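To illustrate the difference between the two readings of the timeout, here is a small sketch in plain Python. The `timed_out` helper and its `include_queue_time` switch are hypothetical and only model the two interpretations; they are not Flyte or flytekit APIs, and the timestamps are made up.

```python
from datetime import datetime, timedelta


def timed_out(queued_at, started_at, now, timeout, include_queue_time):
    """Check one timeout value under one of the two interpretations."""
    # Overall semantics start the clock at queue time,
    # execution-only semantics start it when the task begins running.
    clock_start = queued_at if include_queue_time else started_at
    return now - clock_start > timeout


queued = datetime(2024, 1, 1, 9, 0)
started = queued + timedelta(hours=3)   # waited 3 hours for a GPU node
now = started + timedelta(minutes=30)   # has been running for 30 minutes
limit = timedelta(hours=1)

overall = timed_out(queued, started, now, limit, include_queue_time=True)
execution_only = timed_out(queued, started, now, limit, include_queue_time=False)
print(overall, execution_only)  # prints: True False
```

With a one-hour limit, the task is considered timed out if queue time counts, but is still well within budget if only execution time counts, which is the behavior we are asking for.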