[Core Feature] Allow tasks/config to specify max queue/wait time

Open EngHabu opened this issue 3 years ago • 3 comments

Motivation: Why do you think this is important? When the underlying execution engine (AWS Batch, K8s, Spark, Hive, AWS EMR, GCP BigQuery, etc.) has issues scheduling Flyte workloads, the workloads sometimes get stuck. While Flyte has a concept of timeout, it only measures the overall execution timeout, which doesn't let users express how long they are willing to wait in a queue before a task starts executing.

Goal: What should the final outcome look like, ideally? Expose an additional queue_timeout flag that can be set at a global scope through configs or at a task scope (ideally also at the project/domain/workflow levels). When flytepropeller detects that a task hasn't started executing within that period, it should abort the task.
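
A minimal sketch of the requested behavior, written in plain Python rather than as actual flytepropeller code. The function names `resolve_queue_timeout` and `should_abort_queued`, the `settings` dictionary, and the scope precedence (most specific scope wins) are all assumptions made for illustration; none of them exist in Flyte.

```python
from datetime import datetime, timedelta
from typing import Dict, Optional

# Hypothetical precedence, most specific scope first (an assumption, not Flyte behavior).
SCOPES = ("task", "workflow", "domain", "project", "global")


def resolve_queue_timeout(settings: Dict[str, Optional[timedelta]]) -> Optional[timedelta]:
    """Return the queue timeout from the most specific scope that sets one."""
    for scope in SCOPES:
        value = settings.get(scope)
        if value is not None:
            return value
    return None  # no queue timeout configured anywhere


def should_abort_queued(
    queued_at: datetime,
    started_at: Optional[datetime],
    now: datetime,
    queue_timeout: Optional[timedelta],
) -> bool:
    """True if the task is still waiting and has waited longer than queue_timeout."""
    if queue_timeout is None or started_at is not None:
        # Either no limit is set, or the task already started executing.
        return False
    return now - queued_at > queue_timeout
```

For example, with a 30-minute global value and a 5-minute task-level value, the task-level value would apply, and a task still queued after 6 minutes would be aborted.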

EngHabu, Jun 15 '21 18:06

Hello 👋, This issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will close the issue if we detect no activity in the next 7 days. Thank you for your contribution and understanding! 🙏

github-actions[bot], Aug 26 '23 00:08

Hello 👋, This issue has been inactive for over 9 months and hasn't received any updates since it was marked as stale. We'll be closing this issue for now, but if you believe this issue is still relevant, please feel free to reopen it. Thank you for your contribution and understanding! 🙏

github-actions[bot], Sep 03 '23 00:09

While it's nice that we can now configure a pod-pending-timeout, we have pretty much the opposite problem. We thought that the timeout refers only to the actual execution time (without pending time), as the docs suggest: https://docs.flyte.org/en/latest/api/flytekit/generated/flytekit.TaskMetadata.html?highlight=timeout#flytekit.TaskMetadata We can only set a meaningful timeout for the execution itself, and tasks are sometimes queued for quite a long time until a GPU node is available (which is what we want). So IMHO we should either change the current timeout to mean only the execution timeout (as per the current docs) or add a new param that sets only the execution timeout.
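
To make the two interpretations concrete, here is a hypothetical sketch in plain Python. The function `timed_out` and its `count_pending` switch are invented for illustration and are not part of flytekit or flytepropeller; the sketch only contrasts a clock that starts at queueing with one that starts when execution begins.

```python
from datetime import datetime, timedelta
from typing import Optional


def timed_out(
    queued_at: datetime,
    started_at: Optional[datetime],
    now: datetime,
    timeout: timedelta,
    count_pending: bool,
) -> bool:
    """Check a timeout under one of two assumed semantics.

    count_pending=True:  the clock starts when the task is queued.
    count_pending=False: the clock starts only when execution begins.
    """
    if count_pending:
        elapsed = now - queued_at
    elif started_at is None:
        return False  # still pending, so the execution-only clock has not started
    else:
        elapsed = now - started_at
    return elapsed > timeout
```

With a 1-hour timeout, a task that waited 2 hours for a GPU node and has run for 10 minutes would be timed out under the first interpretation but not under the second.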

flixr, Apr 11 '24 10:04