flytekit
[WIP] [Feat-v2] Enable memory increase on OOM failure
Tracking issue
https://github.com/flyteorg/flyte/issues/2234
Why are the changes needed?
This PR is not ready yet
What changes were proposed in this pull request?
- [x] Add a new `Retry` dataclass that specifies how Flyte handles OOM events
- [ ] Add more unit tests
- [ ] Add helper function (or maybe open a new PR in the future)
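For orientation, the shape of the new configuration can be sketched with plain dataclasses. The class and field names below are taken from the test snippet in this description; the types, defaults, and field meanings are assumptions for illustration only, not the flytekit implementation in this PR.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Backoff:
    # Assumed: base of the exponential delay between retries.
    exponent: int = 2
    # Assumed: upper bound on the delay between retries.
    max: Optional[timedelta] = None


@dataclass(frozen=True)
class OnOOM:
    # Assumed: multiplier applied to the memory setting after an OOM failure.
    factor: float = 1.0
    # Assumed: ceiling for the increased memory, as a Kubernetes-style quantity.
    limit: Optional[str] = None
    # Assumed: delay policy between OOM-triggered retries.
    backoff: Optional[Backoff] = None


@dataclass(frozen=True)
class Retry:
    # Assumed: maximum number of retry attempts.
    attempts: int = 0
    # Assumed: OOM-specific behavior; None means no special handling.
    on_oom: Optional[OnOOM] = None


# Same values as the test snippet in this description.
retry = Retry(
    attempts=5,
    on_oom=OnOOM(
        backoff=Backoff(exponent=4, max=timedelta(minutes=2)),
        factor=1.2,
        limit="30Mi",
    ),
)
```

Frozen dataclasses are used here only so the sketch is hashable and immutable; the PR may choose differently.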
How was this patch tested?
- Compile https://github.com/flyteorg/flyte/pull/6293 to a single binary
- Run the single binary
- Run the code:
from datetime import timedelta

from flytekit import task, workflow, Resources, ImageSpec, Retry, OnOOM, Backoff

# NOTE: `image` is not defined in this snippet; it is presumably an
# ImageSpec built elsewhere in the test file.


@task(
    requests=Resources(mem="16Mi"),
    limits=Resources(mem="64Mi"),
    retries=Retry(
        attempts=5,
        on_oom=OnOOM(
            backoff=Backoff(exponent=4, max=timedelta(minutes=2)),
            factor=1.2,
            limit="30Mi",
        ),
    ),
    container_image=image,
)
def oom_task() -> int:
    # Grow a list by repeated concatenation to drive up memory usage.
    a = [1]
    for i in range(10000):
        a = a + [1]
    return 10


@workflow
def oom_wf() -> int:
    return oom_task()
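To make the expected effect of `factor` and `limit` concrete, here is a minimal sketch of one plausible interpretation: each OOM-triggered retry multiplies the previous memory value by `factor` and caps the result at `limit`. This rule, the helper name `memory_schedule`, and the choice of the 16Mi request as the starting value are all assumptions; the actual behavior is defined by the backend change in the related Flyte PR.

```python
from typing import List


def memory_schedule(start_mi: float, factor: float, limit_mi: float, retries: int) -> List[float]:
    """Return the memory (in Mi) for the first run plus each retry.

    Hypothetical helper: assumes memory is multiplied by `factor` after
    every OOM failure and never exceeds `limit_mi`.
    """
    if start_mi > limit_mi:
        raise ValueError("start_mi must not exceed limit_mi")
    schedule = [start_mi]
    for _ in range(retries):
        # Grow from the previous value, then clamp to the configured ceiling.
        schedule.append(min(schedule[-1] * factor, limit_mi))
    return schedule


# Values from the test task above: 16Mi request, factor 1.2, 30Mi limit, 5 attempts.
schedule = memory_schedule(start_mi=16, factor=1.2, limit_mi=30, retries=5)
print([round(m, 2) for m in schedule])
```

Under this assumed rule the value reaches the 30Mi cap by the fourth retry and stays there for the remaining attempt.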
Setup process
Screenshots
Check all the applicable boxes
- [ ] I updated the documentation accordingly.
- [ ] All new and existing tests passed.
- [ ] All commits are signed-off.
Related PRs
https://github.com/flyteorg/flyte/pull/6293
Docs link