[WIP] [Feat-v2] Enable memory increase on OOM failure

Open • Mecoli1219 opened this issue 9 months ago • 1 comment

Tracking issue

https://github.com/flyteorg/flyte/issues/2234

Why are the changes needed?

This PR is not ready yet

What changes were proposed in this pull request?

  • [x] A new Retry dataclass that specifies how Flyte handles OOM events
  • [ ] Add more unit tests
  • [ ] Add helper function (or maybe open a new PR in the future)
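For readers who want a picture of the shape of the new dataclass, here is a minimal sketch. The class and field names (Retry, OnOOM, Backoff, attempts, on_oom, backoff, factor, limit, exponent, max) are taken from the test snippet in this description; the field types, the defaults, and the factor check are assumptions made for illustration and may not match what the PR actually implements.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Backoff:
    # Assumed types and defaults; only the field names come from the PR description.
    exponent: int = 2
    max: Optional[timedelta] = None


@dataclass(frozen=True)
class OnOOM:
    # Assumed meaning: multiply memory by `factor` after an OOM, up to `limit`.
    factor: float = 1.0
    limit: Optional[str] = None
    backoff: Optional[Backoff] = None

    def __post_init__(self) -> None:
        # Hypothetical validation, not confirmed by the PR.
        if self.factor < 1.0:
            raise ValueError("factor must be >= 1.0")


@dataclass(frozen=True)
class Retry:
    attempts: int = 0
    on_oom: Optional[OnOOM] = None
```

Nesting the OOM policy inside Retry keeps the existing retry count and the new memory-scaling behavior in one object, which matches how the test snippet passes it to `retries=`.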

How was this patch tested?

  1. Compile https://github.com/flyteorg/flyte/pull/6293 to a single binary
  2. Run the single binary
  3. Run the code:
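from datetime import timedelta

# Retry, OnOOM and Backoff are the new types introduced by this PR.
from flytekit import Backoff, ImageSpec, OnOOM, Resources, Retry, task, workflow

# `image` was not defined in the original snippet. This is a placeholder
# (assumption): point `registry` at a registry your cluster can pull from.
image = ImageSpec(registry="localhost:30000")


@task(
    requests=Resources(mem="16Mi"),
    limits=Resources(mem="64Mi"),
    retries=Retry(
        attempts=5,
        on_oom=OnOOM(
            backoff=Backoff(exponent=4, max=timedelta(minutes=2)),
            factor=1.2,
            limit="30Mi",
        ),
    ),
    container_image=image,
)
def oom_task() -> int:
    # Each `a + [1]` allocates a new list, so memory churns on every iteration.
    a = [1]
    for i in range(10000):
        a = a + [1]
    return 10


@workflow
def oom_wf() -> int:
    return oom_task()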
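To make the `factor` and `limit` settings above concrete, the sketch below computes a possible memory schedule across attempts. It assumes that memory is multiplied by `factor` after each OOM failure and capped at `limit`, and that scaling starts from the task's memory request; neither assumption is confirmed in this description. `parse_mem` and `memory_schedule` are hypothetical helpers, not flytekit functions, and backoff timing is not modeled.

```python
import re
from typing import List

# Binary suffixes as used in Kubernetes-style memory quantities.
_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_mem(quantity: str) -> int:
    """Convert a quantity such as '16Mi' to bytes (hypothetical helper)."""
    match = re.fullmatch(r"(\d+)(Ki|Mi|Gi)", quantity)
    if match is None:
        raise ValueError(f"unsupported memory quantity: {quantity!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def memory_schedule(start: str, factor: float, limit: str, attempts: int) -> List[int]:
    """Return the memory (in bytes) assumed for each attempt.

    Assumption: every attempt after the first follows an OOM failure, so
    memory grows by `factor` each time and never exceeds `limit`.
    """
    cap = parse_mem(limit)
    current = float(parse_mem(start))
    schedule = []
    for _ in range(attempts):
        schedule.append(min(int(current), cap))
        current *= factor
    return schedule


if __name__ == "__main__":
    for n, mem in enumerate(memory_schedule("16Mi", 1.2, "30Mi", 5), start=1):
        print(f"attempt {n}: {mem / 2**20:.2f} Mi")
```

Under these assumptions, the settings in the test snippet start at 16Mi, grow by 20% per failure, and reach the 30Mi cap on the fifth attempt.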

Setup process

Screenshots

Check all the applicable boxes

  • [ ] I updated the documentation accordingly.
  • [ ] All new and existing tests passed.
  • [ ] All commits are signed-off.

Related PRs

https://github.com/flyteorg/flyte/pull/6293

Docs link

Mecoli1219, Mar 02 '25 03:03

Code Review Agent Run Status

  • Limitations and other issues: Failure. The AI Code Review Agent skipped reviewing this change because it is configured to exclude certain pull requests based on the source/target branch or the pull request status.

flyte-bot, Mar 02 '25 03:03