[Ready for Review] Improve native resume

Open darinyu opened this issue 1 year ago • 2 comments

Extend the resume speedup change to native (local) resume.

Note:

  • We first go over all the successful tasks and copy them. Then we reconstruct the runtime queue to continue.
  • We keep the UBF (unbounded foreach) resume behavior the same, i.e., if some mapper tasks fail, all mapper tasks will be rerun in the resume.
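
A rough sketch of the two notes above, to make the intended behavior concrete. It is not the code in this PR: `Task`, `plan_resume`, and the `is_ubf_mapper` flag are hypothetical names used only for illustration, and the real runtime tracks much more state.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    # Hypothetical record of a task from the origin run (not a Metaflow class).
    step: str
    task_id: str
    ok: bool
    is_ubf_mapper: bool = False


def plan_resume(origin_tasks):
    """Split origin-run tasks into (tasks to copy, queue of tasks to rerun)."""
    # UBF steps in which at least one mapper failed: every mapper is rerun.
    failed_ubf_steps = {
        t.step for t in origin_tasks if t.is_ubf_mapper and not t.ok
    }
    to_copy, to_rerun = [], deque()
    for t in origin_tasks:
        rerun_whole_ubf = t.is_ubf_mapper and t.step in failed_ubf_steps
        if t.ok and not rerun_whole_ubf:
            to_copy.append(t)   # phase 1: copy successful tasks first
        else:
            to_rerun.append(t)  # phase 2: rebuild the runtime queue
    return to_copy, to_rerun
```

With this split, a successful mapper is still rerun when any sibling mapper in the same UBF step failed, while successful tasks in ordinary steps are copied as-is.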

A few test example flows:

  • Resume w/ some branch: link
  • Resume during failure on foreach split: link
  • Resume during failure on UBF split: link
  • Resume during failure on UBF join: link

To test the above flows:

python flow.py run
python flow.py resume

Benchmark: Resume-Speed-Test-Google-Docs

Open question:

  • We only consider the last "run" and never the last "resume", so running resume multiple times will anchor on the same run_id (instead of the previous resume). If resume is run consecutively, should we continue from the previous resume?
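
To make the two options in this question concrete, here is a small sketch. The `history` list, its `kind` field, and `pick_origin` are made up for illustration and are not how Metaflow stores or looks up runs.

```python
def pick_origin(history, follow_resumes=False):
    """Pick the run_id to resume from.

    history: hypothetical list of {"run_id": str, "kind": "run" | "resume"},
             ordered oldest to newest.
    follow_resumes=False mirrors the behavior described above (anchor on the
    last "run"); True is the alternative (anchor on the latest execution,
    including a previous resume).
    """
    for entry in reversed(history):
        if follow_resumes or entry["kind"] == "run":
            return entry["run_id"]
    return None  # nothing to resume from


history = [
    {"run_id": "100", "kind": "run"},
    {"run_id": "101", "kind": "resume"},
    {"run_id": "102", "kind": "resume"},
]
print(pick_origin(history))                       # anchors on the last "run"
print(pick_origin(history, follow_resumes=True))  # anchors on the last resume
```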

Jun 11 '24 18:06 darinyu

@darinyu can you walk me through the nature of changes for this PR so that we can have a quick review turnaround?

Jun 18 '24 14:06 savingoyal

not related to this PR but maybe we can mute this log line

Jun 18 '24 16:06 savingoyal

ready to go

Jul 19 '24 17:07 savingoyal