metaflow
[Ready for Review] Improve native resume
Extend the resume speedup change to native (local) resume.
Note:
- We go over all the successful tasks and copy them first. Then we reconstruct the runtime queue and continue from there.
- We keep the UBF resume behavior the same, i.e., if some mapper tasks fail, all mapper tasks will be rerun on resume.
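The two notes above can be sketched roughly as follows. This is a simplified, step-level model and not the actual runtime code in this PR: `plan_resume`, `graph`, `origin_tasks`, and `ubf_steps` are hypothetical names made up for illustration, and a task id of `None` stands for "launch a fresh task".

```python
from collections import deque


def plan_resume(graph, origin_tasks, ubf_steps=frozenset()):
    """Hypothetical two-phase resume planner (illustration only).

    graph:        step name -> list of child step names
    origin_tasks: step name -> list of (task_id, succeeded) from the origin run
    ubf_steps:    step names whose tasks are UBF mapper tasks
    """
    # Phase 1: pick the origin tasks that can be copied as-is.
    cloned = []
    rerun = {}
    for step in graph:
        tasks = origin_tasks.get(step, [])
        failed = [tid for tid, ok in tasks if not ok]
        if step in ubf_steps and failed:
            # UBF: one failed mapper means every mapper task is rerun.
            rerun[step] = [tid for tid, _ in tasks]
        else:
            cloned.extend((step, tid) for tid, ok in tasks if ok)
            rerun[step] = failed

    # A step is complete if it ran in the origin run and nothing needs rerunning.
    complete = {s for s in graph if origin_tasks.get(s) and not rerun[s]}

    # Phase 2: rebuild the queue from incomplete steps whose parents are complete.
    parents = {s: [] for s in graph}
    for step, children in graph.items():
        for child in children:
            parents[child].append(step)

    queue = deque()
    for step in graph:
        if step in complete:
            continue
        if not all(p in complete for p in parents[step]):
            continue
        if rerun[step]:
            queue.extend((step, tid) for tid in rerun[step])
        else:
            queue.append((step, None))  # step never started in the origin run
    return cloned, queue


graph = {"start": ["train"], "train": ["join"], "join": ["end"], "end": []}
origin = {"start": [(1, True)], "train": [(2, True), (3, False), (4, True)]}

cloned, queue = plan_resume(graph, origin)
ubf_cloned, ubf_queue = plan_resume(graph, origin, ubf_steps={"train"})
```

In the regular foreach case only the failed `train` task is queued and its successful siblings are copied; when `train` is treated as a UBF step, none of its tasks are copied and all three are queued again.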
A few test example flows:
- Resume w/ some branch: link
- Resume during failure on foreach split: link
- Resume during failure on UBF split: link
- Resume during failure on UBF join: link
To test the above flows:
python flow.py run
python flow.py resume
Benchmark:
Open question:
- We only consider the last "run" and never the last "resume", so running resume multiple times anchors on the same run_id (instead of the previous resume). If resume is run consecutively, should we continue from the previous resume?
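To make the open question concrete, here is a minimal sketch of the two anchoring choices. It assumes a simple local history of executions and is not how the origin run is actually looked up; `pick_origin`, `history`, and `follow_resumes` are hypothetical names, and the run ids are made up.

```python
def pick_origin(history, follow_resumes=False):
    """Hypothetical helper: choose the run_id that a new resume anchors on.

    history: list of (run_id, kind), oldest first, kind is "run" or "resume".
    follow_resumes=False mirrors the behavior described above (last "run" only).
    follow_resumes=True is the alternative raised in the open question.
    """
    for run_id, kind in reversed(history):
        if follow_resumes or kind == "run":
            return run_id
    return None  # nothing to resume from


history = [("100", "run"), ("101", "resume"), ("102", "resume")]
current = pick_origin(history)
alternative = pick_origin(history, follow_resumes=True)
```

With this history, the current behavior anchors every resume on run `100`, while the alternative would anchor on the most recent resume, `102`.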
@darinyu can you walk me through the nature of changes for this PR so that we can have a quick review turnaround?
not related to this PR but maybe we can mute this log line
ready to go