Ning Wang
Do you have a minimal repro?
This looks almost good to me. Just to clarify, do you want to add a unit test for HSDP + TP checkpointing? If so, do you want to add that in this...
> Thanks for the PR! Only one minor comment in the PR. However, if `get_file` calls `download_object_or_file`, should the rest of the codebase use `get_file` instead of `download_object_or_file`? @b-chu ,...
I will not merge this PR for now, because I just found that it regresses auto-resume. With auto-resume, it will try to `get_file(checkpoint_path)` regardless of whether `checkpoint_path` exists or not...
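
To illustrate the concern, here is a minimal sketch of how an auto-resume path could tolerate a missing checkpoint. It assumes a local filesystem; `fetch_file` and `try_auto_resume` are hypothetical stand-ins for `get_file` and the auto-resume logic, not the library's actual implementation.

```python
import os
import shutil
import tempfile
from typing import Optional


def fetch_file(path: str, destination: str) -> None:
    """Hypothetical stand-in for `get_file`: copy a local file, raising if it is missing."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"checkpoint not found: {path}")
    shutil.copyfile(path, destination)


def try_auto_resume(checkpoint_path: str, destination: str) -> Optional[str]:
    """Return the fetched checkpoint path, or None when there is nothing to resume from."""
    try:
        fetch_file(checkpoint_path, destination)
    except FileNotFoundError:
        # On a first run no checkpoint exists yet, so start from scratch instead of failing.
        return None
    return destination


with tempfile.TemporaryDirectory() as tmp:
    checkpoint = os.path.join(tmp, "latest-rank0.pt")  # hypothetical file name
    dest = os.path.join(tmp, "resumed.pt")

    # First run: the checkpoint does not exist, so auto-resume is skipped.
    first_run = try_auto_resume(checkpoint, dest)

    # Later run: a checkpoint has been written, so auto-resume picks it up.
    with open(checkpoint, "wb") as f:
        f.write(b"state")
    second_run = try_auto_resume(checkpoint, dest)
    with open(dest, "rb") as f:
        resumed_bytes = f.read()
```

In this sketch the missing-file case is handled inside the auto-resume helper, so the fetch function itself can keep raising for every other caller.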