fms-fsdp
Enable asynchronous dataloading
The current dataloader still causes gradual asymptotic slowdowns, likely because we have n_workers fixed to 0 in the dataloader. This forces the main process to also handle dataloading synchronously, allowing the two tasks to interfere. We fix n_workers to 0 because the main process performs checkpointing, and if dataloading occurs in a separate worker process, the master cannot access the relevant state information from the worker.
This PR adds support for n_workers set to 1, allowing the worker to checkpoint itself at set intervals, separate from the model/optimizer checkpointing occurring in the master process. This is accomplished via a new `Checkpoint_Dataset` wrapper that performs checkpointing on set intervals. The training script and other peripherals are updated to set n_workers to 1.
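To make the idea concrete, here is a minimal sketch of a dataset wrapper that checkpoints itself from inside the iteration loop. It is not the actual `Checkpoint_Dataset` implementation: the names `CountingDataset` and `CheckpointingWrapper`, the JSON file format, and the `state_dict`/`load_state_dict` interface on the wrapped dataset are all assumptions made for illustration, and the sketch uses only the standard library rather than PyTorch.

```python
import json
import os
from typing import Iterator, Optional


class CountingDataset:
    """Hypothetical stateful dataset: yields 0, 1, 2, ... and can resume."""

    def __init__(self) -> None:
        self.position = 0

    def __iter__(self) -> Iterator[int]:
        while True:
            value = self.position
            self.position += 1
            yield value

    def state_dict(self) -> dict:
        return {"position": self.position}

    def load_state_dict(self, state: dict) -> None:
        self.position = state["position"]


class CheckpointingWrapper:
    """Hypothetical wrapper: saves the wrapped dataset's state every
    `interval` items, so the process doing the loading owns its checkpoints."""

    def __init__(self, dataset, ckpt_path: str, interval: int) -> None:
        self.dataset = dataset
        self.ckpt_path = ckpt_path
        self.interval = interval
        self.steps = 0

    def _load(self) -> Optional[dict]:
        if not os.path.exists(self.ckpt_path):
            return None
        with open(self.ckpt_path) as f:
            return json.load(f)

    def _save(self) -> None:
        # Write to a temp file first, then rename, so a crash mid-write
        # never leaves a truncated checkpoint behind.
        tmp_path = self.ckpt_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"steps": self.steps, "dataset": self.dataset.state_dict()}, f)
        os.replace(tmp_path, self.ckpt_path)

    def __iter__(self):
        saved = self._load()
        if saved is not None:
            # Resume from the last saved position.
            self.dataset.load_state_dict(saved["dataset"])
            self.steps = saved["steps"]
        for item in self.dataset:
            self.steps += 1
            if self.steps % self.interval == 0:
                self._save()
            yield item
```

Because the wrapper saves from within `__iter__`, the checkpoint is written by whichever process is iterating the dataset, which is the property that lets the loader run in a worker without the master needing access to its state. A restarted run resumes from the last interval boundary, so items consumed after the last save are replayed.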
Note that while the `Checkpoint_Dataset` class has been correctness-checked via the new unit test, the main training script has not yet been tested with these changes. We do not yet know if this PR will fix the throughput issue, and this should not be merged until we do.
Can we add some prints/logging to the new checkpointer?
- When no data ckpt is found, print something to indicate that (including the path where it looked for the ckpt), like we did in the older checkpointer.
- When loading, also print the path (i.e. where it found the data ckpt).
- When saving, also print how long it took, like we did before.
Once everything looks good, we should also clean up the old checkpointer to completely remove the data part.
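For reference, the three status reports requested above could look roughly like the following. This is only a sketch under stated assumptions: the function names `load_data_ckpt` and `save_data_ckpt`, the JSON format, and the message wording are hypothetical and are not taken from the old checkpointer.

```python
import json
import os
import time
from typing import Optional


def load_data_ckpt(path: str) -> Optional[dict]:
    """Hypothetical loader: reports whether a data checkpoint was found and where."""
    if not os.path.exists(path):
        print(f"No data checkpoint found at {path}, starting from scratch.")
        return None
    with open(path) as f:
        state = json.load(f)
    print(f"Loaded data checkpoint from {path}.")
    return state


def save_data_ckpt(state: dict, path: str) -> float:
    """Hypothetical saver: reports the save path and how long the save took."""
    start = time.perf_counter()
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename over the previous checkpoint
    elapsed = time.perf_counter() - start
    print(f"Saved data checkpoint to {path} in {elapsed:.3f}s.")
    return elapsed
```

Returning the elapsed time from the save helper, in addition to printing it, makes it easy to aggregate save overhead in the training loop if that turns out to matter for throughput.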
Added the requested status reports. I figure we'll clean up the checkpointer utility once we have this tested and working to our satisfaction.
@daviswer I just merged the latest main into this branch.
All local tests passed and perf is better.