fms-fsdp

Enable asynchronous dataloading

Open daviswer opened this issue 10 months ago • 3 comments

The current dataloader still causes a gradual, asymptotic slowdown, likely because we have n_workers fixed to 0 in the dataloader. This forces the main process to handle dataloading synchronously alongside training, so the two tasks interfere with each other. We fix n_workers to 0 because the main process performs checkpointing, and if dataloading runs in a separate worker process, the master cannot access the relevant state information from the worker.

This PR adds support for n_workers set to 1, allowing the worker to checkpoint itself at set intervals, separately from the model/optimizer checkpointing that occurs in the master process. This is accomplished via a new Checkpoint_Dataset wrapper that performs the checkpointing. The training script and other peripherals are updated to set n_workers to 1.

Note that while the Checkpoint_Dataset class has been correctness-checked via the new unit test, the main training script has not yet been tested with these changes. We do not yet know if this PR will fix the throughput issue, and this should not be merged until we do.

daviswer Apr 17 '24 23:04
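For readers unfamiliar with the idea in the description above, here is a minimal sketch of a wrapper that checkpoints its own state at set intervals, so that no other process needs to reach into it. This is not the PR's Checkpoint_Dataset: the names `CountingSource` and `SelfCheckpointingWrapper`, the JSON file layout, and the `state_dict`/`load_state_dict` methods on the source are all assumptions made for illustration, and the sketch uses only the standard library rather than a real dataloader.

```python
import json
import os
import tempfile


class CountingSource:
    """Toy stateful data source (hypothetical stand-in for a real dataset)."""

    def __init__(self):
        self.position = 0

    def __iter__(self):
        while True:
            self.position += 1
            yield self.position

    def state_dict(self):
        return {"position": self.position}

    def load_state_dict(self, state):
        self.position = state["position"]


class SelfCheckpointingWrapper:
    """Wraps a stateful source and saves its state every `interval` items."""

    def __init__(self, source, ckpt_dir, interval):
        self.source = source
        self.ckpt_dir = ckpt_dir
        self.interval = interval
        self.steps = 0

    def _path(self):
        # Hypothetical file name; a real implementation may shard by rank/worker.
        return os.path.join(self.ckpt_dir, "loader_state.json")

    def save(self):
        os.makedirs(self.ckpt_dir, exist_ok=True)
        tmp = self._path() + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"steps": self.steps, "source": self.source.state_dict()}, f)
        # Rename last so a crash mid-write never leaves a truncated checkpoint.
        os.replace(tmp, self._path())

    def load(self):
        """Restore state if a checkpoint exists; return whether one was found."""
        if not os.path.exists(self._path()):
            return False
        with open(self._path()) as f:
            state = json.load(f)
        self.steps = state["steps"]
        self.source.load_state_dict(state["source"])
        return True

    def __iter__(self):
        for item in self.source:
            self.steps += 1
            # The wrapper saves itself; the consumer never touches its state.
            if self.steps % self.interval == 0:
                self.save()
            yield item


# Usage: consume 7 items (checkpoints are written at steps 3 and 6), then resume.
ckpt_dir = tempfile.mkdtemp()
loader = SelfCheckpointingWrapper(CountingSource(), ckpt_dir, interval=3)
stream = iter(loader)
consumed = [next(stream) for _ in range(7)]

resumed = SelfCheckpointingWrapper(CountingSource(), ckpt_dir, interval=3)
found = resumed.load()
next_item = next(iter(resumed))
```

In this sketch the resumed wrapper picks up from the last saved interval (step 6), so items handed out after that checkpoint are replayed. Whether that trade-off is acceptable depends on the checkpoint interval chosen.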

can we add some prints/logging in the new checkpointer?

  1. when no data ckpt is found, print something to indicate that (including the path where the ckpt was not found), like what we did in the older checkpointer.
  2. when loading, also print the path (i.e. where it found the data ckpt).
  3. when saving, also print how much time it took, like what we did before.

once everything is looking good, we should also clean up the old checkpointer to completely remove the data part.

lchu-ibm Apr 19 '24 14:04

Added the requested status reports. I figure we'll clean up the checkpointer utility once we have this tested and working to our satisfaction.

daviswer May 06 '24 20:05

@daviswer I just merged the latest main into this branch.

lchu-ibm May 08 '24 12:05

all local tests passed and perf is better.

lchu-ibm May 10 '24 14:05