torchtune
[feature request] support input/output to fsspec path
Support loading models from S3 and saving checkpoints to an fsspec path.
Hi @leoleoasd thanks for creating the issue. @joecummings is working on updating some of our checkpointing abstractions, so he may have some thoughts on this.
I looked into this a little and found that safetensors' safe_open uses Rust's File::open and mmap:
https://github.com/huggingface/safetensors/blob/e61e87240d0eabc9749a67ccebe38dca620d48b4/bindings/python/src/lib.rs#L396-L399
So this may not be possible with safetensors?
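For context, the mmap constraint applies to safe_open, not to the file format. A safetensors file starts with an 8-byte little-endian header length, followed by a JSON header, followed by the raw tensor bytes, so it can be read from any binary file-like object. Below is a minimal sketch of that, using only the standard library. The names `read_safetensors_header` and `build_example_file` are made up for illustration and are not part of safetensors or torchtune.

```python
import io
import json
import struct
from typing import BinaryIO


def read_safetensors_header(f: BinaryIO) -> dict:
    """Parse the JSON header of a safetensors file from a binary file-like object.

    Hypothetical helper for illustration; not a safetensors API.
    """
    # The first 8 bytes hold the header length as a little-endian unsigned 64-bit int.
    raw_len = f.read(8)
    if len(raw_len) != 8:
        raise ValueError("file too short to be a safetensors file")
    (header_len,) = struct.unpack("<Q", raw_len)

    # The header itself is UTF-8 JSON mapping tensor names to dtype/shape/offsets.
    header_bytes = f.read(header_len)
    if len(header_bytes) != header_len:
        raise ValueError("truncated safetensors header")
    return json.loads(header_bytes)


def build_example_file() -> bytes:
    """Build a tiny in-memory safetensors-style file holding one float32 tensor."""
    header = {"w": {"dtype": "F32", "shape": [2], "data_offsets": [0, 8]}}
    header_bytes = json.dumps(header).encode("utf-8")
    payload = struct.pack("<2f", 1.0, 2.0)  # two float32 values, 8 bytes total
    return struct.pack("<Q", len(header_bytes)) + header_bytes + payload


header = read_safetensors_header(io.BytesIO(build_example_file()))
print(header["w"]["shape"], header["w"]["dtype"])
```

The same function should accept a file object returned by fsspec.open(url, "rb"), assuming the matching filesystem backend (for example s3fs for S3) is installed. For loading full tensors without mmap, safetensors.torch.load accepts the file contents as bytes, at the cost of holding the whole file in memory.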
I'll have to dig a little deeper on safetensors specifically, but I think the PyTorch Checkpointing team has worked on enabling S3 read/write through DCP StorageReaders and StorageWriters: https://pytorch.org/docs/main/distributed.checkpoint.html#torch.distributed.checkpoint.StorageReader
@saumishr / @ankitageorge Is there a good resource for how to create your own adapter and what might be needed for S3?
I have a question about Custom Checkpointing interfaces.
I want to write a custom checkpointing class that uploads the data to S3 (to a path of my choosing) once the current interface finishes writing it to disk. As long as the current interface returns a path to the written checkpoint along with some metadata about what was written (loss/epoch/step), I can figure out the rest. Is the current interface capable of this?
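To make the request concrete, here is a rough sketch of the pattern I have in mind: a thin wrapper that calls an existing local save, then pushes the resulting directory to a remote location. Everything here is hypothetical (`UploadingCheckpointer`, `SavedCheckpoint`, `save_fn`, `upload_fn`, `fsspec_upload`, `demo_local`) and is not torchtune's checkpointer API. The fsspec upload path assumes fsspec plus the relevant backend are installed, and the demo at the bottom swaps in a local copy so it runs without either.

```python
import shutil
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Callable, Dict


@dataclass
class SavedCheckpoint:
    """What the wrapper hands back: local path plus metadata (hypothetical type)."""
    path: Path
    metadata: Dict[str, Any] = field(default_factory=dict)


def fsspec_upload(local_dir: Path, remote_url: str) -> None:
    """Upload a directory with fsspec. Assumes fsspec and a backend such as s3fs."""
    from fsspec.core import url_to_fs  # imported lazily so the demo needs no fsspec

    fs, remote_path = url_to_fs(remote_url)
    fs.put(str(local_dir), remote_path, recursive=True)


class UploadingCheckpointer:
    """Wraps a local save function and uploads its output afterwards."""

    def __init__(
        self,
        save_fn: Callable[[Any], Path],           # writes to disk, returns the path
        upload_fn: Callable[[Path, str], None],   # e.g. fsspec_upload
        remote_root: str,                         # e.g. an s3:// prefix
    ) -> None:
        self.save_fn = save_fn
        self.upload_fn = upload_fn
        self.remote_root = remote_root.rstrip("/")

    def save(self, state: Any, *, epoch: int, step: int, loss: float) -> SavedCheckpoint:
        local_path = Path(self.save_fn(state))
        remote = f"{self.remote_root}/epoch_{epoch}_step_{step}"
        self.upload_fn(local_path, remote)
        meta = {"epoch": epoch, "step": step, "loss": loss, "remote": remote}
        return SavedCheckpoint(path=local_path, metadata=meta)


def demo_local() -> SavedCheckpoint:
    """Run the wrapper end to end, using a local directory copy as the 'upload'."""
    src_root = Path(tempfile.mkdtemp())
    dst_root = Path(tempfile.mkdtemp())

    def save_fn(state: Any) -> Path:
        out = src_root / "ckpt"
        out.mkdir()
        (out / "weights.txt").write_text(repr(state))
        return out

    def local_copy(local_dir: Path, remote_url: str) -> None:
        shutil.copytree(local_dir, remote_url)

    ckpt = UploadingCheckpointer(save_fn, local_copy, str(dst_root))
    return ckpt.save({"w": [1.0, 2.0]}, epoch=0, step=10, loss=0.5)
```

With this shape, the only thing I need from the existing interface is the path it wrote to. The loss/epoch/step values can be passed in from the training loop if the checkpointer does not return them.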