Remote checkpoint save and restore (in S3 or other cloud storage)
⚠️ Please check that this feature request hasn't been suggested before.
- [X] I searched previous Ideas in Discussions and didn't find any similar feature requests.
- [X] I searched previous Issues and didn't find any similar feature requests.
🔖 Feature description
A way to save checkpoints to a cloud object store through a configuration option such as remote_output_dir (modeled on the existing output_dir), and to restore from them. This would make training runs cheaper, since spot instances could be used and tools like SkyPilot could move a job between cloud environments as prices fluctuate.
✔️ Solution
Through a setting, the user would provide a remote URL, which axolotl would use (through a callback) to mirror checkpoints and the final model as they are saved to disk. If auto_resume_from_checkpoints or resume_from_checkpoint is set, axolotl would also load the checkpoints from the remote storage before training starts.
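To make the idea concrete, here is a rough sketch of the mirror and restore steps. It is not axolotl code: the function names (latest_checkpoint, mirror_checkpoint, restore_latest) are hypothetical, and a local directory stands in for the remote store so the example runs without cloud credentials. It assumes checkpoints are saved as checkpoint-<step> directories, as the Hugging Face Trainer does. A real version would replace the copy calls with uploads and downloads through an object-store client.

```python
import re
import shutil
from pathlib import Path
from typing import Optional

# Assumed layout: <output_dir>/checkpoint-<global_step>/
CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(root: Path) -> Optional[Path]:
    """Return the checkpoint-N directory with the highest N, or None."""
    if not root.is_dir():
        return None
    best, best_step = None, -1
    for child in root.iterdir():
        match = CHECKPOINT_RE.match(child.name)
        # Compare steps numerically so checkpoint-200 beats checkpoint-30.
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best, best_step = child, int(match.group(1))
    return best


def mirror_checkpoint(output_dir: Path, remote_dir: Path, step: int) -> Path:
    """Copy output_dir/checkpoint-<step> to the (stand-in) remote location.

    Hypothetical hook: this is what a save callback would call after each
    checkpoint is written to disk.
    """
    src = output_dir / f"checkpoint-{step}"
    dst = remote_dir / src.name
    # Stand-in for an upload to S3 or another object store.
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return dst


def restore_latest(remote_dir: Path, output_dir: Path) -> Optional[Path]:
    """Copy the newest remote checkpoint into output_dir before training.

    Returns the local path to resume from, or None if the remote is empty.
    """
    src = latest_checkpoint(remote_dir)
    if src is None:
        return None
    dst = output_dir / src.name
    # Stand-in for a download from the object store.
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return dst
```

On a fresh spot instance, the path returned by restore_latest is what would be handed to the existing resume logic; if it returns None, training starts from scratch.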
❓ Alternatives
No response
📝 Additional Context
No response
Acknowledgements
- [X] My issue title is concise, descriptive, and in title casing.
- [X] I have searched the existing issues to make sure this feature has not been requested yet.
- [X] I have provided enough information for the maintainers to understand and evaluate this request.
Hey, has anything like this been implemented? Best regards!
@MauritzJobMetis I know this is a year old, but here is the current solution:
You can set hub_model_id: your_org/your_model_name, and then on Hugging Face it's possible to connect to S3 by mounting the drive.