composer
Supercharge Your Model Training
# What does this PR do? Add ARM support to docker images
# What does this PR do? Adds metadata logging to the MosaicML logger for when a checkpoint upload starts and when a checkpoint is done loading. The information logged is the timestamp and...
# What does this PR do? This PR uses `_record_memory_history_impl` instead of the previous `_record_memory_history_legacy` to capture the memory snapshot. See https://github.com/pytorch/pytorch/blob/main/torch/cuda/memory.py#L698-L738. With `enabled="all"`, this captures all (C++ and Python) alloc/free...
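For readers unfamiliar with the memory-history API mentioned in this PR, here is a minimal sketch of recording allocator history with `enabled="all"` and dumping a snapshot. It is not the PR's code: the helper name `capture_memory_snapshot`, the `max_entries` default, and the output path are illustrative assumptions. It also assumes a recent PyTorch release, where the private functions `torch.cuda.memory._record_memory_history` and `torch.cuda.memory._dump_snapshot` accept these arguments; being private, they may change.

```python
try:
    import torch
except ImportError:  # PyTorch is optional for this sketch
    torch = None


def capture_memory_snapshot(path: str, max_entries: int = 100_000) -> bool:
    """Record CUDA allocator history and dump a snapshot to `path`.

    Hypothetical helper for illustration. Returns False and does nothing
    when PyTorch or a CUDA device is unavailable.
    """
    if torch is None or not torch.cuda.is_available():
        return False

    # Private PyTorch API; the signature may differ between releases.
    torch.cuda.memory._record_memory_history(enabled="all", max_entries=max_entries)
    try:
        # Any CUDA work done here is recorded; a single allocation as a stand-in.
        x = torch.empty(1024, 1024, device="cuda")
        del x
        torch.cuda.memory._dump_snapshot(path)
    finally:
        # Passing enabled=None turns recording back off.
        torch.cuda.memory._record_memory_history(enabled=None)
    return True
```

Calling `capture_memory_snapshot("snapshot.pickle")` on a CUDA machine should leave a pickle file that can be inspected with PyTorch's memory visualizer; on a machine without CUDA it simply returns `False`.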
## 🚀 Feature Request The Trainer instantiates a `CheckpointSaver` when the `save_folder` is specified (see [this line](https://github.com/mosaicml/composer/blob/57c7b72b9df41b0c9777bad1c2bec17f3103c31f/composer/trainer/trainer.py#L1366)). The creation of this object should be optional in case there is already...
# What does this PR do? # What issue(s) does this change relate to? # Before submitting - [ ] Have you read the [contributor guidelines](https://github.com/mosaicml/composer/blob/dev/CONTRIBUTING.md)? - [ ] Is...
# What does this PR do? State dict generation in composer is currently coupled with the State and is not very readable, hard to extend, hard to test, and hard for...
# What does this PR do? * Adding support for ARM images within the Action and dynamic package builds within the Dockerfile * Using a larger GitHub-hosted runner to: 1. Add more...
Hi @mvpatel2000, this is a question about the shuffling of the dataset. I was training on a cloud GPU setup, and after 40% of training I got an OOM...
# What does this PR do? Fix the symlink issues. **How?** [updated]: in the checkpoint saver, rank 0, which saves the symlink, all_gathers the remote checkpoint file names, and...