Supercharge Your Model Training

Results 263 composer issues

# What does this PR do? # What issue(s) does this change relate to? # Before submitting - [ ] Have you read the [contributor guidelines](https://github.com/mosaicml/composer/blob/dev/CONTRIBUTING.md)? - [ ] Is...

# What 1. Searched the repo and replaced all `object_store.download_object()` calls with the retry version to avoid potential download errors in the future. 2. Moved the `download_object_or_file` function to `file_helpers.py` to make 1...

This line seems to be the issue in `MemorySnapshot`: `remote_file_name = (self.remote_path_in_bucket + os.path.basename(f)).lstrip('/')` where the respective variables evaluate to e.g.

```
self.remote_path_in_bucket = '{run_name}/torch_memory_traces/rank{rank}.{batch}.memory_snapshot'
(self.remote_path_in_bucket + os.path.basename(f)).lstrip('/') = '{run_name}/torch_memory_traces/rank{rank}.{batch}.memory_snapshotrank0.4.memory_snapshot.pickle'
...
```
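The concatenation problem described in this report can be reproduced without Composer. The sketch below is only an illustration: the helper names `concatenated_remote_name` and `joined_remote_name` and the local path `/tmp/traces/...` are hypothetical, and the "joined" variant is one possible way to avoid the fused file name, not necessarily the fix adopted in the repository.

```python
import os
import posixpath


def concatenated_remote_name(remote_path_in_bucket: str, local_file: str) -> str:
    """Mirror the expression quoted in the report: plain string concatenation."""
    return (remote_path_in_bucket + os.path.basename(local_file)).lstrip('/')


def joined_remote_name(remote_path_in_bucket: str, local_file: str) -> str:
    """Hypothetical alternative: keep only the directory part of the template
    and join it with the local file's base name using '/' separators."""
    remote_dir = posixpath.dirname(remote_path_in_bucket)
    return posixpath.join(remote_dir, os.path.basename(local_file)).lstrip('/')


# Values taken from the report; the local directory is an assumption.
template = '{run_name}/torch_memory_traces/rank{rank}.{batch}.memory_snapshot'
local_file = '/tmp/traces/rank0.4.memory_snapshot.pickle'

print(concatenated_remote_name(template, local_file))
print(joined_remote_name(template, local_file))
```

The first call fuses the template's file-name portion with the local base name, which matches the value quoted in the report; the second keeps a single file name under `torch_memory_traces/`.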

bug

For my use case, I would like to augment the training data with features produced by the model itself. More specifically, my experiment is structured as follows: - Train the...

# What does this PR do? Updates the MLflow logger's `log_image` to use the new API with a time dimension. This enables viewing the images in MLflow. # What issue(s) does this...

# What does this PR do? # What issue(s) does this change relate to? # Before submitting - [ ] Have you read the [contributor guidelines](https://github.com/mosaicml/composer/blob/dev/CONTRIBUTING.md)? - [ ] Is...

Draft: testing multi-GPU CI

# What does this PR do? Adds an API for extracting optimizer state dict from a model and optimizer object. State dict generation is a necessary operation before the save...

Turns out it's an empty dict for nonzero ranks for unsharded state dicts, because for torch 2.1.2 we set the `FullStateDictConfig` `rank0_only` flag to `True`, and for torch >2.1.2, the `dcp.get_model_state_dict`...

# What does this PR do? Add torch distributed checkpointing monkeypatches to enable TE checkpointing for extra_state attribute. Patches the internal `torch.distributed.state_dict` functions:

```
state_dict._get_fqns = _get_fqns
state_dict._verify_options = _verify_options
...
```
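The assignment pattern in this description is ordinary attribute monkeypatching. The sketch below shows the general technique on a stand-in object rather than on PyTorch itself: the `state_dict` namespace, the replacement `_get_fqns`, and the `patched` helper are all hypothetical and do not reproduce the PR's actual patch bodies.

```python
import types
from contextlib import contextmanager

# Hypothetical stand-in for a module exposing private helpers.
# It is NOT the real PyTorch module; the bodies are placeholders.
state_dict = types.SimpleNamespace(
    _get_fqns=lambda name: {name},
    _verify_options=lambda options: options,
)


def _get_fqns(name):
    """Illustrative replacement: drop an assumed '._extra_state' suffix."""
    suffix = '._extra_state'
    if name.endswith(suffix):
        name = name[:-len(suffix)]
    return {name}


@contextmanager
def patched(module, **replacements):
    """Temporarily swap attributes on `module`, restoring them on exit."""
    originals = {attr: getattr(module, attr) for attr in replacements}
    try:
        for attr, fn in replacements.items():
            setattr(module, attr, fn)
        yield module
    finally:
        for attr, fn in originals.items():
            setattr(module, attr, fn)


original_get_fqns = state_dict._get_fqns
with patched(state_dict, _get_fqns=_get_fqns):
    inside = state_dict._get_fqns('layer._extra_state')
after = state_dict._get_fqns('layer._extra_state')
```

Scoping the patch with a context manager is a design choice made for this sketch; the description above assigns the replacements directly, which leaves them in place for the life of the process.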