composer issues

Add benchmark option to runtime estimator

# What does this PR do? # What issue(s) does this change relate to? # Before submitting - [ ] Have you read the [contributor guidelines](https://github.com/mosaicml/composer/blob/dev/CONTRIBUTING.md)? - [ ] Is...

aspfohl

[Not for merge] refactor the object_store.download_object retry and add more retry

2

# What 1. Searched the repo and replaced all `object_store.download_object()` calls with the retry version to avoid potential download errors in the future. 2. Moved the `download_object_or_file` function to `file_helpers.py` to make 1...

bigning

Remote file name in `MemorySnapshot` not being formatted

1

This line seems to be the issue in `MemorySnapshot`: `remote_file_name = (self.remote_path_in_bucket + os.path.basename(f)).lstrip('/')` where the respective variables evaluate to e.g. ``` self.remote_path_in_bucket = '{run_name}/torch_memory_traces/rank{rank}.{batch}.memory_snapshot' (self.remote_path_in_bucket + os.path.basename(f)).lstrip('/') = '{run_name}/torch_memory_traces/rank{rank}.{batch}.memory_snapshotrank0.4.memory_snapshot.pickle'...

AleksanderWWW

bug
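The behavior reported in the `MemorySnapshot` issue above can be reproduced with plain Python, without importing Composer. The sketch below uses the template and file name quoted in the issue; the local directory, the `run_name` value, and the `formatted_remote_name` helper are hypothetical and only illustrate one possible direction for a fix, not the change the maintainers actually made.

```python
import os

# Template and file name as quoted in the issue; the local directory is a made-up example.
remote_path_in_bucket = '{run_name}/torch_memory_traces/rank{rank}.{batch}.memory_snapshot'
local_file = '/tmp/traces/rank0.4.memory_snapshot.pickle'


def buggy_remote_name(template: str, f: str) -> str:
    # Mirrors the line quoted in the issue: the unformatted template is
    # concatenated with the local base name, so the placeholders are never filled.
    return (template + os.path.basename(f)).lstrip('/')


def formatted_remote_name(template: str, f: str, run_name: str, rank: int, batch: int) -> str:
    # Hypothetical fix: fill the placeholders first, then reuse only the
    # local file's extension instead of its whole base name.
    base = template.format(run_name=run_name, rank=rank, batch=batch)
    ext = os.path.splitext(f)[1]
    return (base + ext).lstrip('/')


print(buggy_remote_name(remote_path_in_bucket, local_file))
print(formatted_remote_name(remote_path_in_bucket, local_file, run_name='my-run', rank=0, batch=4))
```

The first call prints the unformatted name from the report, with the placeholders still in braces and the base name appended. The second shows what a formatted remote name could look like under the assumptions stated here.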

Augment training batches with "on-the-fly" features

For my use case, I would like to augment the training data with features produced by the model itself. More specifically, my experiment is structured as follows: - Train the...

Riccorl

log image fix

# What does this PR do? Updates mlflow logger `log_image` to use the new API with time-dimension. This will enable viewing the images in MLflow # What issue(s) does this...

jessechancy

Add torch 2.4.0 nightly

# What does this PR do? # What issue(s) does this change relate to? # Before submitting - [ ] Have you read the [contributor guidelines](https://github.com/mosaicml/composer/blob/dev/CONTRIBUTING.md)? - [ ] Is...

j316chuck

Multi gpu ci test

Draft for testing multi-GPU CI.

KuuCi

[ckpt-rewr] Get Optim State Dict Util API

# What does this PR do? Adds an API for extracting optimizer state dict from a model and optimizer object. State dict generation is a necessary operation before the save...

eracah

Update docstring for get_model_state_dict

Turns out it's empty dict for nonzero ranks for unsharded state dicts because for torch 2.1.2 we set the `FullStateDictConfig` `rank0_only` flag to `True` and for torch >2.1.2, the `dcp.get_model_state_dict`...

eracah

Add torch distributed checkpointing monkeypatches to enable TE checkpointing for extra_state attribute

# What does this PR do? Add torch distributed checkpointing monkeypatches to enable TE checkpointing for extra_state attribute. Patches the internal `torch.distributed.state_dict` functions: ``` state_dict._get_fqns = _get_fqns state_dict._verify_options = _verify_options...

j316chuck
