Michael Wyatt
With the recent release of torch 1.12, we saw all unit tests using the `@distributed_test` decorator break (see [this issue](https://github.com/pytorch/pytorch/issues/68256)). The problem involves changes in `torch.multiprocessing` and `torch.distributed` that prevent...
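For reference, an illustrative sketch of how a test uses the affected decorator; the import path reflects where DeepSpeed's test suite defined it at the time, and the test body is a toy example, not one of the broken tests:

```python
# Toy usage of the @distributed_test decorator; the decorator spawns
# `world_size` ranks and runs the test body on each of them.
import torch
import torch.distributed as dist
from tests.unit.common import distributed_test  # location at the time

@distributed_test(world_size=2)
def test_all_reduce_sums_ranks():
    # Each of the 2 spawned ranks contributes its rank id; the
    # all-reduced value should equal 0 + 1 = 1.
    t = torch.tensor([float(dist.get_rank())])
    dist.all_reduce(t)
    assert t.item() == sum(range(dist.get_world_size()))
```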
Integrating the optimizers from the [MuP](https://github.com/microsoft/mup) project.
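For context, a minimal sketch of wiring in the MuP optimizers, following the mup project's README; the toy MLP and its widths are illustrative, not part of this integration:

```python
import torch.nn as nn
from mup import MuAdam, MuReadout, set_base_shapes

def make_mlp(width):
    # Hidden width scales; input (64) and output (10) dims stay fixed.
    return nn.Sequential(
        nn.Linear(64, width),
        nn.ReLU(),
        MuReadout(width, 10),  # mup-aware replacement for the output Linear
    )

target = make_mlp(width=1024)  # the model actually being trained
base = make_mlp(width=64)      # narrow base model that defines the scaling dims
set_base_shapes(target, base)  # records which dims are "infinite" under muP

# MuAdam applies muP's per-parameter learning-rate scaling on top of Adam.
optimizer = MuAdam(target.parameters(), lr=1e-3)
```

Per the mup README, parameters should also be re-initialized after `set_base_shapes` so the muP initialization scaling takes effect.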
This PR fixes the initial implementation of the LBANN-core interface (it had been broken at some point since its initial merge) and extends the interface to allow Conduit nodes to be...
I'm running into a CUDA OOM error when loading this model, due to its large size and the lack of multi-GPU support in the HF pipeline.
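For context, a hedged sketch of one way to shard a large HF model across GPUs with DeepSpeed-Inference rather than the single-GPU HF pipeline; the model name is a placeholder, and the arguments (e.g. `mp_size`) match the DeepSpeed API of that era:

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the model hitting OOM here is far larger
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Shard the weights across all visible GPUs via tensor parallelism.
engine = deepspeed.init_inference(
    model,
    mp_size=torch.cuda.device_count(),
    dtype=torch.half,
    replace_with_kernel_inject=True,
)
model = engine.module

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```

Launched with `deepspeed --num_gpus <N> script.py`, each rank then holds only a 1/N slice of the weights.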
Allow users to pass a dictionary or [transformers.PretrainedConfig](https://huggingface.co/docs/transformers/v4.19.2/en/main_classes/configuration#transformers.PretrainedConfig) when deploying models.
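A minimal sketch of accepting either form at deployment time; `normalize_config` is an illustrative helper name, not this PR's actual API:

```python
from typing import Union
from transformers import AutoConfig, PretrainedConfig

def normalize_config(config: Union[dict, PretrainedConfig]) -> PretrainedConfig:
    """Coerce a user-supplied dict into a PretrainedConfig."""
    if isinstance(config, PretrainedConfig):
        return config
    return PretrainedConfig.from_dict(config)

# Both of these would then be accepted at deploy time:
cfg_from_dict = normalize_config({"hidden_size": 768, "num_attention_heads": 12})
cfg_from_obj = normalize_config(AutoConfig.from_pretrained("gpt2"))
```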
Refactor DeepSpeed Config sub-configs (i.e., activation checkpointing, autotuning, comms, compression, monitor, nebula, and profiling) to use the pydantic library.
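A minimal sketch of the pattern such a refactor moves toward, written against pydantic v1's validator API; the sub-config and field names below are illustrative, not the exact DeepSpeed schema:

```python
from pydantic import BaseModel, Field, validator

class MonitorConfig(BaseModel):
    """Validated sub-config replacing hand-rolled dict parsing."""
    enabled: bool = False
    output_path: str = ""
    job_name: str = Field("DeepSpeedJob", description="Label attached to logged runs")

    @validator("output_path")
    def path_requires_enabled(cls, v, values):
        # Cross-field check: a path only makes sense if the monitor is on.
        if v and not values.get("enabled"):
            raise ValueError("output_path set but the monitor is not enabled")
        return v

    class Config:
        extra = "forbid"  # reject unknown keys instead of silently ignoring them

# Typos and type errors in a user's JSON now fail loudly at load time:
cfg = MonitorConfig(**{"enabled": True, "output_path": "/tmp/ds_logs"})
```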
Continuing the refactor of distributed unit tests started in #2141 and #2180. Also includes a fix for the broken nightly test (lm-eval).
Discussion on #2379 has indicated that there are correctness issues when loading certain models from sharded checkpoints. Should be merged after #2662 @RezaYazdaniAminabadi
We are seeing random hangs, which we believe are caused by multiple pytest processes trying to compile the same code at once, creating a deadlock. This PR sets a separate...
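A hedged sketch of the kind of fix described, assuming the mechanism is `TORCH_EXTENSIONS_DIR` (which `torch.utils.cpp_extension` honors when choosing where to build JIT-compiled ops); the fixture name and env-var handling are illustrative, not necessarily this PR's implementation:

```python
import os
import pytest

@pytest.fixture(scope="session", autouse=True)
def isolated_torch_extensions_dir(tmp_path_factory):
    # Give each pytest process its own build/cache directory so
    # concurrent JIT compiles of the same op never contend on one
    # shared directory (and its lock file).
    ext_dir = tmp_path_factory.mktemp("torch_extensions")
    os.environ["TORCH_EXTENSIONS_DIR"] = str(ext_dir)
    yield
    os.environ.pop("TORCH_EXTENSIONS_DIR", None)
```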