Results: 26 issues by Michael Wyatt

With the recent release of torch 1.12, we saw all unit tests using the `@distributed_test` decorator break (see [this issue](https://github.com/pytorch/pytorch/issues/68256)). The problem involves changes in `torch.multiprocessing` and `torch.distributed` that prevent...

Integrating the optimizers from the [MuP](https://github.com/microsoft/mup) project.

This PR fixes the initial implementation of the LBANN-core interface (it had been broken at some point since its initial merge) and extends the interface to allow Conduit nodes to be...

I'm running into a CUDA OOM error when loading this model, due to the model's large size and the lack of multi-GPU support in the HF pipeline.

Allow users to pass a dictionary or a [transformers.PretrainedConfig](https://huggingface.co/docs/transformers/v4.19.2/en/main_classes/configuration#transformers.PretrainedConfig) when deploying models.
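A minimal sketch of how such a dual-input path could look. The function name `normalize_model_config` is a hypothetical illustration, not the actual deployment API; it assumes the deploy path ultimately wants a plain dict:

```python
# Sketch only: `normalize_model_config` is a hypothetical helper, not the
# actual deployment API. It accepts either a plain dict or a
# transformers.PretrainedConfig and returns a plain dict.
try:
    from transformers import PretrainedConfig
except ImportError:
    PretrainedConfig = None  # transformers is optional for this sketch

def normalize_model_config(config):
    """Accept a dict or a transformers.PretrainedConfig."""
    if isinstance(config, dict):
        return dict(config)
    if PretrainedConfig is not None and isinstance(config, PretrainedConfig):
        # PretrainedConfig provides to_dict() for serialization
        return config.to_dict()
    raise TypeError(f"Unsupported config type: {type(config).__name__}")
```

Accepting both forms lets callers pass a hand-written dict for quick experiments while still supporting configs loaded directly from the Hugging Face hub.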


Refactor DeepSpeed Config sub-configs (i.e., activation checkpointing, autotuning, comms, compression, monitor, nebula, and profiling) to use the pydantic library.
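A minimal sketch of what a pydantic-based sub-config could look like. The field names here are illustrative, not DeepSpeed's actual schema; the point is that pydantic validates types and defaults at load time:

```python
# Sketch only: field names are illustrative, not DeepSpeed's actual schema.
from pydantic import BaseModel, Field

class MonitorConfig(BaseModel):
    """Hypothetical monitor sub-config validated by pydantic."""
    enabled: bool = False
    output_path: str = Field("", description="Where monitor logs are written")

class ConfigSketch(BaseModel):
    """Top-level config composing typed sub-configs."""
    monitor: MonitorConfig = MonitorConfig()

# A nested dict is coerced into the typed sub-config; wrong types fail fast.
cfg = ConfigSketch(monitor={"enabled": True})
```

Compared with hand-rolled dict parsing, this gives each sub-config a single declarative source of truth for field names, types, and defaults.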

Continuing the refactor of distributed unit tests started in #2141 and #2180. Also includes a fix for the broken nightly test (lm-eval).

Discussion on #2379 has indicated that there are correctness issues when loading certain models from sharded checkpoints. Should be merged after #2662 @RezaYazdaniAminabadi

We are seeing random hangs, which we believe are caused by multiple pytest processes trying to compile the same code at once, creating a deadlock. This PR sets a separate...
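The snippet is truncated, so the PR's exact mechanism is not shown. A plausible sketch, assuming the fix gives each test process its own JIT extension build directory via `TORCH_EXTENSIONS_DIR` (the environment variable PyTorch's extension loader consults), so concurrent workers never compile into the same path:

```python
# Sketch only: assumes the deadlock is avoided by isolating each process's
# torch extension build directory. `isolate_extension_builds` is a
# hypothetical helper, not the code from the PR.
import os
import tempfile

def isolate_extension_builds():
    """Point this process's JIT-compiled op cache at a private directory."""
    build_dir = tempfile.mkdtemp(prefix=f"torch_ext_{os.getpid()}_")
    # PyTorch's cpp_extension loader reads TORCH_EXTENSIONS_DIR to decide
    # where to build and cache JIT-compiled ops.
    os.environ["TORCH_EXTENSIONS_DIR"] = build_dir
    return build_dir
```

With each worker building into its own directory, no two processes contend for the same build lock, at the cost of losing cross-process compilation caching.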