Results: 26 issues by Michael Wyatt

With the recent release of torch 1.12, we saw all unit tests using the `@distributed_test` decorator break (see [this issue](https://github.com/pytorch/pytorch/issues/68256)). The problem involves changes in `torch.multiprocessing` and `torch.distributed` that prevent...

Integrating the optimizers from the [MuP](https://github.com/microsoft/mup) project.

This PR fixes the initial implementation of the LBANN-core interface (it had been broken at some point since its initial merge) and extends the interface to allow Conduit nodes to be...

I'm running into a CUDA OOM error when loading this model, due to the model's large size and the lack of multi-GPU support in the HF pipeline.

Allow users to pass a dictionary or a [transformers.PretrainedConfig](https://huggingface.co/docs/transformers/v4.19.2/en/main_classes/configuration#transformers.PretrainedConfig) when deploying models.
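A minimal sketch of how such a dual-input path could look. The function name `normalize_model_config` is a hypothetical illustration, not the actual deployment API; it assumes the deploy path ultimately wants a plain dict:

```python
# Sketch only: `normalize_model_config` is a hypothetical helper, not the
# actual deployment API. It accepts either a plain dict or a
# transformers.PretrainedConfig and returns a plain dict.
try:
    from transformers import PretrainedConfig
except ImportError:
    PretrainedConfig = None  # transformers is optional for this sketch

def normalize_model_config(config):
    """Accept a dict or a transformers.PretrainedConfig."""
    if isinstance(config, dict):
        return dict(config)
    if PretrainedConfig is not None and isinstance(config, PretrainedConfig):
        # PretrainedConfig provides to_dict() for serialization
        return config.to_dict()
    raise TypeError(f"Unsupported config type: {type(config).__name__}")
```

Accepting both forms lets callers pass a hand-written dict for quick experiments while still supporting configs loaded directly from the Hugging Face hub.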


Refactor DeepSpeed Config sub-configs (i.e., activation checkpointing, autotuning, comms, compression, monitor, nebula, and profiling) to use the pydantic library.
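A minimal sketch of what a pydantic-based sub-config could look like. The field names here are illustrative, not DeepSpeed's actual schema; the point is that pydantic validates types and defaults at load time:

```python
# Sketch only: field names are illustrative, not DeepSpeed's actual schema.
from pydantic import BaseModel, Field

class MonitorConfig(BaseModel):
    """Hypothetical monitor sub-config validated by pydantic."""
    enabled: bool = False
    output_path: str = Field("", description="Where monitor logs are written")

class ConfigSketch(BaseModel):
    """Top-level config composing typed sub-configs."""
    monitor: MonitorConfig = MonitorConfig()

# A nested dict is coerced into the typed sub-config; wrong types fail fast.
cfg = ConfigSketch(monitor={"enabled": True})
```

Compared with hand-rolled dict parsing, this gives each sub-config a single declarative source of truth for field names, types, and defaults.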

Continuing the refactor of distributed unit tests started in #2141 and #2180. Also includes a fix for the broken nightly test (lm-eval).

Discussion on #2379 has indicated that there are correctness issues when loading certain models from sharded checkpoints. Should be merged after #2662 @RezaYazdaniAminabadi

We are seeing random hangs, which we believe are caused by multiple pytest processes trying to compile the same code at once, creating a deadlock. This PR sets a separate...
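The snippet is truncated, so the PR's exact mechanism is not shown. A plausible sketch, assuming the fix gives each test process its own JIT extension build directory via `TORCH_EXTENSIONS_DIR` (the environment variable PyTorch's extension loader consults), so concurrent workers never compile into the same path:

```python
# Sketch only: assumes the deadlock is avoided by isolating each process's
# torch extension build directory. `isolate_extension_builds` is a
# hypothetical helper, not the code from the PR.
import os
import tempfile

def isolate_extension_builds():
    """Point this process's JIT-compiled op cache at a private directory."""
    build_dir = tempfile.mkdtemp(prefix=f"torch_ext_{os.getpid()}_")
    # PyTorch's cpp_extension loader reads TORCH_EXTENSIONS_DIR to decide
    # where to build and cache JIT-compiled ops.
    os.environ["TORCH_EXTENSIONS_DIR"] = build_dir
    return build_dir
```

With each worker building into its own directory, no two processes contend for the same build lock, at the cost of losing cross-process compilation caching.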