Tim Moon
# What does this PR do ?

Adds support for training GPT-3 with the [Apex implementation of the ZeRO optimizer](https://github.com/NVIDIA/apex/blob/master/apex/contrib/optimizers/distributed_fused_adam.py).

**Collection**: NLP

# Changelog

- Add option for `distributed_fused_adam` optimizer...
The current issue is mitigated by https://github.com/LLNL/lbann/pull/2073. It now takes active effort to create unused bias weights in the convolution and fully-connected layers. This issue is a record in case...
Our current distconv support for deconvolution is limited to 2x2 deconvolution with stride 2. Fortunately, we already have implementations for 3x3 deconvolution: just swap the forward and backward steps from...
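As a point of reference for the forward/backward swap, here is a minimal 1D sketch in plain Python. It is not distconv or LBANN code; the function names, the 3-tap kernel, and the stride are all hypothetical and chosen only for illustration. It relies on the standard relationship that a deconvolution (transposed convolution) forward pass performs the same arithmetic as the backward-data pass of a convolution with the same kernel and stride.

```python
def conv1d_forward(x, w, stride):
    """Strided 1D convolution (cross-correlation) without padding."""
    out_len = (len(x) - len(w)) // stride + 1
    return [
        sum(x[i * stride + k] * w[k] for k in range(len(w)))
        for i in range(out_len)
    ]


def conv1d_backward_data(dy, w, stride, in_len):
    """Gradient of conv1d_forward with respect to its input."""
    dx = [0] * in_len
    for i, grad in enumerate(dy):
        for k, wk in enumerate(w):
            dx[i * stride + k] += grad * wk
    return dx


def deconv1d_forward(z, w, stride):
    """Deconvolution forward pass, reusing the convolution backward-data step."""
    out_len = (len(z) - 1) * stride + len(w)
    return conv1d_backward_data(z, w, stride, out_len)


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


# Illustrative data: 3-tap kernel, stride 2.
x = [1, 2, 3, 4, 5, 6, 7]
w = [1, 0, -1]
z = [1, 2, 3]

y = conv1d_forward(x, w, stride=2)
up = deconv1d_forward(z, w, stride=2)

# Adjoint check: <conv(x), z> equals <x, deconv(z)>.
lhs = dot(y, z)
rhs = dot(x, up)
```

The adjoint check at the end is one way to test such a swap: if the deconvolution forward pass really is the convolution's backward-data pass, the two inner products agree for any `x` and `z` of matching sizes.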
The ONNX conversion scripts are very old and broken. We have had a user run into problems when trying to use them, so I think it would be safer to...
I encounter an error when I change the mini-batch size in one of the Bamboo unit tests, e.g.: https://github.com/LLNL/lbann/blob/9a9e31cb33fd5460ad6da335ff647aad79088049/bamboo/unit_tests/test_unit_layer_identity.py#L45 If the mini-batch size is less than the number of processes,...
In NCHW tensor notation, the last dimension is the contiguous dimension. For column-major matrix notation, the first dimension is the contiguous dimension. We haven't needed to think that much about...
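The two conventions can be compared by computing element strides directly. This is a minimal sketch in plain Python, not LBANN code; the helper names and the example shape are hypothetical. It shows that an NCHW buffer, read as a column-major matrix with one flattened sample per column, addresses exactly the same memory.

```python
def row_major_strides(shape):
    """Element strides when the LAST dimension is contiguous (e.g. NCHW)."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return tuple(strides)


def column_major_strides(shape):
    """Element strides when the FIRST dimension is contiguous."""
    strides = [1] * len(shape)
    for i in range(1, len(shape)):
        strides[i] = strides[i - 1] * shape[i - 1]
    return tuple(strides)


def offset(index, strides):
    """Linear offset of a multi-dimensional index into the flat buffer."""
    return sum(i * s for i, s in zip(index, strides))


# Illustrative shape: 2 samples, 3 channels, 4 x 5 spatial dimensions.
n, c, h, w = 2, 3, 4, 5

nchw_strides = row_major_strides((n, c, h, w))
# Column-major matrix with (c*h*w) rows and one column per sample.
matrix_strides = column_major_strides((c * h * w, n))
# Column-major tensor with the dimension order reversed to (W, H, C, N).
whcn_strides = column_major_strides((w, h, c, n))

# The same element addressed through both views.
tensor_offset = offset((1, 2, 3, 4), nchw_strides)
row = 2 * (h * w) + 3 * w + 4  # flattened (c, h, w) position within a sample
matrix_offset = offset((row, 1), matrix_strides)
```

Both offsets land on the same element, and `whcn_strides` is `nchw_strides` reversed, so switching between the two notations amounts to reversing the dimension order rather than moving data.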
Our CI tests work well when we run with a single build per system, but I suspect things will go poorly if we try testing multiple compilers or multiple build...
# What does this PR do ?

Generalize distributed Adam support for GPT-3 to T5 and other Megatron-LM models. It also implements several performance optimizations.

**Collection**: NLP

# Changelog

-...
I've gotten incorrect results using distributed Adam to train GPT-3 at FP16 because of a bug with gradient clipping and gradient scaling. In particular, there's an incorrect assumption that gradient...
# What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

**Collection**: [Note which collection this PR will affect]

# Changelog

-...