Shenggui Li
The previous PR #1405 implemented the sharding spec. This PR implements distributed linear computation using the new sharding spec API.
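For readers unfamiliar with the idea, a column-parallel linear forward can be sketched as below. This is a generic illustration built on plain `torch.distributed` with made-up names (`column_parallel_linear`, `weight_shard`); it is not the sharding spec API that the PR actually uses.

```python
# Generic sketch of a column-parallel linear forward, NOT the ColossalAI
# sharding spec API. Assumes torch.distributed has already been initialized.
import torch
import torch.distributed as dist

def column_parallel_linear(x: torch.Tensor, weight_shard: torch.Tensor) -> torch.Tensor:
    """x: (batch, in_features); weight_shard: (out_features // world_size, in_features)."""
    partial = x @ weight_shard.t()                       # local partial output columns
    world_size = dist.get_world_size()
    gathered = [torch.empty_like(partial) for _ in range(world_size)]
    dist.all_gather(gathered, partial)                   # collect every rank's columns
    return torch.cat(gathered, dim=-1)                   # (batch, out_features)
```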
### 🐛 Describe the bug ZeRO keeps throwing overflow when used together with momentum SGD in the [resnet example](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/image/resnet). The code works fine with all AMP modes. ...
### Proposal In the current model zoo and examples, one model often has two different implementations, e.g. GPT and PipelineGPT. This is because some...
### Describe the feature Currently, Colossal-AI requires at least PyTorch 1.8, as this is the lowest version that provides the collective communication operations we rely on. However, PyTorch 1.8 does not support directly initializing...
### 🐛 Describe the bug When running the unit tests with torch 1.8, the unit tests for the moe module failed as shown below. The error occurs because the API of `torch.nn.Linear`...
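One plausible shape of this kind of incompatibility (an assumption here, since the report is truncated) is the `device`/`dtype` factory keyword arguments, which `torch.nn.Linear` only accepts from PyTorch 1.9 onward. The helper below is hypothetical and assumes a CUDA device is available.

```python
# Hypothetical illustration of a torch-1.8 API gap (an assumption, the report
# above is truncated): the device/dtype constructor kwargs arrived in 1.9.
import torch
import torch.nn as nn

def build_linear(in_features: int, out_features: int) -> nn.Linear:
    try:
        # Newer PyTorch accepts factory kwargs directly in the constructor.
        return nn.Linear(in_features, out_features, device="cuda", dtype=torch.float16)
    except TypeError:
        # Fallback for older versions such as 1.8: construct, then move/cast.
        return nn.Linear(in_features, out_features).to(device="cuda", dtype=torch.float16)
```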
The current LAMB optimizer implementation does not support tensor parallelism, as it needs to compute the norm of the whole matrix. It is not compatible with tensor parallelism because the tensor...
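To see why the whole-matrix norm is the sticking point: with tensor parallelism each rank only holds a shard of the weight, so a locally computed norm is wrong unless the partial sums of squares are combined across the group. Below is a minimal sketch of that combination, assuming an initialized `torch.distributed` process group; it is not the ColossalAI implementation itself.

```python
# Sketch: tensor-parallel-aware Frobenius norm. Each rank sums the squares of
# its own shard, the partial sums are all-reduced, then the square root gives
# the norm of the full, unsharded matrix.
from typing import Optional

import torch
import torch.distributed as dist

def sharded_frobenius_norm(local_shard: torch.Tensor,
                           tp_group: Optional[dist.ProcessGroup] = None) -> torch.Tensor:
    sq_sum = local_shard.pow(2).sum()                         # local sum of squares
    dist.all_reduce(sq_sum, op=dist.ReduceOp.SUM, group=tp_group)  # combine across shards
    return sq_sum.sqrt()                                      # norm of the whole matrix
```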
Fixed the training script so that `len(dataloader)` works correctly. The script is also updated to the new ZeRO API.
### Describe the feature In most examples, there are two files, one training with the engine and one with the trainer. The code in these two files is highly redundant and we should just...
We need to provide an example of doing inference; this should be synced with the documentation as well.
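As a placeholder until the real example lands, a minimal plain-PyTorch inference sketch is shown below; the eventual Colossal-AI example would add its own launch and configuration steps.

```python
# Minimal inference sketch in plain PyTorch (not the Colossal-AI example).
import torch

def run_inference(model: torch.nn.Module, batch: torch.Tensor) -> torch.Tensor:
    model.eval()            # disable dropout, use running batch-norm statistics
    with torch.no_grad():   # no autograd graph needed at inference time
        return model(batch)
```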
### Describe the feature In the current Colossal-AI implementation, we build Colossal-AI in two ways: 1. building ahead of time when running `CUDA_EXT=1 pip install colossalai` 2. building the CUDA kernels when importing...
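For the second mode, a rough sketch of JIT-compiling a kernel at import time with PyTorch's generic extension loader is shown below; the source file names are hypothetical and are not the actual Colossal-AI kernel sources.

```python
# Sketch of import-time (JIT) kernel compilation using PyTorch's generic
# loader. The sources listed here are placeholders, not Colossal-AI's kernels.
from torch.utils.cpp_extension import load

# Compiles the extension the first time it is imported and caches the result.
fused_kernels = load(
    name="fused_kernels",
    sources=["csrc/fused_kernels.cpp", "csrc/fused_kernels_cuda.cu"],
    verbose=True,
)
```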