ColossalAI
Making large AI models cheaper, faster and more accessible
### What's new? After a discussion with my collaborator @Cypher30, I figured out my misinterpretation of algorithm 3 in https://arxiv.org/abs/1604.06174. This PR can be viewed as an amendment to this...
The previous PR #1405 implemented the sharding spec. This PR implements the linear distributed computation using the new sharding spec API.
### 🐛 Describe the bug I am using the environment below to install ColossalAI, and it tells me that I have successfully installed it. However, when I tried to...
### Describe the feature I have tested this framework, and the memory usage is reduced, but the speed has not been boosted. Can you add code examples to the README?
### 🐛 Describe the bug I suppose [`ParallelFreqAwareEmbeddingBag._weight`](https://github.com/hpcaitech/ColossalAI/blob/3b26516c69a37ec9731e8dd0245436fd4c03120f/colossalai/nn/parallel/layers/cache_embedding/parallel_freq_aware_embedding.py#L63) should be a ColoTensor, otherwise it would be moved to GPU when some Model containing this embedding uses .cuda(). That's why I...
As our future automatic parallelization might need to offload the checkpoint input to save memory, I 1. Replace the original torch checkpoint function with colossal.utils.checkpoint, which has the inference for...
In PR #1418, I added ShapeConsistencyManager to colossalai to support auto-parallel strategy search and applying sharding specs at runtime. This PR completes the part that supports auto-parallel strategy search, mainly...
### 🐛 Describe the bug I ran into an error when training the MoE example (https://github.com/hpcaitech/ColossalAI-Examples/tree/5b23e8cf22cf029b9ac77c2ed92bbc339e7fbd4e/image/moe). Each time, upon finishing the last iteration, it threw the following errors while CUDA shutting...
### 🐛 Describe the bug When parallel is set to pipeline=4 and tensor=dict(mode='2d', size=4), the program will get stuck on initialization and no error message will be output. ### Environment...
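For reference, the setting quoted in this report can be written as a Python config dict along the following lines. This is only a minimal sketch: the `parallel` layout mirrors the values quoted above, while `required_processes` is a hypothetical helper (not a ColossalAI API) that assumes the pipeline size multiplies with the tensor-parallel size and that `2d` mode expects a perfect-square size.

```python
import math

# Parallel settings as quoted in the report above.
parallel = dict(
    pipeline=4,
    tensor=dict(mode='2d', size=4),
)


def required_processes(parallel_cfg):
    """Hypothetical helper, not part of ColossalAI.

    Returns the minimum number of processes implied by the config,
    assuming pipeline stages multiply with the tensor-parallel size.
    """
    pp = parallel_cfg.get('pipeline', 1)
    tensor = parallel_cfg.get('tensor', dict(mode=None, size=1))
    tp = tensor['size']
    if tensor['mode'] == '2d':
        # Assumption: 2D tensor parallelism arranges devices in a q x q grid.
        q = math.isqrt(tp)
        if q * q != tp:
            raise ValueError(f"2d mode expects a perfect-square size, got {tp}")
    return pp * tp


min_world_size = required_processes(parallel)
print(min_world_size)
```

Under these assumptions the setting needs at least 16 processes, so comparing that number with the launched world size may be a useful first check when initialization hangs without an error message.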
### 🐛 Describe the bug The configuration is taken from the example, with only the data source changed. Configurations from 1d to pp work properly, but an error is reported...