ColossalAI

Making large AI models cheaper, faster and more accessible

Results 1091 ColossalAI issues

Use p2p_v2 to reconstruct pipeline_schedule and test it with `tests/test_data_pipeline_tensor_parallel/test_cifar_with_data_pipeline_tensor_v2.py`

Run Build and Test

### Describe the feature As a researcher developing auto parallelism, it would be great if we could use Colossal-AI to test the parallel schedules we design. Currently, Colossal-AI supports 1D, 2D, 2.5D,...

enhancement

### 🐛 Describe the bug I used https://github.com/hpcaitech/ColossalAI-Examples to start the GPT2 training example, but running train_gpt.py FAILED. Can anyone help? The code root: `ColossalAI-Examples/language/gpt` My script:...

bug

At least as far as I know, ZeRO2 splits gradients and PP accumulates gradients, so there's no real performance boost for these two mechanisms working together. ### related issues -...

documentation

### Describe the feature Application monitoring is essential for every production software system. Prometheus is an open-source monitoring system that was created in 2012 by SoundCloud. The Prometheus server collects...

### 🐛 Describe the bug Environment: PyTorch 1.9 + CUDA 11.1 + NVIDIA A30 GPU. After installing successfully, running `colossalai check -i` shows the error below. ``` Traceback (most recent call last): File "/home/liuzixi01/.conda/envs/torch-cuda11/bin/colossalai", line 5,...

bug

### 🐛 Describe the bug I ran BERT from Hugging Face with ZeRO, but got `RuntimeError: CUDA error: an illegal memory access was encountered`. I found that this problem seemed...

bug

### Describe the feature Dear Colossal-AI Team, @FrankLeeeee @feifeibear. My name is Thomas Chaton. I am part of the PyTorch Lightning Team. First of all, congrats on your latest results, those...

enhancement

### Describe the feature I found that ColoTensor lacks some basic functionalities. - [x] initialized in shard mode from a torch tensor. - [x] save and load in a distributed...

enhancement

### Describe the feature EMA is used to train many models; an example can be found here: https://github.com/CompVis/latent-diffusion/blob/main/ldm/modules/ema.py But with model sharding it is unclear how to implement EMA...

enhancement