ColossalAI
Making large AI models cheaper, faster and more accessible
Use p2p_v2 to reconstruct pipeline_schedule and test it with `tests/test_data_pipeline_tensor_parallel/test_cifar_with_data_pipeline_tensor_v2.py`
### Describe the feature As a researcher developing auto parallelism, it would be great if we could use Colossal-AI to test the parallel schedules we design. Currently, Colossal-AI supports 1D,2D,2.5D,...
### 🐛 Describe the bug I used https://github.com/hpcaitech/ColossalAI-Examples to start the GPT2 training example, but running train_gpt.py failed. Can anyone help? The code root: `ColossalAI-Examples/language/gpt` My script:...
As far as I know, ZeRO2 splits gradients and PP accumulates gradients, so there's no real performance boost from using these two mechanisms together. ### related issues -...
### Describe the feature Application monitoring is essential for every production software system. Prometheus is an open-source monitoring system that was created in 2012 by SoundCloud. The Prometheus server collects...
### 🐛 Describe the bug With PyTorch 1.9, CUDA 11.1, and an NVIDIA A30 GPU, running `colossalai check -i` after a successful install shows the error below. ``` Traceback (most recent call last): File "/home/liuzixi01/.conda/envs/torch-cuda11/bin/colossalai", line 5,...
### 🐛 Describe the bug I ran BERT from Hugging Face with ZeRO, but got `RuntimeError: CUDA error: an illegal memory access was encountered`. I found that this problem seemed...
### Describe the feature Dear Colossal-AI Team, @FrankLeeeee @feifeibear. My name is Thomas Chaton. I am part of the PyTorch Lightning Team. First of all, congrats on your latest results, those...
### Describe the feature I found that ColoTensor lacks some basic functionalities. - [x] initialized in shard mode from a torch tensor. - [x] save and load in a distributed...
### Describe the feature EMA is used to train many models; an example can be found here: https://github.com/CompVis/latent-diffusion/blob/main/ldm/modules/ema.py But with model sharding it is unclear how to implement EMA...
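
For readers unfamiliar with the EMA (exponential moving average) mentioned in the last item, the core update can be sketched in a few lines. This is a minimal, framework-free illustration with a fixed decay, not the linked latent-diffusion implementation and not a Colossal-AI API; the class name `SimpleEMA` and the plain-dict parameter layout are assumptions made for the example.

```python
class SimpleEMA:
    """Hypothetical helper: keeps a smoothed ("shadow") copy of named parameters."""

    def __init__(self, params, decay=0.9):
        if not 0.0 <= decay < 1.0:
            raise ValueError("decay must be in [0, 1)")
        self.decay = decay
        # Start the shadow copy from the current parameter values.
        self.shadow = dict(params)

    def update(self, params):
        # shadow <- decay * shadow + (1 - decay) * current value, per parameter.
        for name, value in params.items():
            self.shadow[name] = self.decay * self.shadow[name] + (1.0 - self.decay) * value


# Usage: call update() after each optimizer step.
params = {"w": 1.0, "b": 0.0}
ema = SimpleEMA(params, decay=0.9)
params["w"] = 2.0  # pretend an optimizer step changed the weight
ema.update(params)
print(ema.shadow)
```

Because the update is elementwise, one plausible approach under sharding is for each process to keep a shadow copy of only the shard it owns; whether that fits a particular sharding scheme is exactly the open question the issue raises.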