ColossalAI-Examples
Examples of training models with hybrid parallelism using ColossalAI
### 🐛 Describe the bug Running `examples/language/opt/run_clm.py` reproduces the error: the program crashes without any error message. After I change `placement_policy` to 'cuda', it runs fine. ```...
Fixed the training script so that `len(dataloader)` works correctly. The script is also updated to use the new ZeRO API.
### 🐛 Describe the bug I used the 0.1.7 image provided on Docker Hub, but running the GPT example fails with RuntimeError: Could not find 'SLURM_PROCID'. The same happens with the 0.1.8 image. This is my run script: Switching my gpt2_configs configuration to other configurations gives the same problem. ### Environment docker pull hpcaitech/colossalai:0.1.7 & 0.1.8 pip install transformers pip install titans 8 A100 GPUs
### 🐛 Describe the bug Hi, I'm training BERT using sequence parallelism in Colossal-AI, following this [link](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/language/bert/sequene_parallel). But my training loss is too large, and it seems the...
### 🐛 Describe the bug Hi, I'm running sequence_parallel for BERT pre-training, but I got this problem. What can I do to solve it? Thanks! ### Environment CUDA...
### 🐛 Describe the bug When I run a [vit experiment](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/image/vision_transformer/hybrid_parallel) with the following command ``` node=76 prefix="srun --nodes=1 --gres=gpu:4 --cpus-per-task=4 --ntasks=1 -w SG-IDC1-10-51-2-$node" $prefix colossalai run --nproc_per_node 4 train_with_cifar10.py...
Colossal-AI implementation of MAE, [arxiv](https://arxiv.org/abs/2111.06377). As an example, we cover only the pretraining phase with the ImageNet 1000 mini dataset. Helpers under the subdirectory [util/](./util/) are from [facebookresearch/deit](https://github.com/facebookresearch/deit), under Apache License 2.0....
### 🐛 Describe the bug `models.shufflenet_v2_x1_0` can be trained with `BATCH_SIZE = 16384`, but this does not run successfully with ColossalAI. The details are below: ```bash (conda-general) user@user:~/research/Experiments/ColossalAI-Examples/image/resnet$ colossalai run --nproc_per_node...