ColossalAI-Examples

Examples of training models with hybrid parallelism using ColossalAI

Results 37 ColossalAI-Examples issues

### 🐛 Describe the bug Running `examples/language/opt/run_clm.py` reproduces the error: the program crashes with no error information. After I changed `placement_policy` to 'cuda', it runs fine. ```...

Fixed the training script so that `len(dataloader)` works correctly. The script is also updated to use the new ZeRO API.

### 🐛 Describe the bug I used the provided Docker Hub image 0.1.7, but running the GPT example fails with RuntimeError: Could not find 'SLURM_PROCID'. The same error occurs with the 0.1.8 image. ![M4QKMAI7`6Q~U9`52 KAY5Y](https://user-images.githubusercontent.com/65949265/180979456-7e4453c0-605c-4825-89c7-073f81612a29.png) ![T4GKG9P$KSS$XIGXL7{EVAM](https://user-images.githubusercontent.com/65949265/180979558-9496c724-c290-41d8-8e60-151ea134ee32.png) This is my run script: ![260CY7X5}DOF1363S{4PJ`1](https://user-images.githubusercontent.com/65949265/180979681-50ffae98-917d-4ec1-9ad8-971e9dc6b334.png) Switching my gpt2_configs to other configurations produces the same problem. ### Environment docker pull hpcaitech/colossalai:0.1.7 & 0.1.8 pip install transformers pip install titans 8 A100 GPUs
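An error naming `SLURM_PROCID` suggests the script looked for a SLURM-provided environment variable that was not set in the container. The following is a minimal diagnostic sketch, not part of the original report or of ColossalAI. The helper name `detect_launcher` is hypothetical, and the sketch assumes only that `srun` sets `SLURM_PROCID` for each task and that `torchrun` sets `RANK`, `LOCAL_RANK`, `WORLD_SIZE`, `MASTER_ADDR` and `MASTER_PORT` for each worker.

```python
import os

# Assumption: srun exports SLURM_PROCID for each task it starts.
SLURM_VARS = ("SLURM_PROCID",)
# Assumption: torchrun exports these variables for each worker process.
TORCH_VARS = ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")


def detect_launcher(env=None):
    """Hypothetical helper: report which launcher's variables are present.

    Returns "slurm", "torch", or "none".
    """
    env = os.environ if env is None else env
    if all(name in env for name in SLURM_VARS):
        return "slurm"
    if all(name in env for name in TORCH_VARS):
        return "torch"
    return "none"


if __name__ == "__main__":
    # Print the detected launcher before the training script initializes.
    print("detected launcher:", detect_launcher())
```

Running this inside the container with the same command as the training script shows whether the process was started under SLURM at all. If it reports `torch` or `none`, a launch path that expects SLURM variables would be expected to fail in this way.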

### 🐛 Describe the bug Hi, I'm training BERT with sequence parallelism in Colossal-AI, following this [link](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/language/bert/sequene_parallel). But my training loss is too large, and it seems the...

### 🐛 Describe the bug Hi, I'm running sequence_parallel for BERT pre-training, but I ran into this problem: ![b465612bbf66f370c18248b5d6f86bf](https://user-images.githubusercontent.com/38046403/177959217-96fcaf70-1829-4d85-80d2-277277584658.png) What can I do to solve it? Thanks! ### Environment CUDA...

### 🐛 Describe the bug When I run a [ViT experiment](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/image/vision_transformer/hybrid_parallel) with the following command ``` node=76 prefix="srun --nodes=1 --gres=gpu:4 --cpus-per-task=4 --ntasks=1 -w SG-IDC1-10-51-2-$node" $prefix colossalai run --nproc_per_node 4 train_with_cifar10.py...

Colossal-AI implementation of MAE, [arxiv](https://arxiv.org/abs/2111.06377). As an example, we cover only the pretraining phase with the ImageNet 1000 mini dataset. Helpers under subdir [util/](./util/) are from [facebookresearch/deit](https://github.com/facebookresearch/deit), under Apache License 2.0....

### 🐛 Describe the bug `models.shufflenet_v2_x1_0` can be trained with `BATCH_SIZE = 16384`, but the same configuration does not run successfully with ColossalAI. The details are below: ```bash (conda-general) user@user:~/research/Experiments/ColossalAI-Examples/image/resnet$ colossalai run --nproc_per_node...