
Making large AI models cheaper, faster and more accessible

Results: 1072 ColossalAI issues

I did the third step of PPO training; it was time-consuming and unstable. The reward observed during training is between -300 and -10, as follows. Is this normal?...

### 🐛 Describe the bug [04/17/23 20:35:20] INFO colossalai - colossalai - INFO: /home/lym/miniconda3/envs/lab3/lib/python3.9/site-p ackages/colossalai/context/parallel_context.py:522 set_device INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0...

bug

### Describe the feature Suppose we want a "professional language model" for a specific industry (say, maritime transportation). Is there a way to train on a large amount of text data (say papers,...

enhancement
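Adapting a causal LM to a domain corpus usually starts by packing the raw domain text into fixed-length token blocks for continued pretraining. A minimal sketch of that packing step (the function name and the drop-the-remainder convention are illustrative assumptions, not ColossalAI API):

```python
from typing import List


def pack_token_blocks(token_ids: List[int], block_size: int) -> List[List[int]]:
    """Group a flat stream of token ids into fixed-length blocks for
    causal-LM continued pretraining. The trailing remainder shorter
    than block_size is dropped, a common simplification."""
    n_blocks = len(token_ids) // block_size
    return [token_ids[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]
```

For example, packing a 10-token stream into blocks of 4 yields two full blocks and discards the last 2 tokens; each block then serves as one training example whose labels are the inputs shifted by one position.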

### 🐛 Describe the bug strategy:colossal_gemini print info: chunk.tensors_info[p].state TensorState.HOLD TensorState.HOLD_AFTER_BWD, -->so raise error: RuntimeError(f"Parameter model.lm.parameters failed at the gradient reduction. " "Some unsupported torch function is operated upon this...

bug

### Describe the feature Is it not yet possible to train a seq2seq model using ColossalAI? The provided scripts are for causal LM models.

enhancement

### Describe the feature PPO training needs to keep 4 models in memory at the same time. The original implementation keeps the reward/actor/critic/initial models in video RAM at...

enhancement
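The request above is about offloading some of PPO's four models (actor, critic, reward model, initial/reference model) out of GPU memory when they are idle, since only a subset is active at any point in a PPO step. A framework-agnostic sketch of the bookkeeping such a scheme needs (the class and method names are hypothetical; a real implementation would call e.g. `model.to("cpu")` / `model.to("cuda")` at the eviction and acquisition points):

```python
from typing import List


class OffloadPolicy:
    """Track which models are resident on the accelerator, evicting the
    least-recently-used one when a residency cap is exceeded. This is
    pure bookkeeping to illustrate the policy, not actual tensor movement."""

    def __init__(self, max_resident: int):
        self.max_resident = max_resident
        self.resident: List[str] = []  # ordered oldest-first (LRU order)

    def acquire(self, name: str) -> List[str]:
        """Mark `name` as resident on the device; return the names of any
        models that had to be evicted to host memory to make room."""
        evicted: List[str] = []
        if name in self.resident:
            self.resident.remove(name)  # refresh its LRU position
        while len(self.resident) >= self.max_resident:
            evicted.append(self.resident.pop(0))  # evict oldest first
        self.resident.append(name)
        return evicted
```

With a cap of 2, acquiring the actor and critic fills the device; a subsequent acquire of the reward model evicts the actor, trading PCIe transfer time for the VRAM the snippet is concerned about.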

### 🐛 Describe the bug Command: colossalai run --nproc_per_node 1 --host gpu21,gpu11 --master_addr gpu21 train.py --config ./configs/vit_mutinode.py --dummy_data Failures: Traceback (most recent call last): File "/home/anaconda3/envs/colossalaipy39/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main...

bug

Error Message: -------------------------------------- ninja: no work to do. Loading extension module fused_optim... Traceback (most recent call last): File "/workspace/ColossalAI/applications/Chat/examples/train_sft.py", line 187, in train(args) File "/workspace/ColossalAI/applications/Chat/examples/train_sft.py", line 107, in train train_dataset...

bug

#7 1268.3 [2/2] /usr/local/cuda/bin/nvcc -I/tmp/pip-req-build-uwisoelo/colossalai/kernel/cuda_native/csrc/kernels/include -I/usr/local/cuda/include -I/opt/conda/lib/python3.9/site-packages/torch/include -I/opt/conda/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.9/site-packages/torch/include/TH -I/opt/conda/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/opt/conda/include/python3.9 -c -c /tmp/pip-req-build-uwisoelo/colossalai/kernel/cuda_native/csrc/moe_cuda_kernel.cu -o /tmp/pip-req-build-uwisoelo/build/temp.linux-x86_64-3.9/tmp/pip-req-build-uwisoelo/colossalai/kernel/cuda_native/csrc/moe_cuda_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 --use_fast_math -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda...

### 🐛 Describe the bug In the file ColossalAI/applications/Chat/coati/trainer/strategies/colossalai.py, line 150, `optimizer.backward(loss)` appears to be a code error ### Environment _No response_

bug