ColossalAI
                        Making large AI models cheaper, faster and more accessible
I ran the third step of PPO training; it was time-consuming and unstable. The reward observed during training ranged between -300 and -10, as follows. Is this situation normal?...
### 🐛 Describe the bug [04/17/23 20:35:20] INFO colossalai - colossalai - INFO: /home/lym/miniconda3/envs/lab3/lib/python3.9/site-packages/colossalai/context/parallel_context.py:522 set_device INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0...
### Describe the feature Suppose we want a "professional language model" for a specific industry (say, maritime transportation). Is there a way to train on a large amount of text data (say papers,...
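The usual answer to this kind of question is domain-adaptive continued pretraining: keep training an existing causal LM with the standard next-token objective on in-domain text. Below is a minimal PyTorch sketch of that training step; it is not the ColossalAI API, and `TinyCausalLM` is a toy stand-in for a real pretrained model.

```python
# Sketch of domain-adaptive continued pretraining (toy model, not ColossalAI's API).
import torch
import torch.nn as nn

class TinyCausalLM(nn.Module):
    """Toy stand-in for a pretrained causal language model."""
    def __init__(self, vocab_size=100, d_model=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, ids):
        return self.lm_head(self.embed(ids))  # (batch, seq, vocab)

def continued_pretrain_step(model, optimizer, batch_ids):
    """One next-token-prediction step on a batch of in-domain token ids."""
    logits = model(batch_ids[:, :-1])          # predict token t+1 from tokens <= t
    targets = batch_ids[:, 1:]
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

torch.manual_seed(0)
model = TinyCausalLM()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
domain_batch = torch.randint(0, 100, (4, 16))  # pretend maritime-domain tokens
losses = [continued_pretrain_step(model, opt, domain_batch) for _ in range(5)]
print(losses[0] > losses[-1])  # loss should drop as the batch is memorized
```

For real use you would swap the toy model for a pretrained checkpoint and tokenize the domain corpus (papers, manuals, reports) into the batches fed to this loop.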
### 🐛 Describe the bug strategy:colossal_gemini print info: chunk.tensors_info[p].state TensorState.HOLD TensorState.HOLD_AFTER_BWD, so it raises: RuntimeError(f"Parameter model.lm.parameters failed at the gradient reduction. " "Some unsupported torch function is operated upon this...
### Describe the feature Is it not yet possible to train a seq2seq model using ColossalAI? The provided scripts are for causal LM models
### Describe the feature PPO training needs to maintain 4 models in memory at the same time. The original implementation keeps the reward/actor/critic/initial models in video RAM at...
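The memory pressure described above is easy to see with back-of-the-envelope arithmetic: four resident copies of the weights (actor, critic, reward model, frozen initial/reference model). The numbers below are illustrative assumptions (fp16 weights only, no gradients or optimizer states), not measurements of ColossalAI.

```python
# Rough sketch of why PPO's four resident models strain VRAM (illustrative only).
def model_bytes(n_params, bytes_per_param=2):
    """Weight memory for one model, assuming fp16 (2 bytes/param)."""
    return n_params * bytes_per_param

def ppo_resident_gib(n_params):
    """Actor + critic + reward model + frozen initial model, all resident
    at once; counts weights only, ignoring grads and optimizer states."""
    roles = ["actor", "critic", "reward", "initial"]
    total = sum(model_bytes(n_params) for _ in roles)
    return total / 2**30

print(round(ppo_resident_gib(7e9), 1))  # ~52.2 GiB for a 7B-parameter model
```

This is why offloading some of the frozen models (reward and initial) to CPU RAM, as the feature request suggests, can substantially reduce peak GPU memory.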
### 🐛 Describe the bug Command: colossalai run --nproc_per_node 1 --host gpu21,gpu11 --master_addr gpu21 train.py --config ./configs/vit_mutinode.py --dummy_data Failures: Traceback (most recent call last): File "/home/anaconda3/envs/colossalaipy39/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main...
Error Message: -------------------------------------- ninja: no work to do. Loading extension module fused_optim... Traceback (most recent call last): File "/workspace/ColossalAI/applications/Chat/examples/train_sft.py", line 187, in train(args) File "/workspace/ColossalAI/applications/Chat/examples/train_sft.py", line 107, in train train_dataset...
#7 1268.3 [2/2] /usr/local/cuda/bin/nvcc -I/tmp/pip-req-build-uwisoelo/colossalai/kernel/cuda_native/csrc/kernels/include -I/usr/local/cuda/include -I/opt/conda/lib/python3.9/site-packages/torch/include -I/opt/conda/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.9/site-packages/torch/include/TH -I/opt/conda/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/opt/conda/include/python3.9 -c -c /tmp/pip-req-build-uwisoelo/colossalai/kernel/cuda_native/csrc/moe_cuda_kernel.cu -o /tmp/pip-req-build-uwisoelo/build/temp.linux-x86_64-3.9/tmp/pip-req-build-uwisoelo/colossalai/kernel/cuda_native/csrc/moe_cuda_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 --use_fast_math -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda...
### 🐛 Describe the bug In ColossalAI/applications/Chat/coati/trainer/strategies/colossalai.py, line 150, optimizer.backward(loss) appears to be a code error ### Environment _No response_