ColossalAI
                        Making large AI models cheaper, faster and more accessible
I ran the third step of PPO training; it was time-consuming and unstable. The reward observed during training ranged between -300 and -10, as follows. Is this situation normal?...
### 🐛 Describe the bug [04/17/23 20:35:20] INFO colossalai - colossalai - INFO: /home/lym/miniconda3/envs/lab3/lib/python3.9/site-packages/colossalai/context/parallel_context.py:522 set_device INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0...
### Describe the feature Suppose we want a "professional language model" for a specific industry (say, maritime transportation). Is there a way to train on a large amount of text data (say papers,...
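The usual answer to this kind of question is domain-adaptive continued pretraining: keep training an existing causal LM with the standard next-token objective on in-domain text. Below is a minimal PyTorch sketch of that training step; it is not the ColossalAI API, and `TinyCausalLM` is a toy stand-in for a real pretrained model.

```python
# Sketch of domain-adaptive continued pretraining (toy model, not ColossalAI's API).
import torch
import torch.nn as nn

class TinyCausalLM(nn.Module):
    """Toy stand-in for a pretrained causal language model."""
    def __init__(self, vocab_size=100, d_model=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, ids):
        return self.lm_head(self.embed(ids))  # (batch, seq, vocab)

def continued_pretrain_step(model, optimizer, batch_ids):
    """One next-token-prediction step on a batch of in-domain token ids."""
    logits = model(batch_ids[:, :-1])          # predict token t+1 from tokens <= t
    targets = batch_ids[:, 1:]
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

torch.manual_seed(0)
model = TinyCausalLM()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
domain_batch = torch.randint(0, 100, (4, 16))  # pretend maritime-domain tokens
losses = [continued_pretrain_step(model, opt, domain_batch) for _ in range(5)]
print(losses[0] > losses[-1])  # loss should drop as the batch is memorized
```

For real use you would swap the toy model for a pretrained checkpoint and tokenize the domain corpus (papers, manuals, reports) into the batches fed to this loop.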
### 🐛 Describe the bug strategy:colossal_gemini print info: chunk.tensors_info[p].state TensorState.HOLD TensorState.HOLD_AFTER_BWD, so it raises: RuntimeError(f"Parameter model.lm.parameters failed at the gradient reduction. " "Some unsupported torch function is operated upon this...
### Describe the feature Is it not yet possible to train a seq2seq model using ColossalAI? The provided scripts are for causal LM models
### Describe the feature PPO training needs to maintain 4 models in memory at the same time. The original implementation keeps the reward/actor/critic/initial models in video RAM at...
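The memory pressure described above is easy to see with back-of-the-envelope arithmetic: four resident copies of the weights (actor, critic, reward model, frozen initial/reference model). The numbers below are illustrative assumptions (fp16 weights only, no gradients or optimizer states), not measurements of ColossalAI.

```python
# Rough sketch of why PPO's four resident models strain VRAM (illustrative only).
def model_bytes(n_params, bytes_per_param=2):
    """Weight memory for one model, assuming fp16 (2 bytes/param)."""
    return n_params * bytes_per_param

def ppo_resident_gib(n_params):
    """Actor + critic + reward model + frozen initial model, all resident
    at once; counts weights only, ignoring grads and optimizer states."""
    roles = ["actor", "critic", "reward", "initial"]
    total = sum(model_bytes(n_params) for _ in roles)
    return total / 2**30

print(round(ppo_resident_gib(7e9), 1))  # ~52.2 GiB for a 7B-parameter model
```

This is why offloading some of the frozen models (reward and initial) to CPU RAM, as the feature request suggests, can substantially reduce peak GPU memory.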
### 🐛 Describe the bug Command: colossalai run --nproc_per_node 1 --host gpu21,gpu11 --master_addr gpu21 train.py --config ./configs/vit_mutinode.py --dummy_data Failures: Traceback (most recent call last): File "/home/anaconda3/envs/colossalaipy39/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main...
Error Message: -------------------------------------- ninja: no work to do. Loading extension module fused_optim... Traceback (most recent call last): File "/workspace/ColossalAI/applications/Chat/examples/train_sft.py", line 187, in train(args) File "/workspace/ColossalAI/applications/Chat/examples/train_sft.py", line 107, in train train_dataset...
#7 1268.3 [2/2] /usr/local/cuda/bin/nvcc -I/tmp/pip-req-build-uwisoelo/colossalai/kernel/cuda_native/csrc/kernels/include -I/usr/local/cuda/include -I/opt/conda/lib/python3.9/site-packages/torch/include -I/opt/conda/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.9/site-packages/torch/include/TH -I/opt/conda/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/opt/conda/include/python3.9 -c -c /tmp/pip-req-build-uwisoelo/colossalai/kernel/cuda_native/csrc/moe_cuda_kernel.cu -o /tmp/pip-req-build-uwisoelo/build/temp.linux-x86_64-3.9/tmp/pip-req-build-uwisoelo/colossalai/kernel/cuda_native/csrc/moe_cuda_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 --use_fast_math -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda...
### 🐛 Describe the bug In ColossalAI/applications/Chat/coati/trainer/strategies/colossalai.py, line 150, optimizer.backward(loss) appears to be a code error ### Environment _No response_