ColossalAI
Making large AI models cheaper, faster and more accessible
PPO training has difficulty converging. This seems related to the hyperparameters num_episodes, max_epochs, max_timesteps, and update_timesteps. How do you recommend setting these parameters?
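For context, here is a minimal sketch of how these four hyperparameters typically interact in a Coati-style PPO loop. `make_experience` and `ppo_update` are hypothetical placeholders, not the actual ColossalAI API.

```python
# Hedged sketch: how num_episodes / max_timesteps / update_timesteps / max_epochs
# typically interact in a PPO fine-tuning loop. `make_experience` and
# `ppo_update` are hypothetical placeholders standing in for rollout collection
# (prompt sampling + generation + reward scoring) and the actor/critic update.

def train_ppo(num_episodes, max_timesteps, update_timesteps, max_epochs,
              make_experience, ppo_update):
    buffer = []    # replay buffer of collected rollouts
    steps = 0
    for episode in range(num_episodes):
        for _ in range(max_timesteps):
            buffer.append(make_experience())       # collect one rollout
            steps += 1
            if steps % update_timesteps == 0:      # enough rollouts collected
                for _ in range(max_epochs):        # reuse the same buffer for several passes
                    for batch in buffer:
                        ppo_update(batch)          # actor + critic optimization step
                buffer.clear()
```

Under this structure the run collects num_episodes * max_timesteps rollouts in total, and every batch of update_timesteps rollouts is reused for max_epochs optimizer passes. A common PPO heuristic is to keep max_epochs small (1-4) so the policy does not drift too far from the data it was collected with, and to make update_timesteps large enough to give stable advantage estimates.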
### 🐛 Describe the bug When I train stage 3 (PPO) in Chat, the following error occurs, with the traceback pointing into /home/zzg/workspace/pycharm/ColossalAI/applications/Chat/examples/train_prompts_jp.py:303, near line 300: parser.add_argument('--max_datasets_size', type=int,...
### 🐛 Describe the bug My execution command: torchrun --standalone --nproc_per_node=4 ./examples/train_sft.py --pretrain "/share/disk1/xiangchaoqi/gpt_test/pytorch_model-00033-of-00033.bin" --model 'llama' --strategy colossalai_zero2 --log_interval 10 --save_path output/Coati-7B --dataset /share/disk1/xiangchaoqi/gpt_test/instinwild_ch.json --batch_size 4 --accimulation_steps 8 --lr 2e-5 --max_datasets_size...
### 🐛 Describe the bug ModuleNotFoundError: No module named 'colossalai.kernel.op_builder' ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 9882) of binary: /home/admin/.conda/envs/colossalChat_py38_pytorch112/bin/python ### Environment _No response_
### 🐛 Describe the bug In stage 1 and stage 3, the saved training checkpoint is very large (about 13 GB) when training with the PEFT (LoRA) approach. I have set...
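For comparison, here is a minimal sketch of why a full state_dict checkpoint is multi-GB while an adapter-only checkpoint is tiny. It uses the Hugging Face peft API as an illustration, which may differ from the Coati LoRA code path.

```python
# Hedged sketch: full state_dict vs. adapter-only saving with Hugging Face peft.
# Assumes the model was wrapped with get_peft_model; the Coati LoRA implementation
# may save weights differently, so treat this as an illustration only.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("bigscience/bloom-1b1")
model = get_peft_model(base, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))

# Full checkpoint: contains the frozen base weights as well -> multi-GB file.
torch.save(model.state_dict(), "full_checkpoint.pt")

# Adapter-only checkpoint: just the LoRA matrices -> typically tens of MB.
model.save_pretrained("lora_adapter")
```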
### 🐛 Describe the bug ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 130182) of binary: /usr/bin/python3.8 Traceback (most recent call last): File "/usr/local/bin/torchrun", line 8, in sys.exit(main()) File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345,...
### 🐛 Describe the bug colossalai raises a metaclass conflict when used with PyTorch 1.13.1. Command: CUDA_VISIBLE_DEVICES=2 torchrun --standalone --nproc_per_node=1 pretrain.py Log: ``` Traceback (most recent call last): File "pretrain.py", line 66, in from strategies import DDPStrategy,...
### 🐛 Describe the bug In the script 'ColossalAI/applications/Chat/examples/train_sft.sh', the LLaMA-7B model is trained with the LoRA method, but there is a problem during inference. Is it because LoRA...
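As a point of comparison, here is a minimal sketch of loading a LoRA adapter for inference with the Hugging Face peft API. The paths are hypothetical, and the Coati inference script may handle LoRA weights differently.

```python
# Hedged sketch: at inference time a LoRA-trained checkpoint is only an adapter,
# so it has to be attached to (or merged into) the base model first.
# Paths below are hypothetical placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("path/to/llama-7b-base")   # hypothetical path
model = PeftModel.from_pretrained(base, "path/to/lora_adapter")        # hypothetical path
model = model.merge_and_unload()   # fold the LoRA matrices into the base weights for plain inference

tokenizer = AutoTokenizer.from_pretrained("path/to/llama-7b-base")
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```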
### 🐛 Describe the bug Hello, here is a bug similar to [issue #3534](https://github.com/hpcaitech/ColossalAI/issues/3534). Using the default Anthropic/hh-rlhf dataset, pretrain_model: bigscience/bloom-1b1, batch_size: 1, max_epochs: 1, max_len: 512, loss_fn: log_sig. The loss is random...
### 🐛 Describe the bug Using the default rm_static dataset with train_data set to 75000. pretrain_model: bloomz-1b1, batch_size: 8, max_epochs: 4, max_len: 256, machine: 2× V100 32 GB, loss_fn: log_sig. After 3 hours of training...
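For reference, the `log_sig` loss mentioned in the two reward-model reports above is normally the pairwise log-sigmoid ranking loss. A minimal sketch (not necessarily the exact ColossalAI implementation) is:

```python
# Hedged sketch of the pairwise log-sigmoid ranking loss referred to as
# `log_sig` above: it pushes the reward of the chosen response above the
# reward of the rejected one. This mirrors the standard RLHF reward-model
# objective, not necessarily the exact ColossalAI implementation.
import torch
import torch.nn.functional as F

def log_sig_loss(chosen_reward: torch.Tensor, rejected_reward: torch.Tensor) -> torch.Tensor:
    # -log(sigmoid(r_chosen - r_rejected)), averaged over the batch
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Example: a healthy run should drive the margin (and ranking accuracy) up over time.
chosen = torch.tensor([1.2, 0.3])
rejected = torch.tensor([0.4, 0.5])
print(log_sig_loss(chosen, rejected))  # ~0.59 here; random rewards give ~log(2) ≈ 0.693
```

A loss that stays near log 2 ≈ 0.693 throughout training usually means the reward model is not separating chosen from rejected responses, which matches the "loss is random" symptom described above.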