ColossalAI
Making large AI models cheaper, faster and more accessible
PPO training has difficulty converging. This seems related to the hyperparameters num_episodes, max_epochs, max_timesteps, and update_timesteps. How do you recommend setting these parameters?
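For context, here is a minimal sketch of how these four hyperparameters typically interact in a Coati-style PPO loop. `make_experience` and `ppo_update` are hypothetical placeholders, not the actual ColossalAI API.

```python
# Hedged sketch: how num_episodes / max_timesteps / update_timesteps / max_epochs
# typically interact in a PPO fine-tuning loop. `make_experience` and
# `ppo_update` are hypothetical placeholders standing in for rollout collection
# (prompt sampling + generation + reward scoring) and the actor/critic update.

def train_ppo(num_episodes, max_timesteps, update_timesteps, max_epochs,
              make_experience, ppo_update):
    buffer = []    # replay buffer of collected rollouts
    steps = 0
    for episode in range(num_episodes):
        for _ in range(max_timesteps):
            buffer.append(make_experience())       # collect one rollout
            steps += 1
            if steps % update_timesteps == 0:      # enough rollouts collected
                for _ in range(max_epochs):        # reuse the same buffer for several passes
                    for batch in buffer:
                        ppo_update(batch)          # actor + critic optimization step
                buffer.clear()
```

Under this structure the run collects num_episodes * max_timesteps rollouts in total, and every batch of update_timesteps rollouts is reused for max_epochs optimizer passes. A common PPO heuristic is to keep max_epochs small (1-4) so the policy does not drift too far from the data it was collected with, and to make update_timesteps large enough to give stable advantage estimates.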
### 🐛 Describe the bug When I train stage 3 (PPO) in Chat, the following error occurs, with the traceback pointing into /home/zzg/workspace/pycharm/ColossalAI/applications/Chat/examples/train_prompts_jp.py:303, near line 300: parser.add_argument('--max_datasets_size', type=int,...
### 🐛 Describe the bug My execution command: torchrun --standalone --nproc_per_node=4 ./examples/train_sft.py --pretrain "/share/disk1/xiangchaoqi/gpt_test/pytorch_model-00033-of-00033.bin" --model 'llama' --strategy colossalai_zero2 --log_interval 10 --save_path output/Coati-7B --dataset /share/disk1/xiangchaoqi/gpt_test/instinwild_ch.json --batch_size 4 --accimulation_steps 8 --lr 2e-5 --max_datasets_size...
### 🐛 Describe the bug ModuleNotFoundError: No module named 'colossalai.kernel.op_builder' ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 9882) of binary: /home/admin/.conda/envs/colossalChat_py38_pytorch112/bin/python ### Environment _No response_
### 🐛 Describe the bug In stage 1 and stage 3, the saved training checkpoint is very large (about 13 GB) when training with the PEFT (LoRA) approach. I have set...
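For comparison, here is a minimal sketch of why a full state_dict checkpoint is multi-GB while an adapter-only checkpoint is tiny. It uses the Hugging Face peft API as an illustration, which may differ from the Coati LoRA code path.

```python
# Hedged sketch: full state_dict vs. adapter-only saving with Hugging Face peft.
# Assumes the model was wrapped with get_peft_model; the Coati LoRA implementation
# may save weights differently, so treat this as an illustration only.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("bigscience/bloom-1b1")
model = get_peft_model(base, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))

# Full checkpoint: contains the frozen base weights as well -> multi-GB file.
torch.save(model.state_dict(), "full_checkpoint.pt")

# Adapter-only checkpoint: just the LoRA matrices -> typically tens of MB.
model.save_pretrained("lora_adapter")
```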
### 🐛 Describe the bug ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 130182) of binary: /usr/bin/python3.8 Traceback (most recent call last): File "/usr/local/bin/torchrun", line 8, in sys.exit(main()) File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345,...
### 🐛 Describe the bug colossalai raises a metaclass conflict when used with PyTorch 1.13.1. Command: CUDA_VISIBLE_DEVICES=2 torchrun --standalone --nproc_per_node=1 pretrain.py Log: ``` Traceback (most recent call last): File "pretrain.py", line 66, in from strategies import DDPStrategy,...
### 🐛 Describe the bug In the script 'ColossalAI/applications/Chat/examples/train_sft.sh', the LLaMA-7B model is trained with the LoRA method, but there is a problem during inference. Is it because LoRA...
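As a point of comparison, here is a minimal sketch of loading a LoRA adapter for inference with the Hugging Face peft API. The paths are hypothetical, and the Coati inference script may handle LoRA weights differently.

```python
# Hedged sketch: at inference time a LoRA-trained checkpoint is only an adapter,
# so it has to be attached to (or merged into) the base model first.
# Paths below are hypothetical placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("path/to/llama-7b-base")   # hypothetical path
model = PeftModel.from_pretrained(base, "path/to/lora_adapter")        # hypothetical path
model = model.merge_and_unload()   # fold the LoRA matrices into the base weights for plain inference

tokenizer = AutoTokenizer.from_pretrained("path/to/llama-7b-base")
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```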
### 🐛 Describe the bug Hello, here is a bug similar to [issue #3534](https://github.com/hpcaitech/ColossalAI/issues/3534). Using the default Anthropic/hh-rlhf dataset, pretrain_model: bigscience/bloom-1b1, batch_size: 1, max_epochs: 1, max_len: 512, loss_fn: log_sig. The loss is random...
### 🐛 Describe the bug Using the default rm_static dataset with train_data set to 75000. pretrain_model: bloomz-1b1, batch_size: 8, max_epochs: 4, max_len: 256, machine: 2× V100 32 GB, loss_fn: log_sig. After 3 hours of training...
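For reference, the `log_sig` loss mentioned in the two reward-model reports above is normally the pairwise log-sigmoid ranking loss. A minimal sketch (not necessarily the exact ColossalAI implementation) is:

```python
# Hedged sketch of the pairwise log-sigmoid ranking loss referred to as
# `log_sig` above: it pushes the reward of the chosen response above the
# reward of the rejected one. This mirrors the standard RLHF reward-model
# objective, not necessarily the exact ColossalAI implementation.
import torch
import torch.nn.functional as F

def log_sig_loss(chosen_reward: torch.Tensor, rejected_reward: torch.Tensor) -> torch.Tensor:
    # -log(sigmoid(r_chosen - r_rejected)), averaged over the batch
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Example: a healthy run should drive the margin (and ranking accuracy) up over time.
chosen = torch.tensor([1.2, 0.3])
rejected = torch.tensor([0.4, 0.5])
print(log_sig_loss(chosen, rejected))  # ~0.59 here; random rewards give ~log(2) ≈ 0.693
```

A loss that stays near log 2 ≈ 0.693 throughout training usually means the reward model is not separating chosen from rejected responses, which matches the "loss is random" symptom described above.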