ColossalAI

Making large AI models cheaper, faster and more accessible

Results: 1072 ColossalAI issues, sorted by recently updated

PPO training has difficulty converging. It seems related to the hyperparameters num_episodes, max_epochs, max_timesteps, and update_timesteps. How do you recommend setting these parameters?
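No maintainer reply is included in this excerpt, but the way these four knobs typically interact in a PPO loop can be sketched as follows. This is a minimal sketch with illustrative values only; the exact semantics in the Chat example scripts may differ, and nothing here is an official recommendation.

```python
# Minimal sketch of how the four hyperparameters from the question above
# usually relate in a PPO training loop. The numeric values are illustrative
# assumptions, not recommendations from the ColossalAI maintainers.
ppo_config = {
    "num_episodes": 10,      # outer loop: number of experience-collection episodes
    "max_timesteps": 10,     # rollout steps collected within each episode
    "update_timesteps": 10,  # run a PPO update every this many collected timesteps
    "max_epochs": 5,         # PPO epochs replayed over the experience buffer per update
}

# Rough budget: total PPO updates ~= num_episodes * max_timesteps / update_timesteps.
# Larger update_timesteps -> bigger, less noisy batches per update; larger max_epochs
# -> more reuse of the same experience (faster progress, but easier to destabilize).
total_updates = (ppo_config["num_episodes"] * ppo_config["max_timesteps"]
                 // ppo_config["update_timesteps"])
print(f"approximate number of PPO updates: {total_updates}")
```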

### 🐛 Describe the bug When I train stage 3 (PPO) in Chat, the following error occurs in /home/zzg/workspace/pycharm/ColossalAI/applications/Chat/examples/train_prompts_jp.py:303; line 300 reads: parser.add_argument('--max_datasets_size', type=int,...

bug

### 🐛 Describe the bug My execution command: torchrun --standalone --nproc_per_node=4 ./examples/train_sft.py --pretrain "/share/disk1/xiangchaoqi/gpt_test/pytorch_model-00033-of-00033.bin" --model 'llama' --strategy colossalai_zero2 --log_interval 10 --save_path output/Coati-7B --dataset /share/disk1/xiangchaoqi/gpt_test/instinwild_ch.json --batch_size 4 --accimulation_steps 8 --lr 2e-5 --max_datasets_size...

bug

### 🐛 Describe the bug ModuleNotFoundError: No module named 'colossalai.kernel.op_builder' ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 9882) of binary: /home/admin/.conda/envs/colossalChat_py38_pytorch112/bin/python ### Environment _No response_

bug

### 🐛 Describe the bug In stage 1 and stage 3, the trained model is very large (about 13 GB) when using the PEFT training approach. I have set...

bug

### 🐛 Describe the bug ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 130182) of binary: /usr/bin/python3.8 Traceback (most recent call last): File "/usr/local/bin/torchrun", line 8, in sys.exit(main()) File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345,...

bug

### 🐛 Describe the bug colossalai reports a metaclass conflict when used with pytorch 1.13.1. Command: CUDA_VISIBLE_DEVICES=2 torchrun --standalone --nproc_per_node=1 pretrain.py Log: Traceback (most recent call last): File "pretrain.py", line 66, in from strategies import DDPStrategy,...

bug
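For context on the report above: a "metaclass conflict" is Python's generic TypeError when a class inherits from bases whose metaclasses are incompatible, and in practice it is often a symptom of mismatched library versions. A minimal standalone reproduction of the error type (deliberately unrelated to ColossalAI's actual classes) looks like this:

```python
# Standalone illustration of Python's "metaclass conflict" TypeError.
# This does not use torch or colossalai; it only shows what the message means.
class MetaA(type):
    pass

class MetaB(type):
    pass

class Base1(metaclass=MetaA):
    pass

class Base2(metaclass=MetaB):
    pass

try:
    # A derived class needs a metaclass that is a subclass of all its bases'
    # metaclasses; MetaA and MetaB are unrelated, so this raises TypeError.
    class Combined(Base1, Base2):
        pass
except TypeError as err:
    print(err)  # "metaclass conflict: the metaclass of a derived class must be ..."
```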

### 🐛 Describe the bug In the path 'ColossalAI/applications/Chat/examples/train_sft.sh', the LLaMA-7B model is trained with the LoRA training method, but there is a problem during inference; is it because LoRA...

bug

### 🐛 Describe the bug hello, here is a bug, similar to [issues-3534](https://github.com/hpcaitech/ColossalAI/issues/3534). Using the default Anthropic/hh-rlhf dataset; pretrain_model: bigscience/bloom-1b1, batch_size: 1, max_epochs: 1, max_len: 512, loss_fn: log_sig. The loss is random...

bug

### 🐛 Describe the bug Using the default rm_static dataset with train_data set to 75000; pretrain_model: bloomz-1b1, batch_size: 8, max_epochs: 4, max_len: 256, machine: 2x V100 32GB, loss_fn: log_sig. After 3 hours of training...

bug
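The last two reports both use loss_fn: log_sig. Assuming this refers to the usual pairwise log-sigmoid ranking loss for reward-model training (that is the common reading of the name, not confirmed in these excerpts), the sketch below shows what it computes and why a loss stuck around log(2) ≈ 0.693 means the model is scoring chosen and rejected responses indistinguishably.

```python
import torch
import torch.nn.functional as F

def log_sig_loss(chosen_reward: torch.Tensor, rejected_reward: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss: -log(sigmoid(r_chosen - r_rejected)), averaged over the batch.

    It is minimized when the reward model scores the chosen response above the
    rejected one; if the two scores are indistinguishable the loss sits near
    log(2) ~= 0.693, which is what a "random" reward model looks like.
    """
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy usage on a batch of four (chosen, rejected) reward pairs.
chosen = torch.tensor([0.8, 1.2, 0.1, 0.5])
rejected = torch.tensor([0.2, 0.9, 0.3, -0.1])
print(log_sig_loss(chosen, rejected).item())  # noticeably below 0.693 once pairs separate
```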