[BUG]: Running colossalai run raises exception: [Errno 7] Argument list too long: '/bin/bash'

Open Cherishnoobs opened this issue 10 months ago • 3 comments

Is there an existing issue for this bug?

  • [x] I have searched the existing issues

The bug has not been fixed in the latest main branch

  • [x] I have checked the latest main branch

Do you feel comfortable sharing a concise (minimal) script that reproduces the error? :)

Yes, I will share a minimal reproducible script.

🐛 Describe the bug

I run the following shell command:

colossalai run --hostfile path-to-host-file --nproc_per_node 8 lora_finetune.py --pretrained path-to-DeepSeek-R1-bf16 --dataset path-to-dataset.jsonl --plugin moe --lr 2e-5 --max_length 256 -g --ep 8 --pp 3 --batch_size 24 --lora_rank 8 --lora_alpha 16 --num_epochs 2 --warmup_steps 8 --tensorboard_dir logs --save_dir DeepSeek-R1-bf16-lora

The error is as follows:

Error: failed to run torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --master_port=29500 lorafinetune.py --pretrained /mnt/model/deepseek_r1_Qwen_7b_fp16 --dataset/mnt/R1/R1_sft/data/lora_sft_data.jsonl --plugin moe --lr 2e-5 --maxlength 256 -g --ep 8 --pp 1 --batchsize 8 --lorarank 8 --loraalpha 16 --numepochs 2 --warmupsteps 8 --tensorboarddir logs --save_dir /mnt/model/DeepSeek-R1-bf16-lora on 10.48.54.157, is localhost: True, exception: [Errno 7] Argument list too long: '/bin/bash'

Environment

No response

Cherishnoobs avatar Feb 19 '25 18:02 Cherishnoobs

It seems that all underscores are missing in your command. What's your default shell?

ver217 avatar Feb 20 '25 04:02 ver217

Bash. How can I solve this issue?

Cherishnoobs avatar Feb 20 '25 11:02 Cherishnoobs

I also found that if I simply replace "colossalai run" with "torchrun --nproc_per_node 8 --nnodes 1 --node_rank 0 --master_addr 127.0.0.1 --master_port 29500 lora_finetune.py ....", it works. Why?

Cherishnoobs avatar Feb 20 '25 12:02 Cherishnoobs
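
The thread does not settle the "Why?". As general background, errno 7 is E2BIG, which the kernel returns when the arguments plus environment handed to a new process are too large. Since the same arguments work with a direct torchrun call, one thing worth ruling out is an unusually large environment variable or a very long command string reaching /bin/bash; that is a guess, not something confirmed here. The sketch below is a diagnostic under those assumptions: the helper names env_report and exceeds_single_string_limit are hypothetical, and the 128 KiB per-string cap assumes Linux with 4 KiB pages.

```python
import os

# Assumption: Linux with 4 KiB pages, where a single argv/envp string
# (including its trailing NUL) may not exceed 32 pages = 128 KiB.
LINUX_MAX_ARG_STRLEN = 32 * 4096


def env_report(env=None):
    """Summarise how many bytes the environment would occupy at exec time."""
    env = dict(os.environ) if env is None else env
    # Each entry is handed to the new process as "KEY=VALUE" plus a NUL byte.
    sizes = {
        key: len(f"{key}={value}".encode("utf-8", "surrogateescape")) + 1
        for key, value in env.items()
    }
    largest = max(sizes, key=sizes.get) if sizes else None
    try:
        arg_max = os.sysconf("SC_ARG_MAX")  # combined argv + envp limit
    except (AttributeError, ValueError, OSError):
        arg_max = None  # not available on this platform
    return {
        "total_bytes": sum(sizes.values()),
        "largest_name": largest,
        "largest_bytes": sizes.get(largest, 0),
        "arg_max": arg_max,
    }


def exceeds_single_string_limit(command, limit=LINUX_MAX_ARG_STRLEN):
    """True if `command`, passed as one argument (e.g. to bash -c), is too long."""
    return len(command.encode("utf-8")) + 1 > limit


if __name__ == "__main__":
    print(env_report())
```

Running this in the same shell that invokes colossalai run shows the total environment size and the largest variable. If largest_bytes is close to or above 128 KiB, or total_bytes approaches arg_max, unsetting that variable before launching would be a reasonable first experiment.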