[BUG]: Running colossalai run raises exception: [Errno 7] Argument list too long: '/bin/bash'
Is there an existing issue for this bug?
- [x] I have searched the existing issues
The bug has not been fixed in the latest main branch
- [x] I have checked the latest main branch
Do you feel comfortable sharing a concise (minimal) script that reproduces the error? :)
Yes, I will share a minimal reproducible script.
🐛 Describe the bug
Run the shell script as follows:
colossalai run --hostfile path-to-host-file --nproc_per_node 8 lora_finetune.py --pretrained path-to-DeepSeek-R1-bf16 --dataset path-to-dataset.jsonl --plugin moe --lr 2e-5 --max_length 256 -g --ep 8 --pp 3 --batch_size 24 --lora_rank 8 --lora_alpha 16 --num_epochs 2 --warmup_steps 8 --tensorboard_dir logs --save_dir DeepSeek-R1-bf16-lora
The error is as follows:
Error: failed to run torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --master_port=29500 lorafinetune.py --pretrained /mnt/model/deepseek_r1_Qwen_7b_fp16 --dataset/mnt/R1/R1_sft/data/lora_sft_data.jsonl --plugin moe --lr 2e-5 --maxlength 256 -g --ep 8 --pp 1 --batchsize 8 --lorarank 8 --loraalpha 16 --numepochs 2 --warmupsteps 8 --tensorboarddir logs --save_dir /mnt/model/DeepSeek-R1-bf16-lora on 10.48.54.157, is localhost: True, exception: [Errno 7] Argument list too long: '/bin/bash'
Environment
No response
It seems that all underscores are missing in your command. What's your default shell?
Bash. How can I solve this issue?
I also found that if I simply replace
“colossalai run” with “torchrun --nproc_per_node 8 --nnodes 1 --node_rank 0 --master_addr 127.0.0.1 --master_port 29500 lora_finetune.py ....”
it works. Why?
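A possible lead, offered as an assumption rather than a confirmed diagnosis: Errno 7 (E2BIG) is what exec returns when the arguments plus environment are too large, and on Linux a single argument string is additionally capped at 32 pages (128 KiB with 4 KiB pages). If the launcher forwards the whole current environment inside the one command string it hands to /bin/bash, a very large environment variable could trigger the error, while calling torchrun directly would not. The sketch below measures how large such an `export K="V" ...` prefix would be and lists the biggest variables. The helper names and the 4 KiB page size are hypothetical choices for illustration, not part of ColossalAI.

```python
import os

# Assumption: 4 KiB pages, so the Linux per-argument cap (32 pages) is 128 KiB.
SINGLE_ARG_LIMIT = 32 * 4096


def _nbytes(text):
    """Byte length of text, tolerating undecodable bytes from os.environ."""
    return len(text.encode("utf-8", "surrogateescape"))


def export_string_size(env):
    """Byte length of a hypothetical 'export K="V" ...' prefix built from env."""
    parts = [f'{name}="{value}"' for name, value in env.items()]
    return _nbytes("export " + " ".join(parts))


def largest_vars(env, top=5):
    """Return the `top` (name, value size in bytes) pairs, largest first."""
    sizes = [(name, _nbytes(value)) for name, value in env.items()]
    return sorted(sizes, key=lambda item: item[1], reverse=True)[:top]


if __name__ == "__main__":
    env = dict(os.environ)
    size = export_string_size(env)
    print(f"export prefix size: {size} bytes (assumed limit: {SINGLE_ARG_LIMIT})")
    if size > SINGLE_ARG_LIMIT:
        print("The prefix alone would exceed the assumed per-argument limit.")
    for name, nbytes in largest_vars(env):
        print(f"{nbytes:>10}  {name}")
```

Run it in the same shell that launches the job. If one variable dominates the output, unsetting it before calling colossalai run would be a quick way to test this hypothesis.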