DeepSpeed
[BUG] the following arguments are required: user_script, user_args
When I run a command like `python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --num-gpus 1`, this error happens. How can I fix it?
ERROR LOG:
usage: ds [-h] [-H HOSTFILE] [-i INCLUDE] [-e EXCLUDE] [--num_nodes NUM_NODES] [--min_elastic_nodes MIN_ELASTIC_NODES] [--max_elastic_nodes MAX_ELASTIC_NODES] [--num_gpus NUM_GPUS] [--master_port MASTER_PORT] [--master_addr MASTER_ADDR] [--launcher LAUNCHER] [--launcher_args LAUNCHER_ARGS] [--module] [--no_python] [--no_local_rank] [--no_ssh_check] [--force_multi] [--save_pid] [--enable_each_rank_log ENABLE_EACH_RANK_LOG] [--autotuning {tune,run}] [--elastic_training] [--bind_cores_to_rank] [--bind_core_list BIND_CORE_LIST] user_script ...
ds: error: the following arguments are required: user_script, user_args
my conda env is: accelerate 0.18.0 aiohttp 3.8.4 aiosignal 1.3.1 async-timeout 4.0.2 attrs 22.2.0 certifi 2022.5.18.1 charset-normalizer 3.1.0 cmake 3.26.3 datasets 2.11.0 deepspeed 0.9.0+unknown dill 0.3.6 filelock 3.11.0 frozenlist 1.3.3 fsspec 2023.4.0 hjson 3.1.0 huggingface-hub 0.13.4 idna 3.4 Jinja2 3.1.2 lit 16.0.1 MarkupSafe 2.1.2 mpmath 1.3.0 multidict 6.0.4 multiprocess 0.70.14 networkx 3.1 ninja 1.11.1 numpy 1.24.2 nvidia-cublas-cu11 11.10.3.66 nvidia-cuda-cupti-cu11 11.7.101 nvidia-cuda-nvrtc-cu11 11.7.99 nvidia-cuda-runtime-cu11 11.7.99 nvidia-cudnn-cu11 8.5.0.96 nvidia-cufft-cu11 10.9.0.58 nvidia-curand-cu11 10.2.10.91 nvidia-cusolver-cu11 11.4.0.1 nvidia-cusparse-cu11 11.7.4.91 nvidia-nccl-cu11 2.14.3 nvidia-nvtx-cu11 11.7.91 packaging 23.1 pandas 2.0.0 pip 21.2.4 protobuf 3.20.3 psutil 5.9.4 py-cpuinfo 9.0.0 pyarrow 11.0.0 pydantic 1.10.7 python-dateutil 2.8.2 pytz 2023.3 PyYAML 6.0 requests 2.28.2 responses 0.18.0 sentencepiece 0.1.98 setuptools 61.2.0 six 1.16.0 sympy 1.11.1 torch 2.0.0 tqdm 4.65.0 triton 2.0.0 typing_extensions 4.5.0 tzdata 2023.3 urllib3 1.26.15 wheel 0.37.1 xxhash 3.2.0 yarl 1.8.2
@TinyQi which step does this occur on? Have you modified the scripts in any way? I'm not able to reproduce this error.
If you can, update DeepSpeedExamples repository with the latest changes and run the following:
python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_gpu
The output here should contain additional details about the error that you can share with us!
Thank you for your reply @mrwyattii. I have updated to the latest code, but the problem is still there.
At first I could not run the script because of "^M" (carriage-return) characters, so I removed them and continued, using the command: python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_gpu.
The error still occurred; the message is as follows:
(deepSpeed) [root@gpt_xcq DeepSpeed-Chat]# python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_gpu
---=== Running Step 1 ===---
Running:
bash /share/disk1/xiangchaoqi/15.deepSpeed/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/single_gpu/run_1.3b.sh /share/disk1/xiangchaoqi/15.deepSpeed/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b
Traceback (most recent call last):
File "train.py", line 210, in
Launch command: bash /share/disk1/xiangchaoqi/15.deepSpeed/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/single_gpu/run_1.3b.sh /share/disk1/xiangchaoqi/15.deepSpeed/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b
Log output: /share/disk1/xiangchaoqi/15.deepSpeed/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b/training.log
Please see our tutorial at https://github.com/microsoft/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat/training/step1_supervised_finetuning
Please check that you have installed our requirements: pip install -r requirements.txt
If you are seeing an OOM error, try modifying /share/disk1/xiangchaoqi/15.deepSpeed/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/single_gpu/run_1.3b.sh:
- Reduce --per_device_*_batch_size
- Increase --zero_stage {0,1,2,3} on multi-gpu setups
- Enable --gradient_checkpointing or --only_optimizer_lora
The detailed log shows the same error as before:
usage: ds [-h] [-H HOSTFILE] [-i INCLUDE] [-e EXCLUDE] [--num_nodes NUM_NODES] [--min_elastic_nodes MIN_ELASTIC_NODES] [--max_elastic_nodes MAX_ELASTIC_NODES] [--num_gpus NUM_GPUS] [--master_port MASTER_PORT] [--master_addr MASTER_ADDR] [--launcher LAUNCHER] [--launcher_args LAUNCHER_ARGS] [--module] [--no_python] [--no_local_rank] [--no_ssh_check] [--force_multi] [--save_pid] [--enable_each_rank_log ENABLE_EACH_RANK_LOG] [--autotuning {tune,run}] [--elastic_training] [--bind_cores_to_rank] [--bind_core_list BIND_CORE_LIST] user_script ...
ds: error: the following arguments are required: user_script, user_args
If you want to add arguments to the training, such as the ones you list above (e.g., --gradient_checkpointing), you'll need to add them after main.py in the script. For example:
https://github.com/microsoft/DeepSpeedExamples/blob/2aa7a31b8fdcb34b8ccdc554021a1f5789752ab3/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/single_gpu/run_1.3b.sh#L18-L20
deepspeed --num_gpus 1 main.py --gradient_checkpointing --model_name_or_path ...
The error you are showing is coming from our launcher not recognizing these arguments, since they are intended to be consumed by main.py. Hope this helps.
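To illustrate why flags placed before the script name trigger this error, here is a minimal sketch (an assumption for illustration, not the actual DeepSpeed launcher source) of how a launcher like `ds` parses its command line: launcher flags come first, then a required positional `user_script`, and everything after it is forwarded untouched to the script.

```python
import argparse

def parse_launcher_args(argv):
    """Sketch of a launcher CLI: flags, then the user script, then pass-through args."""
    parser = argparse.ArgumentParser(prog="ds")
    parser.add_argument("--num_gpus", type=int, default=-1)
    parser.add_argument("user_script")                       # required positional
    parser.add_argument("user_args", nargs=argparse.REMAINDER)  # forwarded verbatim
    return parser.parse_args(argv)

# Flags placed after main.py are forwarded to the script, not parsed by the launcher:
ns = parse_launcher_args(["--num_gpus", "1", "main.py", "--gradient_checkpointing"])
print(ns.user_script)  # main.py
print(ns.user_args)    # ['--gradient_checkpointing']
```

If no script name appears on the command line (e.g., only unrecognized flags), argparse exits with "the following arguments are required: user_script, user_args", which is exactly the error in this thread.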
Did you solve your problem, buddy? I'm hitting the same issue as you.
Strangely, I didn't use those parameters ("user_script", "user_args") at all.
Is your problem solved, buddy? I have the same problem as you.
No, did you solve it?
I was getting the same error; using `ds` instead of `deepspeed` solved the problem, like so:
ds --num_gpus 1 main.py --gradient_checkpointing --model_name_or_path ...
I got the same error. It was solved by removing the line breaks between `deepspeed` and `train.py`.
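For the "^M" (Windows CRLF line-ending) problem mentioned earlier in this thread, stripping the carriage returns from the shell scripts is a common fix. A sketch, using a throwaway script name (`run_example.sh` is illustrative, not a file from the repo):

```shell
# Reproduce the issue: a script saved with Windows (CRLF) line endings.
printf 'echo hello\r\n' > run_example.sh

# Strip the trailing carriage returns in place (same effect as dos2unix).
sed -i 's/\r$//' run_example.sh

# The cleaned script now runs under bash without ^M errors.
bash run_example.sh
```

`dos2unix run_example.sh` does the same job where that tool is installed.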
Closing as the issue appears to be resolved.
Using `ds` instead of `deepspeed` fixes this for me. The documentation should be changed!
Great!!! This worked for me! Thank you!