[BUG] the following arguments are required: user_script, user_args

Open · TinyQi opened this issue 1 year ago · 6 comments

When I run a command like python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --num-gpus 1, this error happens. How can I fix it?

ERROR LOG:

    usage: ds [-h] [-H HOSTFILE] [-i INCLUDE] [-e EXCLUDE] [--num_nodes NUM_NODES]
              [--min_elastic_nodes MIN_ELASTIC_NODES] [--max_elastic_nodes MAX_ELASTIC_NODES]
              [--num_gpus NUM_GPUS] [--master_port MASTER_PORT] [--master_addr MASTER_ADDR]
              [--launcher LAUNCHER] [--launcher_args LAUNCHER_ARGS] [--module] [--no_python]
              [--no_local_rank] [--no_ssh_check] [--force_multi] [--save_pid]
              [--enable_each_rank_log ENABLE_EACH_RANK_LOG] [--autotuning {tune,run}]
              [--elastic_training] [--bind_cores_to_rank] [--bind_core_list BIND_CORE_LIST]
              user_script ...
    ds: error: the following arguments are required: user_script, user_args

my conda env is: accelerate 0.18.0 aiohttp 3.8.4 aiosignal 1.3.1 async-timeout 4.0.2 attrs 22.2.0 certifi 2022.5.18.1 charset-normalizer 3.1.0 cmake 3.26.3 datasets 2.11.0 deepspeed 0.9.0+unknown dill 0.3.6 filelock 3.11.0 frozenlist 1.3.3 fsspec 2023.4.0 hjson 3.1.0 huggingface-hub 0.13.4 idna 3.4 Jinja2 3.1.2 lit 16.0.1 MarkupSafe 2.1.2 mpmath 1.3.0 multidict 6.0.4 multiprocess 0.70.14 networkx 3.1 ninja 1.11.1 numpy 1.24.2 nvidia-cublas-cu11 11.10.3.66 nvidia-cuda-cupti-cu11 11.7.101 nvidia-cuda-nvrtc-cu11 11.7.99 nvidia-cuda-runtime-cu11 11.7.99 nvidia-cudnn-cu11 8.5.0.96 nvidia-cufft-cu11 10.9.0.58 nvidia-curand-cu11 10.2.10.91 nvidia-cusolver-cu11 11.4.0.1 nvidia-cusparse-cu11 11.7.4.91 nvidia-nccl-cu11 2.14.3 nvidia-nvtx-cu11 11.7.91 packaging 23.1 pandas 2.0.0 pip 21.2.4 protobuf 3.20.3 psutil 5.9.4 py-cpuinfo 9.0.0 pyarrow 11.0.0 pydantic 1.10.7 python-dateutil 2.8.2 pytz 2023.3 PyYAML 6.0 requests 2.28.2 responses 0.18.0 sentencepiece 0.1.98 setuptools 61.2.0 six 1.16.0 sympy 1.11.1 torch 2.0.0 tqdm 4.65.0 triton 2.0.0 typing_extensions 4.5.0 tzdata 2023.3 urllib3 1.26.15 wheel 0.37.1 xxhash 3.2.0 yarl 1.8.2

TinyQi · Apr 14 '23 03:04

@TinyQi which step does this occur on? Have you modified the scripts in any way? I'm not able to reproduce this error.

If you can, update the DeepSpeedExamples repository with the latest changes and run the following: python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_gpu

The output here should contain additional details about the error that you can share with us!

mrwyattii · Apr 14 '23 17:04

Thank you for your reply @mrwyattii. I have updated to the latest code, but the problem is still there.

At first I couldn't run the script because of "^M" characters, so I removed the "^M" and ran: python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_gpu. The error still occurred, and the message is as follows:

    (deepSpeed) [root@gpt_xcq DeepSpeed-Chat]# python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_gpu
    ---=== Running Step 1 ===---
    Running: bash /share/disk1/xiangchaoqi/15.deepSpeed/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/single_gpu/run_1.3b.sh /share/disk1/xiangchaoqi/15.deepSpeed/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b
    Traceback (most recent call last):
      File "train.py", line 210, in <module>
        main(args)
      File "train.py", line 195, in main
        launch_cmd(args, step_num, cmd)
      File "train.py", line 175, in launch_cmd
        raise RuntimeError('\n\n'.join((
    RuntimeError: Step 1 exited with non-zero status 2

Launch command: bash /share/disk1/xiangchaoqi/15.deepSpeed/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/single_gpu/run_1.3b.sh /share/disk1/xiangchaoqi/15.deepSpeed/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b

Log output: /share/disk1/xiangchaoqi/15.deepSpeed/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b/training.log

Please see our tutorial at https://github.com/microsoft/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat/training/step1_supervised_finetuning

Please check that you have installed our requirements: pip install -r requirements.txt

If you are seeing an OOM error, try modifying /share/disk1/xiangchaoqi/15.deepSpeed/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/single_gpu/run_1.3b.sh:

  • Reduce --per_device_*_batch_size

  • Increase --zero_stage {0,1,2,3} on multi-gpu setups

  • Enable --gradient_checkpointing or --only_optimizer_lora

The detailed log output still shows the same error as before: the same ds usage message as above, ending with "ds: error: the following arguments are required: user_script, user_args".

TinyQi · Apr 18 '23 03:04
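For anyone hitting the same "^M" problem: those characters are DOS carriage returns (CRLF line endings), and in a bash script they can break the backslash line continuations so that the launcher is invoked without main.py, producing exactly the user_script/user_args error above. A minimal sketch of one way to strip them, assuming the step-1 single-GPU script is the one being run (adjust the path as needed):

    # Strip trailing carriage returns (^M) from the launch script in place.
    sed -i 's/\r$//' training/step1_supervised_finetuning/training_scripts/single_gpu/run_1.3b.sh

    # Equivalent, if dos2unix is available:
    # dos2unix training/step1_supervised_finetuning/training_scripts/single_gpu/run_1.3b.sh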

If you want to add arguments to the training, such as the ones listed above (e.g., --gradient_checkpointing), you'll need to add them after main.py in the script. For example:

https://github.com/microsoft/DeepSpeedExamples/blob/2aa7a31b8fdcb34b8ccdc554021a1f5789752ab3/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/single_gpu/run_1.3b.sh#L18-L20

deepspeed --num_gpus 1 main.py --gradient_checkpointing --model_name_or_path ...

The error you are showing is coming from our launcher not recognizing these arguments since they are intended to be consumed by main.py. Hope this helps.

jeffra · Apr 18 '23 18:04
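To make the placement concrete, here is a minimal sketch of what the launch line in run_1.3b.sh can look like with an extra training argument appended after main.py; the flags and output path are illustrative, not the script's verbatim contents:

    # Flags before main.py are consumed by the deepspeed launcher;
    # everything after main.py is passed straight through to main.py itself.
    OUTPUT=./output/actor-models/1.3b      # example output directory
    mkdir -p $OUTPUT
    deepspeed --num_gpus 1 main.py \
       --model_name_or_path facebook/opt-1.3b \
       --gradient_checkpointing \
       --output_dir $OUTPUT &> $OUTPUT/training.log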

Has your problem been solved, buddy? I'm hitting the same issue as you.

khai0617 · Apr 22 '23 10:04

Strangely, I never passed arguments like user_script or user_args at all.

(screenshot attached)

TinyQi · Apr 26 '23 02:04

Nope. Have you solved it?

TinyQi · May 04 '23 07:05

I was getting the same error; simply using ds instead of deepspeed solved the problem, like the following:

ds --num_gpus 1 main.py --gradient_checkpointing --model_name_or_path ...

ArezouA · May 26 '23 20:05
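As far as I can tell, ds and deepspeed are two console entry points for the same DeepSpeed launcher, so if one works where the other fails on an identical command, the failing one is most likely a stale or shadowed wrapper script on the PATH rather than a different tool. A quick diagnostic sketch, using only ordinary shell commands:

    # List every deepspeed/ds launcher visible on PATH, in resolution order.
    which -a deepspeed ds

    # Confirm which installed DeepSpeed package those scripts belong to.
    pip show deepspeed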

I got the same error. It was solved by removing the line breaks between deepspeed and train.py in the script.

xyzkk3 · Jun 20 '23 12:06
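To make the line-break fix concrete: if the launch command inside a run_*.sh script gets split across physical lines without a trailing backslash (or the backslash is followed by a stray carriage return), bash invokes deepspeed with no user_script at all, which is exactly the error in this thread. A sketch with placeholder flags:

    # Broken: the command ends after "--num_gpus 1", so the launcher never
    # receives main.py and complains that user_script/user_args are required.
    deepspeed --num_gpus 1
       main.py --model_name_or_path facebook/opt-1.3b

    # Fixed: one logical command, each continued line ending in a plain "\".
    deepspeed --num_gpus 1 \
       main.py --model_name_or_path facebook/opt-1.3b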

Closing as the issue appears to be resolved.

loadams · Aug 15 '23 20:08

Using ds instead of deepspeed fixes this for me. The documentation should be updated!

earonesty · Aug 23 '23 22:08

Great!!! This worked for me! Thank you!

bochs-bs · May 11 '24 08:05