
stuck running >>bash training_scripts/single_gpu/run_1.3b.sh

Open woshialex opened this issue 1 year ago • 1 comments

I am running it on Ubuntu 20.04 with two 3090 cards, and it always gets stuck. py-spy dump shows:

Process 46930: /home/qi/anaconda3/envs/deepspeed/bin/python -u main.py --local_rank=0 --model_name_or_path facebook/opt-1.3b --gradient_accumulation_steps 2 --lora_dim 128 --zero_stage 0 --deepspeed --output_dir ./output
Python v3.11.2 (/home/qi/anaconda3/envs/deepspeed/bin/python3.11)

Thread 46930 (idle): "MainThread"
    wait (torch/utils/file_baton.py:42)
    _jit_compile (torch/utils/cpp_extension.py:1522)
    load (torch/utils/cpp_extension.py:1284)
    jit_load (deepspeed/ops/op_builder/builder.py:480)
    load (deepspeed/ops/op_builder/builder.py:449)
    __init__ (deepspeed/runtime/engine.py:377)
    initialize (deepspeed/__init__.py:156)
    main (main.py:273)
    <module> (main.py:328)
Thread 47073 (idle): "Thread-1"
    wait (threading.py:324)
    wait (threading.py:622)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:1038)
    _bootstrap (threading.py:995)

After killing the process, the output file shows something like this:

}
}
Using /home/qi/.cache/torch_extensions/py311_cu117 as PyTorch extensions root...
Traceback (most recent call last):
  File "/home/qi/Documents/github/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py", line 328, in <module>
[2023-04-14 16:00:01,336] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 46930
    main()
  File "/home/qi/Documents/github/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py", line 273, in main
    model, optimizer, _, lr_scheduler = deepspeed.initialize(
                                        ^^^^^^^^^^^^^^^^^^^^^
  File "/home/qi/anaconda3/envs/deepspeed/lib/python3.11/site-packages/deepspeed/__init__.py", line 156, in initialize
    engine = DeepSpeedEngine(args=args,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/qi/anaconda3/envs/deepspeed/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 377, in __init__
    util_ops = UtilsBuilder().load()
               ^^^^^^^^^^^^^^^^^^^^^
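
Both dumps stop inside the JIT build of the DeepSpeed utils op, with the main thread idle in the wait function of torch/utils/file_baton.py. As far as I understand it, that function polls until a lock file in the extension's build directory disappears, so a lock left behind by an earlier interrupted build might be one possible cause of the hang. Below is a minimal sketch for listing such files. It assumes the lock file is named "lock" and sits one directory below the extensions root printed in the log; `find_lock_files` is a hypothetical helper written for this post, not part of PyTorch or DeepSpeed.

```python
from pathlib import Path


def find_lock_files(extensions_root):
    """List files named 'lock' one level below the given extensions root.

    Assumption: each JIT-built extension gets its own subdirectory under the
    root, and an in-progress (or interrupted) build leaves a file called
    'lock' inside it.
    """
    root = Path(extensions_root).expanduser()
    if not root.is_dir():
        return []
    return sorted(p for p in root.glob("*/lock") if p.is_file())


if __name__ == "__main__":
    # Path taken from the "Using ... as PyTorch extensions root" log line.
    for path in find_lock_files("/home/qi/.cache/torch_extensions/py311_cu117"):
        print(path)
```

If a lock file shows up while no build is running, it is probably stale. The sketch only lists files and leaves any cleanup as a manual decision, since removing a lock during a live build could corrupt that build.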

woshialex · Apr 14 '23 08:04