DeepSpeedExamples
Stuck when running: bash training_scripts/single_gpu/run_1.3b.sh
I am running it on Ubuntu 20.04 with two 3090 cards and it always gets stuck. py-spy dump shows:
Process 46930: /home/qi/anaconda3/envs/deepspeed/bin/python -u main.py --local_rank=0 --model_name_or_path facebook/opt-1.3b --gradient_accumulation_steps 2 --lora_dim 128 --zero_stage 0 --deepspeed --output_dir ./output
Python v3.11.2 (/home/qi/anaconda3/envs/deepspeed/bin/python3.11)
Thread 46930 (idle): "MainThread"
wait (torch/utils/file_baton.py:42)
_jit_compile (torch/utils/cpp_extension.py:1522)
load (torch/utils/cpp_extension.py:1284)
jit_load (deepspeed/ops/op_builder/builder.py:480)
load (deepspeed/ops/op_builder/builder.py:449)
__init__ (deepspeed/runtime/engine.py:377)
initialize (deepspeed/__init__.py:156)
main (main.py:273)
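For anyone reading this dump: the top frame, wait in torch/utils/file_baton.py, is PyTorch polling until a file named lock disappears from the extension build directory. That normally means another process is compiling the same op, but a lock left behind by an earlier killed run would also keep the wait going. Whether that is what happened here is an assumption, not something the dump proves. A minimal sketch for listing such lock files follows; the helper name find_extension_locks is made up for illustration, and the default root assumes the standard ~/.cache/torch_extensions location unless TORCH_EXTENSIONS_DIR is set.

```python
import os
from pathlib import Path


def find_extension_locks(root):
    """Return every regular file named 'lock' below the given extensions root.

    Hypothetical helper: it only lists candidates and deletes nothing.
    """
    root = Path(root).expanduser()
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("lock") if p.is_file())


if __name__ == "__main__":
    # Assumption: default cache location, overridable via TORCH_EXTENSIONS_DIR.
    default_root = os.environ.get("TORCH_EXTENSIONS_DIR", "~/.cache/torch_extensions")
    for lock_path in find_extension_locks(default_root):
        print(lock_path)
```

If a lock shows up while no other training or build process is running, it is probably stale, and removing it (or the whole build directory for that op) should let the next run recompile instead of waiting.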
After killing the process, the output file shows something like this:
}
}
Using /home/qi/.cache/torch_extensions/py311_cu117 as PyTorch extensions root...
Traceback (most recent call last):
File "/home/qi/Documents/github/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py", line 328, in