DeepSpeedExamples

Error when running the deepspeed_chat example code

Open bestpredicts opened this issue 1 year ago • 3 comments

When I run bash training_scripts/single_node/run_1.3b.sh, I get the following error:

ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.0961456298828125 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.10256075859069824 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.10253238677978516 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.10290169715881348 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.10215353965759277 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.1021888256072998 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.10500884056091309 seconds
load data done.

Using /root/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py310_cu117/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.09672141075134277 seconds
[2023-04-15 15:31:34,267] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.9.1+cc67f22f, git-hash=cc67f22f, git-branch=master
[2023-04-15 15:31:34,272] [INFO] [comm.py:580:init_distributed] Distributed backend already initialized
[2023-04-15 15:31:45,812] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 41526
[2023-04-15 15:31:46,842] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 41527
[2023-04-15 15:31:46,842] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 41528
[2023-04-15 15:31:46,844] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 41529
[2023-04-15 15:31:46,845] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 41530
[2023-04-15 15:31:46,847] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 41531
[2023-04-15 15:31:46,848] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 41532
[2023-04-15 15:31:46,849] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 41533
[2023-04-15 15:31:46,850] [ERROR] [launch.py:434:sigkill_handler] ['/opt/conda/bin/python', '-u', 'main.py', '--local_rank=7', '--model_name_or_path', '/code/tmp/pretrained_model/opt-1.3b', '--gradient_accumulation_steps', '2', '--zero_stage', '2', '--per_device_train_batch_size', '4', '--per_device_eval_batch_size', '4', '--max_seq_len', '512', '--learning_rate', '1e-5', '--deepspeed', '--output_dir', './output'] exits with return code = -7
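
A note on reading the last line: the log reports "return code = -7" but no Python traceback. Assuming the launcher reports child exit statuses using the convention of Python's subprocess module (an assumption, the log itself does not say so), a negative value means the process was terminated by the signal with that number. On most Linux platforms signal 7 is SIGBUS. In containers this is often associated with a too-small shared memory segment (/dev/shm), though nothing in this log confirms that cause. The helper below, describe_return_code, is a hypothetical name used only for illustration; it is a minimal sketch of how such a code can be decoded.

```python
import signal


def describe_return_code(returncode: int) -> str:
    """Describe a child exit status using Python's subprocess convention.

    Assumption: a negative value -N means "terminated by signal N",
    which is how subprocess.Popen.returncode behaves on POSIX systems.
    """
    if returncode >= 0:
        return f"exited normally with status {returncode}"
    signum = -returncode
    try:
        # Signal numbers are platform dependent; look the name up locally.
        name = signal.Signals(signum).name
    except ValueError:
        name = "unknown signal"
    return f"terminated by signal {signum} ({name})"


if __name__ == "__main__":
    # The value reported in the log above.
    print(describe_return_code(-7))
```

Running this on the same machine that produced the log shows which signal the number 7 maps to there, which narrows down where to look next (for example, the kernel log or the container's shared memory settings).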

bestpredicts, Apr 15 '23 15:04