DeepSpeed
[BUG] Error running the deepspeed_chat example code #313
Describe the bug: Running the example script with bash training_scripts/single_node/run_1.3b.sh fails: the launcher kills all eight worker subprocesses and exits with return code = -7 (full log excerpt below).
To Reproduce: bash training_scripts/single_node/run_1.3b.sh
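For completeness, the sketch below shows one full reproduction path; the repository URL and directory layout are assumptions based on the public DeepSpeedExamples repo, not details taken from this report:

    # clone the examples repo and run the step-1 SFT script (paths assumed)
    git clone https://github.com/microsoft/DeepSpeedExamples.git
    cd DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning
    bash training_scripts/single_node/run_1.3b.sh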
Expected behavior: The script fine-tunes the OPT-1.3b model to completion without the worker subprocesses being killed.
ds_report output
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.10/site-packages/torch']
torch version .................... 2.0.0
deepspeed install path ........... ['/opt/conda/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.9.1+cc67f22f, cc67f22f, master
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.7
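Side note: the async_io and sparse_attn warnings above are unrelated to the crash. If you want to clear the async_io warning anyway, the ds_report message itself names the fix; the command below assumes a Debian/Ubuntu-based image such as the one listed under Docker context:

    # install the libaio headers so DeepSpeed can JIT-build async_io (optional)
    apt-get update && apt-get install -y libaio-dev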
Screenshots (relevant log excerpt):
Time to load fused_adam op: 0.10290169715881348 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.10215353965759277 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.1021888256072998 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.10500884056091309 seconds
load data done.
Using /root/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py310_cu117/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.09672141075134277 seconds
[2023-04-15 15:31:34,267] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.9.1+cc67f22f, git-hash=cc67f22f, git-branch=master
[2023-04-15 15:31:34,272] [INFO] [comm.py:580:init_distributed] Distributed backend already initialized
[2023-04-15 15:31:45,812] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 41526
[2023-04-15 15:31:46,842] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 41527
[2023-04-15 15:31:46,842] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 41528
[2023-04-15 15:31:46,844] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 41529
[2023-04-15 15:31:46,845] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 41530
[2023-04-15 15:31:46,847] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 41531
[2023-04-15 15:31:46,848] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 41532
[2023-04-15 15:31:46,849] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 41533
[2023-04-15 15:31:46,850] [ERROR] [launch.py:434:sigkill_handler] ['/opt/conda/bin/python', '-u', 'main.py', '--local_rank=7', '--model_name_or_path', '/code/tmp/pretrained_model/opt-1.3b', '--gradient_accumulation_steps', '2', '--zero_stage', '2', '--per_device_train_batch_size', '4', '--per_device_eval_batch_size', '4', '--max_seq_len', '512', '--learning_rate', '1e-5', '--deepspeed', '--output_dir', './output'] exits with return code = -7
^C
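Note on the exit code: a launcher return code of -N means the worker process was killed by signal N, so the -7 above is signal 7, which is SIGBUS on Linux. A quick way to verify the mapping and to inspect the container's shared memory (the /dev/shm check assumes the run happens inside the Docker container listed below):

    # confirm which signal a return code of -7 maps to (SIGBUS on Linux)
    python -c "import signal; print(signal.Signals(7).name)"
    # check the container's shared-memory size; Docker defaults to 64 MB
    df -h /dev/shm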
System info:
- OS: not specified (running inside the Docker image below)
- GPU count and types: single node with 8 GPUs (the launcher spawns 8 worker subprocesses, up to --local_rank=7); GPU model not specified
- DeepSpeed version: 0.9.1+cc67f22f (master), per ds_report above
- Hugging Face Transformers/Accelerate versions: not specified
- Python version: 3.10 (from the torch install path above)
- Other: torch 2.0.0, CUDA 11.7
Docker context: pytorch/pytorch:2.0.0-cuda11.7-cudnn8-devel
Additional context: With PyTorch data loaders inside Docker, SIGBUS frequently indicates that the container's shared-memory segment (/dev/shm, 64 MB by default) is too small for the worker processes.
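If shared memory does turn out to be the cause, relaunching the container with a larger --shm-size is the usual workaround; the 8g value below is illustrative, not a measured requirement:

    # relaunch the same image with more shared memory (size is illustrative)
    docker run --gpus all --shm-size=8g -it pytorch/pytorch:2.0.0-cuda11.7-cudnn8-devel bash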