[BUG] RuntimeError: Step 1 exited with non-zero status

Open huhuhu5798 opened this issue 1 year ago • 0 comments

I ran python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_gpu in Colab, but Step 1 failed with the errors below.

training.log file:

2023-05-11 07:32:43.560027: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[2023-05-11 07:32:44,353] [WARNING] [runner.py:191:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-05-11 07:32:44,366] [INFO] [runner.py:541:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None main.py --model_name_or_path facebook/opt-1.3b --gradient_accumulation_steps 8 --lora_dim 128 --zero_stage 0 --deepspeed --output_dir /content/drive/MyDrive/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b
2023-05-11 07:32:48.755708: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[2023-05-11 07:32:49,623] [INFO] [launch.py:222:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.16.2-1+cuda11.8
[2023-05-11 07:32:49,624] [INFO] [launch.py:222:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.16.2-1
[2023-05-11 07:32:49,624] [INFO] [launch.py:222:main] 0 NCCL_VERSION=2.16.2-1
[2023-05-11 07:32:49,624] [INFO] [launch.py:222:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2023-05-11 07:32:49,624] [INFO] [launch.py:222:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.16.2-1+cuda11.8
[2023-05-11 07:32:49,624] [INFO] [launch.py:222:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2023-05-11 07:32:49,624] [INFO] [launch.py:222:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.16.2-1
[2023-05-11 07:32:49,624] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0]}
[2023-05-11 07:32:49,624] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-05-11 07:32:49,625] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-05-11 07:32:49,625] [INFO] [launch.py:247:main] dist_world_size=1
[2023-05-11 07:32:49,625] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0
2023-05-11 07:32:53.667801: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[2023-05-11 07:32:56,654] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl

Downloading (…)okenizer_config.json: 100%|██████████| 685/685 [00:00<00:00, 4.12MB/s]
Downloading (…)lve/main/config.json: 100%|██████████| 653/653 [00:00<00:00, 4.02MB/s]
Downloading (…)olve/main/vocab.json: 100%|██████████| 899k/899k [00:00<00:00, 26.3MB/s]
Downloading (…)olve/main/merges.txt: 100%|██████████| 456k/456k [00:00<00:00, 1.08MB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 441/441 [00:00<00:00, 2.64MB/s]
Downloading pytorch_model.bin: 100%|██████████| 2.63G/2.63G [00:10<00:00, 251MB/s]
Downloading (…)neration_config.json: 100%|██████████| 137/137 [00:00<00:00, 689kB/s]
Downloading metadata: 100%|██████████| 926/926 [00:00<00:00, 818kB/s]
Downloading readme: 100%|██████████| 530/530 [00:00<00:00, 431kB/s]
Downloading and preparing dataset None/None to /root/.cache/huggingface/datasets/Dahoas___parquet/default-b9d2c4937d617106/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...

Downloading data: 100%|██████████| 68.4M/68.4M [00:00<00:00, 107MB/s]
Downloading data files: 50%|█████ | 1/2 [00:01<00:01, 1.18s/it]
Downloading data: 100%|██████████| 4.61M/4.61M [00:00<00:00, 47.9MB/s]
Downloading data files: 100%|██████████| 2/2 [00:01<00:00, 1.12it/s]
Extracting data files: 100%|██████████| 2/2 [00:00<00:00, 1750.54it/s]
Generating train split: 79%|███████▊ | 60000/76256 [00:00<00:00, 145428.11 examples/s]
Generating test split: 0%| | 0/5103 [00:00<?, ? examples/s]

Dataset parquet downloaded and prepared to /root/.cache/huggingface/datasets/Dahoas___parquet/default-b9d2c4937d617106/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.

100%|██████████| 2/2 [00:00<00:00, 223.48it/s]
Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py310_cu118/fused_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py310_cu118/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[2023-05-11 07:34:44,819] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 6488
[2023-05-11 07:34:44,846] [ERROR] [launch.py:434:sigkill_handler] ['/usr/bin/python3', '-u', 'main.py', '--local_rank=0', '--model_name_or_path', 'facebook/opt-1.3b', '--gradient_accumulation_steps', '8', '--lora_dim', '128', '--zero_stage', '0', '--deepspeed', '--output_dir', '/content/drive/MyDrive/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b'] exits with return code = -9
[1/3] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -I/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/includes -I/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/adam -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_75,code=compute_75 -gencode=arch=compute_75,code=sm_75 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_75,code=compute_75 -std=c++17 -c /usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
[2/3] c++ -MMD -MF fused_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -I/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/includes -I/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/adam -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++14 -g -Wno-reorder -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -c /usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -o fused_adam_frontend.o
[3/3] c++ fused_adam_frontend.o multi_tensor_adam.cuda.o -shared -L/usr/local/lib/python3.10/dist-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o fused_adam.so
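
A note on reading the final error in the log above: the launcher reports "exits with return code = -9". Under Python's subprocess convention, a negative return code means the child process was terminated by a signal with that number, so -9 would be signal 9 (SIGKILL on Linux). The sketch below decodes such a code. It assumes the launcher's value follows that convention, and the helper name describe_return_code is hypothetical, not part of DeepSpeed. The log does not show what sent the signal; the kernel killing the process when host RAM runs out is one common cause on Colab, but that is an assumption, not something this log confirms.

```python
import signal


def describe_return_code(returncode: int) -> str:
    """Translate a subprocess-style return code into a readable description.

    Hypothetical helper for illustration; not part of DeepSpeed.
    """
    if returncode == 0:
        return "exited normally"
    if returncode > 0:
        # Positive codes are ordinary exit statuses chosen by the program.
        return f"exited with status {returncode}"
    # Negative codes mean "terminated by signal number -returncode".
    signum = -returncode
    try:
        name = signal.Signals(signum).name
    except ValueError:
        name = f"unknown signal {signum}"
    return f"killed by {name}"


if __name__ == "__main__":
    # The value reported in the training log above.
    print(describe_return_code(-9))
```

If the signal does turn out to come from memory pressure, checking the Colab runtime's RAM usage while Step 1 loads facebook/opt-1.3b would be a reasonable next thing to look at.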

huhuhu5798, May 11 '23 10:05