LMFlow launch.py:sigkill_handler exits with return code = -11

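I assumed this was a device resource problem, so I also tried switching to a GPT-2 small model, and the same error occurred. That model is gpt2-small-chinese-cluecorpussmall, from https://huggingface.co/models?pipeline_tag=text-generation&sort=downloads&search=gpt2-small-chinese-cluecorpussmall. I now suspect the problem is not the device's resources.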
(lmflow) [root@a4113ca43b08 LMFlow-main]# ./scripts/run_finetune.sh 
[2023-04-15 16:04:25,114] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-04-15 16:04:25,127] [INFO] [runner.py:550:main] cmd = /home/minicoda3/envs/lmflow/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=11000 --enable_each_rank_log=None examples/finetune.py --model_name_or_path gpt2 --dataset_path /root/code/LMFlow-main/data/alpaca/train --output_dir /root/code/LMFlow-main/output_models/finetune --overwrite_output_dir --num_train_epochs 0.01 --learning_rate 2e-5 --block_size 512 --per_device_train_batch_size 1 --deepspeed configs/ds_config_zero3.json --bf16 --run_name finetune --validation_split_percentage 0 --logging_steps 20 --do_train --ddp_timeout 72000 --save_steps 5000 --dataloader_num_workers 1
[2023-04-15 16:04:26,580] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0]}
[2023-04-15 16:04:26,580] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-04-15 16:04:26,580] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-04-15 16:04:26,580] [INFO] [launch.py:162:main] dist_world_size=1
[2023-04-15 16:04:26,580] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-04-15 16:04:29,298] [INFO] [comm.py:652:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
04/15/2023 16:04:29 - WARNING - lmflow.pipeline.finetuner - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False
Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-0f19a1f3dd71ee72/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51...
Downloading data files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 8176.03it/s]
Extracting data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1682.43it/s]
Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-0f19a1f3dd71ee72/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51. Subsequent calls will reuse this data.
Downloading (…)lve/main/config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 665/665 [00:00<00:00, 230kB/s]
Downloading (…)olve/main/vocab.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.04M/1.04M [00:00<00:00, 1.10MB/s]
Downloading (…)olve/main/merges.txt: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 456k/456k [00:00<00:00, 694kB/s]
Downloading (…)/main/tokenizer.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.36M/1.36M [00:01<00:00, 1.34MB/s]
Downloading pytorch_model.bin: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 548M/548M [02:28<00:00, 3.68MB/s]
[2023-04-15 16:07:14,940] [INFO] [partition_parameters.py:415:__exit__] finished initializing model with 0.16B parameters
/home/minicoda3/envs/lmflow/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(
Downloading (…)neration_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 124/124 [00:00<00:00, 39.3kB/s]
04/15/2023 16:07:16 - WARNING - datasets.fingerprint - Parameter 'function'=<function HFDecoderModel.tokenize.<locals>.tokenize_function at 0x7fdfc1b67700> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
Running tokenizer on dataset:  48%|██████████████████████████████████████████████████▉                                                       | 25000/52002 [00:04<00:05, 5163.46 examples/s][WARNING|tokenization_utils_base.py:3571] 2023-04-15 16:07:21,333 >> Token indices sequence length is longer than the specified maximum sequence length for this model (1490 > 1024). Running this sequence through the model will result in indexing errors
[WARNING|hf_decoder_model.py:292] 2023-04-15 16:07:21,333 >> ^^^^^^^^^^^^^^^^ Please ignore the warning above - this long input will be chunked into smaller bits before being passed to the model.
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...                                                                                                                
Creating extension directory /root/.cache/torch_extensions/py39_cu117/cpu_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py39_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/minicoda3/envs/lmflow/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /home/minicoda3/envs/lmflow/lib/python3.9/site-packages/torch/include -isystem /home/minicoda3/envs/lmflow/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/minicoda3/envs/lmflow/lib/python3.9/site-packages/torch/include/TH -isystem /home/minicoda3/envs/lmflow/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/minicoda3/envs/lmflow/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -c /home/minicoda3/envs/lmflow/lib/python3.9/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o 
[2/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/minicoda3/envs/lmflow/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /home/minicoda3/envs/lmflow/lib/python3.9/site-packages/torch/include -isystem /home/minicoda3/envs/lmflow/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/minicoda3/envs/lmflow/lib/python3.9/site-packages/torch/include/TH -isystem /home/minicoda3/envs/lmflow/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/minicoda3/envs/lmflow/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -c /home/minicoda3/envs/lmflow/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o 
[3/3] c++ cpu_adam.o custom_cuda_kernel.cuda.o -shared -lcurand -L/home/minicoda3/envs/lmflow/lib/python3.9/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o cpu_adam.so
Loading extension module cpu_adam...
Time to load cpu_adam op: 19.9612398147583 seconds
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py39_cu117/utils...
Emitting ninja build file /root/.cache/torch_extensions/py39_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] c++ -MMD -MF flatten_unflatten.o.d -DTORCH_EXTENSION_NAME=utils -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/minicoda3/envs/lmflow/lib/python3.9/site-packages/torch/include -isystem /home/minicoda3/envs/lmflow/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/minicoda3/envs/lmflow/lib/python3.9/site-packages/torch/include/TH -isystem /home/minicoda3/envs/lmflow/lib/python3.9/site-packages/torch/include/THC -isystem /home/minicoda3/envs/lmflow/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /home/minicoda3/envs/lmflow/lib/python3.9/site-packages/deepspeed/ops/csrc/utils/flatten_unflatten.cpp -o flatten_unflatten.o 
[2/2] c++ flatten_unflatten.o -shared -L/home/minicoda3/envs/lmflow/lib/python3.9/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o utils.so
Loading extension module utils...
Time to load utils op: 11.367291927337646 seconds
Parameter Offload: Total persistent parameters: 121344 in 98 params
[2023-04-15 16:08:03,812] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 2917
[2023-04-15 16:08:03,812] [ERROR] [launch.py:324:sigkill_handler] ['/home/minicoda3/envs/lmflow/bin/python', '-u', 'examples/finetune.py', '--local_rank=0', '--model_name_or_path', 'gpt2', '--dataset_path', '/root/code/LMFlow-main/data/alpaca/train', '--output_dir', '/root/code/LMFlow-main/output_models/finetune', '--overwrite_output_dir', '--num_train_epochs', '0.01', '--learning_rate', '2e-5', '--block_size', '512', '--per_device_train_batch_size', '1', '--deepspeed', 'configs/ds_config_zero3.json', '--bf16', '--run_name', 'finetune', '--validation_split_percentage', '0', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--save_steps', '5000', '--dataloader_num_workers', '1'] exits with return code = -11
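
A note on reading the last line of the log: when a launcher starts its workers through Python's `subprocess` module, a negative return code normally means the child was terminated by the signal with that number, so `-11` would correspond to signal 11 (SIGSEGV on Linux), a crash in native code rather than a Python exception. The sketch below decodes such a code; `describe_return_code` is a hypothetical helper name for illustration, not part of LMFlow or DeepSpeed, and the signal name assumes Linux numbering.

```python
import signal


def describe_return_code(code: int) -> str:
    """Turn a subprocess-style return code into a readable description."""
    if code < 0:
        # Negative values follow the subprocess convention: -N means "killed by signal N".
        try:
            name = signal.Signals(-code).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {-code} ({name})"
    return f"exited with status {code}"


if __name__ == "__main__":
    print(describe_return_code(-11))
```

If the code does decode to a segmentation fault, running the training script with `python -X faulthandler` may print the Python stack at the moment of the crash, which can help narrow down which native extension is involved.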
Apr 16 '23 06:04 zeroandexe
Hi, does the default model run? In other words, does the gpt2 model produce the same error?
Apr 19 '23 03:04 shizhediao
This issue has been marked as stale because it has not had recent activity. If you think it still needs to be addressed, please feel free to reopen it. Thanks.
Jun 19 '23 11:06 shizhediao