stanford_alpaca
launch.py:sigkill_handler exits with return code = -11
I have also tried switching to a smaller model, e.g. gpt2-small-chinese-cluecorpussmall from https://huggingface.co/, but the run exits with the same return code.
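As a point of comparison, here is a minimal sketch (assuming only the transformers and torch packages already installed in the lmflow env) that loads the same gpt2 checkpoint used by run_finetune.sh and runs one forward pass without DeepSpeed; if this also segfaults, the crash is not specific to the launcher:

# Hypothetical standalone check, outside LMFlow/DeepSpeed:
# load the same "gpt2" checkpoint passed via --model_name_or_path and run one forward pass.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("hello world", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # should print torch.Size([1, 2, 50257]) if the forward pass succeeds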
(lmflow) [root@a4113ca43b08 LMFlow-main]# ./scripts/run_finetune.sh
[2023-04-15 13:15:13,800] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-04-15 13:15:13,815] [INFO] [runner.py:550:main] cmd = /home/minicoda3/envs/lmflow/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=11000 --enable_each_rank_log=None examples/finetune.py --model_name_or_path gpt2 --dataset_path /root/code/LMFlow-main/data/alpaca/train --output_dir /root/code/LMFlow-main/output_models/finetune --overwrite_output_dir --num_train_epochs 0.01 --learning_rate 2e-5 --block_size 512 --per_device_train_batch_size 1 --deepspeed configs/ds_config_zero3.json --bf16 --run_name finetune --validation_split_percentage 0 --logging_steps 20 --do_train --ddp_timeout 72000 --save_steps 5000 --dataloader_num_workers 1
[2023-04-15 13:15:15,253] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0]}
[2023-04-15 13:15:15,253] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-04-15 13:15:15,253] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-04-15 13:15:15,253] [INFO] [launch.py:162:main] dist_world_size=1
[2023-04-15 13:15:15,253] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-04-15 13:15:17,984] [INFO] [comm.py:652:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
04/15/2023 13:15:19 - WARNING - lmflow.pipeline.finetuner - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: False
Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-0f19a1f3dd71ee72/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51...
Downloading data files: 100%|██████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 11184.81it/s]
Extracting data files: 100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1577.99it/s]
Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-0f19a1f3dd71ee72/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51. Subsequent calls will reuse this data.
Downloading (…)lve/main/config.json: 100%|██████████████████████████████████████████████████████████████████████| 665/665 [00:00<00:00, 239kB/s]
Downloading (…)olve/main/vocab.json: 100%|█████████████████████████████████████████████████████████████████| 1.04M/1.04M [00:00<00:00, 2.63MB/s]
Downloading (…)olve/main/merges.txt: 100%|███████████████████████████████████████████████████████████████████| 456k/456k [00:00<00:00, 5.45MB/s]
Downloading (…)/main/tokenizer.json: 100%|█████████████████████████████████████████████████████████████████| 1.36M/1.36M [00:00<00:00, 2.43MB/s]
Downloading pytorch_model.bin: 100%|█████████████████████████████████████████████████████████████████████████| 548M/548M [02:27<00:00, 3.72MB/s]
[2023-04-15 13:17:55,504] [INFO] [partition_parameters.py:415:__exit__] finished initializing model with 0.16B parameters
/home/minicoda3/envs/lmflow/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
Downloading (…)neration_config.json: 100%|█████████████████████████████████████████████████████████████████████| 124/124 [00:00<00:00, 37.1kB/s]
04/15/2023 13:17:57 - WARNING - datasets.fingerprint - Parameter 'function'=<function HFDecoderModel.tokenize.<locals>.tokenize_function at 0x7fac0c5579d0> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
Running tokenizer on dataset: 48%|█████████████████████████████▊ | 25000/52002 [00:04<00:05, 5278.05 examples/s]
[WARNING|tokenization_utils_base.py:3571] 2023-04-15 13:18:01,964 >> Token indices sequence length is longer than the specified maximum sequence length for this model (1490 > 1024). Running this sequence through the model will result in indexing errors
[WARNING|hf_decoder_model.py:292] 2023-04-15 13:18:01,965 >> ^^^^^^^^^^^^^^^^ Please ignore the warning above - this long input will be chunked into smaller bits before being passed to the model.
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py39_cu117/cpu_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py39_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/minicoda3/envs/lmflow/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /home/minicoda3/envs/lmflow/lib/python3.9/site-packages/torch/include -isystem /home/minicoda3/envs/lmflow/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/minicoda3/envs/lmflow/lib/python3.9/site-packages/torch/include/TH -isystem /home/minicoda3/envs/lmflow/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/minicoda3/envs/lmflow/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -c /home/minicoda3/envs/lmflow/lib/python3.9/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o
[2/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/minicoda3/envs/lmflow/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /home/minicoda3/envs/lmflow/lib/python3.9/site-packages/torch/include -isystem /home/minicoda3/envs/lmflow/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/minicoda3/envs/lmflow/lib/python3.9/site-packages/torch/include/TH -isystem /home/minicoda3/envs/lmflow/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/minicoda3/envs/lmflow/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -c /home/minicoda3/envs/lmflow/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
[3/3] c++ cpu_adam.o custom_cuda_kernel.cuda.o -shared -lcurand -L/home/minicoda3/envs/lmflow/lib/python3.9/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o cpu_adam.so
Loading extension module cpu_adam...
Time to load cpu_adam op: 20.41861844062805 seconds
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py39_cu117/utils...
Emitting ninja build file /root/.cache/torch_extensions/py39_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] c++ -MMD -MF flatten_unflatten.o.d -DTORCH_EXTENSION_NAME=utils -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/minicoda3/envs/lmflow/lib/python3.9/site-packages/torch/include -isystem /home/minicoda3/envs/lmflow/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/minicoda3/envs/lmflow/lib/python3.9/site-packages/torch/include/TH -isystem /home/minicoda3/envs/lmflow/lib/python3.9/site-packages/torch/include/THC -isystem /home/minicoda3/envs/lmflow/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /home/minicoda3/envs/lmflow/lib/python3.9/site-packages/deepspeed/ops/csrc/utils/flatten_unflatten.cpp -o flatten_unflatten.o
[2/2] c++ flatten_unflatten.o -shared -L/home/minicoda3/envs/lmflow/lib/python3.9/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o utils.so
Loading extension module utils...
Time to load utils op: 11.291367769241333 seconds
Parameter Offload: Total persistent parameters: 121344 in 98 params
[2023-04-15 13:18:45,476] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 1222
[2023-04-15 13:18:45,476] [ERROR] [launch.py:324:sigkill_handler] ['/home/minicoda3/envs/lmflow/bin/python', '-u', 'examples/finetune.py', '--local_rank=0', '--model_name_or_path', 'gpt2', '--dataset_path', '/root/code/LMFlow-main/data/alpaca/train', '--output_dir', '/root/code/LMFlow-main/output_models/finetune', '--overwrite_output_dir', '--num_train_epochs', '0.01', '--learning_rate', '2e-5', '--block_size', '512', '--per_device_train_batch_size', '1', '--deepspeed', 'configs/ds_config_zero3.json', '--bf16', '--run_name', 'finetune', '--validation_split_percentage', '0', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--save_steps', '5000', '--dataloader_num_workers', '1'] exits with return code = -11
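For reference on the title: a negative return code from the DeepSpeed launcher is the number of the signal that killed the worker, so return code = -11 means the examples/finetune.py subprocess was terminated by signal 11 (SIGSEGV, a segmentation fault) rather than exiting normally. A minimal sketch of that mapping, using only the Python standard library:

import signal

# A subprocess return code of -N means the child was killed by signal N; here N = 11.
print(signal.Signals(11).name)  # prints "SIGSEGV"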