
no error log and exits with return code = -7

Open delltower opened this issue 2 years ago • 3 comments

deepspeed --master_port=11000 examples/finetune.py \
    --model_name_or_path /workspace/work/LMFlow/LMFlow/mydata/model/llama-7b-hf \
    --save_aggregated_lora 0 --use_lora 1 --lora_r 8 \
    --dataset_path /workspace/work/LMFlow/LMFlow/mydata/data/wiki_cn \
    --block_size 512 --validation_split_percentage 0 --dataloader_num_workers 1 \
    --output_dir /workspace/work/LMFlow/LMFlow/mydata/model/output_models/7b-wiki \
    --overwrite_output_dir --num_train_epochs 0.01 --learning_rate 1e-4 \
    --per_device_train_batch_size 4 --deepspeed configs/ds_config_zero2.json --bf16 \
    --run_name finetune_with_lora --do_train --logging_steps 20 \
    --ddp_timeout 72000 --save_steps 5000 \
    | tee /workspace/work/LMFlow/LMFlow/mydata/log/7b-wiki/train.log

[2023-04-18 07:11:27,925] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-04-18 07:11:27,958] [INFO] [runner.py:550:main] cmd = /root/anaconda3/envs/lmflow/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=11000 --enable_each_rank_log=None examples/finetune.py --model_name_or_path /workspace/work/LMFlow/LMFlow/mydata/model/llama-7b-hf --save_aggregated_lora 0 --use_lora 1 --lora_r 8 --dataset_path /workspace/work/LMFlow/LMFlow/mydata/data/wiki_cn --block_size 512 --validation_split_percentage 0 --dataloader_num_workers 1 --output_dir /workspace/work/LMFlow/LMFlow/mydata/model/output_models/7b-wiki --overwrite_output_dir --num_train_epochs 0.01 --learning_rate 1e-4 --per_device_train_batch_size 4 --deepspeed configs/ds_config_zero2.json --bf16 --run_name finetune_with_lora --do_train --logging_steps 20 --ddp_timeout 72000 --save_steps 5000
[2023-04-18 07:11:30,092] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.13.4-1+cuda11.7
[2023-04-18 07:11:30,092] [INFO] [launch.py:135:main] 0 NCCL_VERSION=2.13.4-1
[2023-04-18 07:11:30,092] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.13.4-1
[2023-04-18 07:11:30,092] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.13.4-1+cuda11.7
[2023-04-18 07:11:30,092] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2023-04-18 07:11:30,092] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2023-04-18 07:11:30,092] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.13.4-1
[2023-04-18 07:11:30,092] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2023-04-18 07:11:30,092] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=4, node_rank=0
[2023-04-18 07:11:30,092] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2023-04-18 07:11:30,092] [INFO] [launch.py:162:main] dist_world_size=4
[2023-04-18 07:11:30,092] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2023-04-18 07:11:34,670] [INFO] [comm.py:652:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
04/18/2023 07:11:35 - WARNING - lmflow.pipeline.finetuner - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False
04/18/2023 07:11:35 - WARNING - lmflow.pipeline.finetuner - Process rank: 3, device: cuda:3, n_gpu: 1distributed training: True, 16-bits training: False
04/18/2023 07:11:35 - WARNING - lmflow.pipeline.finetuner - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: False
04/18/2023 07:11:35 - WARNING - lmflow.pipeline.finetuner - Process rank: 2, device: cuda:2, n_gpu: 1distributed training: True, 16-bits training: False
04/18/2023 07:11:36 - WARNING - datasets.builder - Found cached dataset json (/root/.cache/huggingface/datasets/json/default-9a361f23a7da4286/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
04/18/2023 07:11:36 - WARNING - datasets.builder - Found cached dataset json (/root/.cache/huggingface/datasets/json/default-9a361f23a7da4286/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
04/18/2023 07:11:36 - WARNING - datasets.builder - Found cached dataset json (/root/.cache/huggingface/datasets/json/default-9a361f23a7da4286/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
04/18/2023 07:11:36 - WARNING - datasets.builder - Found cached dataset json (/root/.cache/huggingface/datasets/json/default-9a361f23a7da4286/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████| 2/2 [00:17<00:00,  8.85s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████| 2/2 [00:21<00:00, 10.87s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████| 2/2 [00:21<00:00, 10.99s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████| 2/2 [00:23<00:00, 11.53s/it]
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
Running tokenizer on dataset:   0%|          | 0/41 [00:00<?, ? examples/s]
[WARNING|tokenization_utils_base.py:2432] 2023-04-18 07:19:19,788 >> Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
[2023-04-18 07:20:57,699] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 1338
[2023-04-18 07:20:57,699] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 1339
[2023-04-18 07:20:57,737] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 1340
[2023-04-18 07:20:57,738] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 1341
[2023-04-18 07:20:59,606] [ERROR] [launch.py:324:sigkill_handler] ['/root/anaconda3/envs/lmflow/bin/python', '-u', 'examples/finetune.py', '--local_rank=3', '--model_name_or_path', '/workspace/work/LMFlow/LMFlow/mydata/model/llama-7b-hf', '--save_aggregated_lora', '0', '--use_lora', '1', '--lora_r', '8', '--dataset_path', '/workspace/work/LMFlow/LMFlow/mydata/data/wiki_cn', '--block_size', '512', '--validation_split_percentage', '0', '--dataloader_num_workers', '1', '--output_dir', '/workspace/work/LMFlow/LMFlow/mydata/model/output_models/7b-wiki', '--overwrite_output_dir', '--num_train_epochs', '0.01', '--learning_rate', '1e-4', '--per_device_train_batch_size', '4', '--deepspeed', 'configs/ds_config_zero2.json', '--bf16', '--run_name', 'finetune_with_lora', '--do_train', '--logging_steps', '20', '--ddp_timeout', '72000', '--save_steps', '5000'] exits with return code = -7

delltower avatar Apr 18 '23 07:04 delltower

The hardware settings: RAM: 251 GB, GPUs: 4 × A100 40 GB. Please tell me how to debug? ^_^
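For reference, the deepspeed launcher prints the worker's raw exit status, and a negative value means the process was killed by that signal number rather than exiting on its own. A minimal sketch of decoding it, assuming a Linux host (where signal 7 is SIGBUS):

import signal

# A negative return code from a subprocess (as reported by the deepspeed
# launcher) means the worker was terminated by that signal.
return_code = -7
if return_code < 0:
    sig = signal.Signals(-return_code)
    print(f"Worker killed by {sig.name} (signal {-return_code})")
    # On Linux this prints SIGBUS (7), which often points at exhausted
    # shared memory (/dev/shm) or host memory pressure rather than a
    # Python-level error, which would explain the absence of a traceback.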

delltower avatar Apr 18 '23 07:04 delltower

From the "Killing subprocess" messages, it seems that the RAM is not sufficient for training. Could you track the RAM usage with htop?
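Besides watching htop by hand, a minimal sketch for logging RAM usage over time, assuming the psutil package is installed in the training environment; run it in a second terminal while the finetuning job executes:

import time

import psutil  # assumed available; install with `pip install psutil` otherwise

# Print overall RAM usage every few seconds so a spike during tokenization
# or checkpoint loading is easy to spot in retrospect.
while True:
    mem = psutil.virtual_memory()
    print(f"used: {mem.used / 2**30:.1f} GiB / {mem.total / 2**30:.1f} GiB "
          f"({mem.percent:.1f}%), available: {mem.available / 2**30:.1f} GiB")
    time.sleep(5)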

shizhediao avatar Apr 18 '23 10:04 shizhediao

Also, please monitor the disk space usage. The failure may also be caused by insufficient disk space.
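For example, a quick standard-library check of the partitions the job writes to (the paths below are taken from the command and log above):

import shutil

# Check free space where checkpoints are written and where the HuggingFace
# cache stores the tokenized dataset.
for path in ("/workspace/work/LMFlow/LMFlow/mydata/model/output_models/7b-wiki",
             "/root/.cache/huggingface"):
    usage = shutil.disk_usage(path)
    print(f"{path}: {usage.free / 2**30:.1f} GiB free of {usage.total / 2**30:.1f} GiB")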

shizhediao avatar Apr 19 '23 03:04 shizhediao

This issue has been marked as stale because it has not had recent activity. If you think this still needs to be addressed, please feel free to reopen it. Thanks!

shizhediao avatar May 15 '23 00:05 shizhediao