exits with return code = -9
[2023-04-21 22:17:06,284] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0]}
[2023-04-21 22:17:06,284] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-04-21 22:17:06,284] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-04-21 22:17:06,284] [INFO] [launch.py:162:main] dist_world_size=1
[2023-04-21 22:17:06,284] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-04-21 22:17:11,840] [INFO] [comm.py:652:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
04/21/2023 22:17:12 - WARNING - lmflow.pipeline.finetuner - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: False
04/21/2023 22:17:14 - WARNING - datasets.builder - Found cached dataset json (/root/.cache/huggingface/datasets/json/default-cc7d8860227c3483/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████| 33/33 [00:50<00:00, 1.55s/it]
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
[2023-04-21 22:20:31,527] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 19501
[2023-04-21 22:20:31,529] [ERROR] [launch.py:324:sigkill_handler] ['/usr/bin/python3.8', '-u', '/hy-tmp/LMFlow-main/examples/finetune.py', '--local_rank=0', '--model_name_or_path', '/hy-tmp/models/llama-7b-hf', '--dataset_path', '/hy-tmp/LMFlow-main/data/alpaca/train', '--output_dir', '/hy-tmp/models/new_7b', '--overwrite_output_dir', '--num_train_epochs', '0.01', '--learning_rate', '1e-4', '--block_size', '16', '--per_device_train_batch_size', '1', '--use_lora', '1', '--lora_r', '8', '--save_aggregated_lora', '0', '--deepspeed', '/hy-tmp/LMFlow-main/configs/ds_config_zero2.json', '--bf16', '--run_name', 'finetune_with_lora', '--validation_split_percentage', '0', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--save_steps', '5000', '--dataloader_num_workers', '1'] exits with return code = -9
The model is only about 13 GB on disk, so why isn't 62 GB of RAM enough? I watched the python3 process's memory climb to 62 GB before the OS killed it. Meanwhile, GPU memory usage stayed at 1 MiB / 24576 MiB the whole time.
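As a rough back-of-envelope check (these are my own assumptions about what might be held in host RAM during DeepSpeed/LoRA fine-tuning, not measurements of what finetune.py actually allocates), the numbers add up fast:

# Rough host-RAM estimate for LLaMA-7B; the parameter count is taken from the
# "all params" line in the log above. Everything else here is an assumption.
N = 6_742_609_920          # total parameters (from the log)
GB = 1024 ** 3

fp16_weights = N * 2 / GB  # the ~13 GB checkpoint on disk is fp16
fp32_weights = N * 4 / GB  # if the weights get upcast to fp32 anywhere on the CPU
adam_states  = N * 8 / GB  # Adam keeps two fp32 moment buffers per parameter
                           # (only matters if full optimizer states end up on the CPU)

print(f"fp16 weights : {fp16_weights:6.1f} GB")   # ~12.6 GB
print(f"fp32 weights : {fp32_weights:6.1f} GB")   # ~25.1 GB
print(f"Adam states  : {adam_states:6.1f} GB")    # ~50.2 GB

If even two copies of the weights plus some of those buffers sit in host RAM at the same time, 62 GB is gone, which would match the python3 process being killed.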
Sad... I'm hitting the same problem and have no idea how to fix it.
Usually, fine-tuning requires about 200 GB of RAM; see https://github.com/OptimalScale/LMFlow/issues/179. I think it is related to the DeepSpeed strategy.
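For what it's worth, DeepSpeed ships memory estimators that print the expected CPU/GPU footprint of the ZeRO-2 and ZeRO-3 model states (parameters, gradients, optimizer states) with and without offload, which can show where a figure like 200 GB comes from. A minimal sketch; loading the model here is only to count its parameters, the path is the llama-7b-hf directory from the log above, and the load itself already needs tens of GB of host RAM:

from transformers import AutoModelForCausalLM
from deepspeed.runtime.zero.stage_1_and_2 import estimate_zero2_model_states_mem_needs_all_live
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

# Path taken from the command line in the log above.
model = AutoModelForCausalLM.from_pretrained("/hy-tmp/models/llama-7b-hf")

# Prints estimated per-GPU and per-CPU memory for ZeRO-2 and ZeRO-3 model
# states, for the offload and no-offload variants, on 1 GPU / 1 node.
estimate_zero2_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1)
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1)

Note these estimates cover only the model states, not activations or framework overhead, so the real peak is somewhat higher. If configs/ds_config_zero2.json offloads optimizer states to the CPU, that would also push the requirement from GPU memory onto host RAM (I haven't checked that config's contents).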
This issue has been marked as stale because it has not had recent activity. If you think this still needs to be addressed, please feel free to reopen this issue. Thanks.