
Multi-GPU full-parameter training error

Open tankeui opened this issue 1 year ago • 2 comments

```
[2024-06-12 19:36:07,800] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-12 19:36:09,648] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-06-12 19:36:09,648] [INFO] [runner.py:568:main] cmd = anaconda3/envs/lmflow/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=11000 --enable_each_rank_log=None LMFlow/examples/finetune.py --model_name_or_path huggingface/hub/Meta-Llama-3-70B --trust_remote_code 0 --dataset_path LMFlow/data/alpaca/train_conversation --output_dir output_models/finetune --overwrite_output_dir --conversation_template llama3 --num_train_epochs 0.01 --learning_rate 2e-5 --disable_group_texts 1 --block_size 256 --per_device_train_batch_size 1 --deepspeed LMFlow/configs/ds_config_zero3.json --fp16 --run_name finetune --validation_split_percentage 0 --logging_steps 20 --do_train --ddp_timeout 72000 --save_steps 5000 --dataloader_num_workers 1
[2024-06-12 19:36:11,661] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-12 19:36:12,366] [INFO] [launch.py:138:main] 0 TORCH_NCCL_BLOCKING_WAIT=1
[2024-06-12 19:36:12,366] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2024-06-12 19:36:12,366] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=2, node_rank=0
[2024-06-12 19:36:12,366] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2024-06-12 19:36:12,366] [INFO] [launch.py:163:main] dist_world_size=2
[2024-06-12 19:36:12,366] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2024-06-12 19:36:12,419] [INFO] [launch.py:253:main] process 40472 spawned with command: ['anaconda3/envs/lmflow/bin/python', '-u', 'LMFlow/examples/finetune.py', '--local_rank=0', '--model_name_or_path', 'huggingface/hub/Meta-Llama-3-70B', '--trust_remote_code', '0', '--dataset_path', 'LMFlow/data/alpaca/train_conversation', '--output_dir', 'output_models/finetune', '--overwrite_output_dir', '--conversation_template', 'llama3', '--num_train_epochs', '0.01', '--learning_rate', '2e-5', '--disable_group_texts', '1', '--block_size', '256', '--per_device_train_batch_size', '1', '--deepspeed', 'LMFlow/configs/ds_config_zero3.json', '--fp16', '--run_name', 'finetune', '--validation_split_percentage', '0', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--save_steps', '5000', '--dataloader_num_workers', '1']
[2024-06-12 19:36:12,466] [INFO] [launch.py:253:main] process 40473 spawned with command: ['anaconda3/envs/lmflow/bin/python', '-u', 'LMFlow/examples/finetune.py', '--local_rank=1', '--model_name_or_path', 'huggingface/hub/Meta-Llama-3-70B', '--trust_remote_code', '0', '--dataset_path', 'LMFlow/data/alpaca/train_conversation', '--output_dir', 'output_models/finetune', '--overwrite_output_dir', '--conversation_template', 'llama3', '--num_train_epochs', '0.01', '--learning_rate', '2e-5', '--disable_group_texts', '1', '--block_size', '256', '--per_device_train_batch_size', '1', '--deepspeed', 'LMFlow/configs/ds_config_zero3.json', '--fp16', '--run_name', 'finetune', '--validation_split_percentage', '0', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--save_steps', '5000', '--dataloader_num_workers', '1']
[2024-06-12 19:36:17,298] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-12 19:36:17,298] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
[2024-06-12 19:36:20,965] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-06-12 19:36:20,965] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-06-12 19:36:20,965] [INFO] [comm.py:637:init_distributed] cdb=None
[rank1]: Traceback (most recent call last):
[rank1]:   File "LMFlow/examples/finetune.py", line 61, in <module>
[rank1]:     main()
[rank1]:   File "LMFlow/examples/finetune.py", line 44, in main
[rank1]:     model_args, data_args, pipeline_args = parser.parse_args_into_dataclasses()
[rank1]:   File "anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/hf_argparser.py", line 339, in parse_args_into_dataclasses
[rank1]:     obj = dtype(**inputs)
[rank1]:   File "<string>", line 135, in __init__
[rank1]:   File "anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/training_args.py", line 1641, in __post_init__
[rank1]:     and (self.device.type == "cpu" and not is_torch_greater_or_equal_than_2_3)
[rank1]:   File "anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/training_args.py", line 2149, in device
[rank1]:     return self._setup_devices
[rank1]:   File "anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/utils/generic.py", line 59, in __get__
[rank1]:     cached = self.fget(obj)
[rank1]:   File "anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/training_args.py", line 2077, in _setup_devices
[rank1]:     self.distributed_state = PartialState(timeout=timedelta(seconds=self.ddp_timeout))
[rank1]:   File "anaconda3/envs/lmflow/lib/python3.9/site-packages/accelerate/state.py", line 280, in __init__
[rank1]:     self.set_device()
[rank1]:   File "anaconda3/envs/lmflow/lib/python3.9/site-packages/accelerate/state.py", line 790, in set_device
[rank1]:     torch.cuda.set_device(self.device)
[rank1]:   File "anaconda3/envs/lmflow/lib/python3.9/site-packages/torch/cuda/__init__.py", line 399, in set_device
[rank1]:     torch._C._cuda_setDevice(device)
[rank1]: RuntimeError: CUDA error: invalid device ordinal
[rank1]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
06/12/2024 19:36:21 - WARNING - lmflow.pipeline.finetuner - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: True
anaconda3/envs/lmflow/lib/python3.9/site-packages/datasets/load.py:2089: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0. You can remove this warning by passing 'token=None' instead.
  warnings.warn(
[2024-06-12 19:36:22,477] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 40472
[2024-06-12 19:36:22,531] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 40473
[2024-06-12 19:36:22,531] [ERROR] [launch.py:322:sigkill_handler] ['anaconda3/envs/lmflow/bin/python', '-u', 'LMFlow/examples/finetune.py', '--local_rank=1', '--model_name_or_path', 'huggingface/hub/Meta-Llama-3-70B', '--trust_remote_code', '0', '--dataset_path', 'LMFlow/data/alpaca/train_conversation', '--output_dir', 'output_models/finetune', '--overwrite_output_dir', '--conversation_template', 'llama3', '--num_train_epochs', '0.01', '--learning_rate', '2e-5', '--disable_group_texts', '1', '--block_size', '256', '--per_device_train_batch_size', '1', '--deepspeed', 'LMFlow/configs/ds_config_zero3.json', '--fp16', '--run_name', 'finetune', '--validation_split_percentage', '0', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--save_steps', '5000', '--dataloader_num_workers', '1'] exits with return code = 1
```

```bash
#!/bin/bash
# Please run this script under ${project_id} in project directory of
#   https://github.com/shizhediao/llm-ft
#     COMMIT: d5fecf30ba8011067b10cf51fede53a5ab6574e4

export TORCH_SHOW_CPP_STACKTRACES=1
export TORCH_NCCL_BLOCKING_WAIT=1
export CUDA_LAUNCH_BLOCKING=1
export TORCH_USE_CUDA_DSA=1

# Parses arguments
model_name_or_path=huggingface/hub/Meta-Llama-3-70B
dataset_path=LMFlow/data/alpaca/train_conversation
output_dir=output_models/finetune
deepspeed_args="--num_gpus=2 --master_port=11000"
conversation_template=llama3

# Safety related arguments
trust_remote_code=0

while [[ $# -ge 1 ]]; do
  key="$1"
  case ${key} in
    -m|--model_name_or_path)
      model_name_or_path="$2"
      shift
      ;;
    -d|--dataset_path)
      dataset_path="$2"
      shift
      ;;
    -o|--output_model_path)
      output_dir="$2"
      shift
      ;;
    --conversation_template)
      conversation_template="$2"
      shift
      ;;
    --deepspeed_args)
      deepspeed_args="$2"
      shift
      ;;
    --trust_remote_code)
      trust_remote_code="$2"
      shift
      ;;
    *)
      echo "error: unknown option \"${key}\"" 1>&2
      exit 1
  esac
  shift
done

# Finetune
exp_id=finetune
project_dir=$(cd "$(dirname $0)"/..; pwd)
log_dir=${project_dir}/log/${exp_id}
mkdir -p ${output_dir} ${log_dir}

deepspeed ${deepspeed_args} \
  LMFlow/examples/finetune.py \
    --model_name_or_path ${model_name_or_path} \
    --trust_remote_code ${trust_remote_code} \
    --dataset_path ${dataset_path} \
    --output_dir ${output_dir} --overwrite_output_dir \
    --conversation_template ${conversation_template} \
    --num_train_epochs 0.01 \
    --learning_rate 2e-5 \
    --disable_group_texts 1 \
    --block_size 256 \
    --per_device_train_batch_size 1 \
    --deepspeed LMFlow/configs/ds_config_zero3.json \
    --fp16 \
    --run_name finetune \
    --validation_split_percentage 0 \
    --logging_steps 20 \
    --do_train \
    --ddp_timeout 72000 \
    --save_steps 5000 \
    --dataloader_num_workers 1 \
    | tee ${log_dir}/train.log \
    2> ${log_dir}/train.err
```
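For reference, the while/case loop in the script above accepts command-line overrides; a hypothetical invocation (the script filename is a placeholder for wherever this script is saved) would look like:

```bash
# Hypothetical invocation of the script above (filename is a placeholder);
# each flag is routed by the while/case loop to the matching variable.
bash run_finetune.sh \
  -m huggingface/hub/Meta-Llama-3-70B \
  -d LMFlow/data/alpaca/train_conversation \
  -o output_models/finetune \
  --deepspeed_args "--num_gpus=2 --master_port=11000" \
  --conversation_template llama3
```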

How can I fix this problem?

tankeui — Jun 12 '24 11:06

It seems like a CUDA device mismatch issue.

[rank1]: RuntimeError: CUDA error: invalid device ordinal

I guess you've accidentally set CUDA_VISIBLE_DEVICES somewhere else, which leads to a mismatch. Maybe take a look at https://stackoverflow.com/questions/64334033/how-to-solve-runtimeerror-cuda-error-invalid-device-ordinal. Or, try changing:

deepspeed_args="--num_gpus=2 --master_port=11000" 

to

deepspeed_args="--include localhost:x,x --master_port=11000"

wheresmyhair — Jun 12 '24 12:06

> It seems like a CUDA device mismatch issue.
>
> [rank1]: RuntimeError: CUDA error: invalid device ordinal
>
> I guess you've accidentally set CUDA_VISIBLE_DEVICES somewhere else, which leads to a mismatch. Maybe take a look at https://stackoverflow.com/questions/64334033/how-to-solve-runtimeerror-cuda-error-invalid-device-ordinal. Or, try changing:
>
> deepspeed_args="--num_gpus=2 --master_port=11000"
>
> to
>
> deepspeed_args="--include localhost:x,x --master_port=11000"

Thanks, it solved my problem.

tankeui — Jun 12 '24 16:06