Multi-GPU full-parameter training error
[2024-06-12 19:36:07,800] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-12 19:36:09,648] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-06-12 19:36:09,648] [INFO] [runner.py:568:main] cmd = anaconda3/envs/lmflow/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=11000 --enable_each_rank_log=None LMFlow/examples/finetune.py --model_name_or_path huggingface/hub/Meta-Llama-3-70B --trust_remote_code 0 --dataset_path LMFlow/data/alpaca/train_conversation --output_dir output_models/finetune --overwrite_output_dir --conversation_template llama3 --num_train_epochs 0.01 --learning_rate 2e-5 --disable_group_texts 1 --block_size 256 --per_device_train_batch_size 1 --deepspeed LMFlow/configs/ds_config_zero3.json --fp16 --run_name finetune --validation_split_percentage 0 --logging_steps 20 --do_train --ddp_timeout 72000 --save_steps 5000 --dataloader_num_workers 1
[2024-06-12 19:36:11,661] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-12 19:36:12,366] [INFO] [launch.py:138:main] 0 TORCH_NCCL_BLOCKING_WAIT=1
[2024-06-12 19:36:12,366] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2024-06-12 19:36:12,366] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=2, node_rank=0
[2024-06-12 19:36:12,366] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2024-06-12 19:36:12,366] [INFO] [launch.py:163:main] dist_world_size=2
[2024-06-12 19:36:12,366] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2024-06-12 19:36:12,419] [INFO] [launch.py:253:main] process 40472 spawned with command: ['anaconda3/envs/lmflow/bin/python', '-u', 'LMFlow/examples/finetune.py', '--local_rank=0', '--model_name_or_path', 'huggingface/hub/Meta-Llama-3-70B', '--trust_remote_code', '0', '--dataset_path', 'LMFlow/data/alpaca/train_conversation', '--output_dir', 'output_models/finetune', '--overwrite_output_dir', '--conversation_template', 'llama3', '--num_train_epochs', '0.01', '--learning_rate', '2e-5', '--disable_group_texts', '1', '--block_size', '256', '--per_device_train_batch_size', '1', '--deepspeed', 'LMFlow/configs/ds_config_zero3.json', '--fp16', '--run_name', 'finetune', '--validation_split_percentage', '0', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--save_steps', '5000', '--dataloader_num_workers', '1']
[2024-06-12 19:36:12,466] [INFO] [launch.py:253:main] process 40473 spawned with command: ['anaconda3/envs/lmflow/bin/python', '-u', 'LMFlow/examples/finetune.py', '--local_rank=1', '--model_name_or_path', 'huggingface/hub/Meta-Llama-3-70B', '--trust_remote_code', '0', '--dataset_path', 'LMFlow/data/alpaca/train_conversation', '--output_dir', 'output_models/finetune', '--overwrite_output_dir', '--conversation_template', 'llama3', '--num_train_epochs', '0.01', '--learning_rate', '2e-5', '--disable_group_texts', '1', '--block_size', '256', '--per_device_train_batch_size', '1', '--deepspeed', 'LMFlow/configs/ds_config_zero3.json', '--fp16', '--run_name', 'finetune', '--validation_split_percentage', '0', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--save_steps', '5000', '--dataloader_num_workers', '1']
[2024-06-12 19:36:17,298] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-12 19:36:17,298] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
[2024-06-12 19:36:20,965] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-06-12 19:36:20,965] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-06-12 19:36:20,965] [INFO] [comm.py:637:init_distributed] cdb=None
[rank1]: Traceback (most recent call last):
[rank1]: File "LMFlow/examples/finetune.py", line 61, in TORCH_USE_CUDA_DSA to enable device-side assertions.
06/12/2024 19:36:21 - WARNING - lmflow.pipeline.finetuner - Process rank: 0, device: cuda:0, n_gpu: 1,distributed training: True, 16-bits training: True
anaconda3/envs/lmflow/lib/python3.9/site-packages/datasets/load.py:2089: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0. You can remove this warning by passing 'token=None' instead.
warnings.warn(
[2024-06-12 19:36:22,477] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 40472
[2024-06-12 19:36:22,531] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 40473
[2024-06-12 19:36:22,531] [ERROR] [launch.py:322:sigkill_handler] ['anaconda3/envs/lmflow/bin/python', '-u', 'LMFlow/examples/finetune.py', '--local_rank=1', '--model_name_or_path', 'huggingface/hub/Meta-Llama-3-70B', '--trust_remote_code', '0', '--dataset_path', 'LMFlow/data/alpaca/train_conversation', '--output_dir', 'output_models/finetune', '--overwrite_output_dir', '--conversation_template', 'llama3', '--num_train_epochs', '0.01', '--learning_rate', '2e-5', '--disable_group_texts', '1', '--block_size', '256', '--per_device_train_batch_size', '1', '--deepspeed', 'LMFlow/configs/ds_config_zero3.json', '--fp16', '--run_name', 'finetune', '--validation_split_percentage', '0', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--save_steps', '5000', '--dataloader_num_workers', '1'] exits with return code = 1
`#!/bin/bash
# Please run this script under ${project_id} in project directory of
#   https://github.com/shizhediao/llm-ft
#     COMMIT: d5fecf30ba8011067b10cf51fede53a5ab6574e4

# Debugging aids (no spaces are allowed around "=" in an export)
export TORCH_SHOW_CPP_STACKTRACES=1
export TORCH_NCCL_BLOCKING_WAIT=1
export CUDA_LAUNCH_BLOCKING=1
export TORCH_USE_CUDA_DSA=1

# Parses arguments
model_name_or_path=huggingface/hub/Meta-Llama-3-70B
dataset_path=LMFlow/data/alpaca/train_conversation
output_dir=output_models/finetune
deepspeed_args="--num_gpus=2 --master_port=11000"
conversation_template=llama3

# Safety related arguments
trust_remote_code=0

while [[ $# -ge 1 ]]; do
  key="$1"
  case ${key} in
    -m|--model_name_or_path)
      model_name_or_path="$2"
      shift
      ;;
    -d|--dataset_path)
      dataset_path="$2"
      shift
      ;;
    -o|--output_model_path)
      output_dir="$2"
      shift
      ;;
    --conversation_template)
      conversation_template="$2"
      shift
      ;;
    --deepspeed_args)
      deepspeed_args="$2"
      shift
      ;;
    --trust_remote_code)
      trust_remote_code="$2"
      shift
      ;;
    *)
      echo "error: unknown option \"${key}\"" 1>&2
      exit 1
  esac
  shift
done

# Finetune
exp_id=finetune
project_dir=$(cd "$(dirname $0)"/..; pwd)
log_dir=${project_dir}/log/${exp_id}
mkdir -p ${output_dir} ${log_dir}

deepspeed ${deepspeed_args} \
  LMFlow/examples/finetune.py \
    --model_name_or_path ${model_name_or_path} \
    --trust_remote_code ${trust_remote_code} \
    --dataset_path ${dataset_path} \
    --output_dir ${output_dir} --overwrite_output_dir \
    --conversation_template ${conversation_template} \
    --num_train_epochs 0.01 \
    --learning_rate 2e-5 \
    --disable_group_texts 1 \
    --block_size 256 \
    --per_device_train_batch_size 1 \
    --deepspeed LMFlow/configs/ds_config_zero3.json \
    --fp16 \
    --run_name finetune \
    --validation_split_percentage 0 \
    --logging_steps 20 \
    --do_train \
    --ddp_timeout 72000 \
    --save_steps 5000 \
    --dataloader_num_workers 1 \
    | tee ${log_dir}/train.log \
    2> ${log_dir}/train.err`
How can I fix this problem?
This looks like a CUDA device mismatch issue:
[rank1]: RuntimeError: CUDA error: invalid device ordinal
My guess is that CUDA_VISIBLE_DEVICES was accidentally set somewhere else, so rank 1 is asking for a GPU index that its process cannot see. This thread may help: https://stackoverflow.com/questions/64334033/how-to-solve-runtimeerror-cuda-error-invalid-device-ordinal
Alternatively, try changing the line below, where each x is the index of a GPU you want to use (for example 0,1):
deepspeed_args="--num_gpus=2 --master_port=11000"
to
deepspeed_args="--include localhost:x,x --master_port=11000"
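To make the mismatch concrete, here is a minimal sketch of how a stray CUDA_VISIBLE_DEVICES value can make a local rank invalid. The helper names `visible_device_indices` and `local_rank_is_valid` are hypothetical (not part of LMFlow, DeepSpeed, or PyTorch), and the sketch assumes the variable holds plain integer indices rather than GPU UUIDs. `physical_count` is the number of GPUs physically installed, for example as listed by `nvidia-smi -L`.

```python
from typing import List, Optional


def visible_device_indices(cuda_visible_devices: Optional[str],
                           physical_count: int) -> List[int]:
    """Return the physical GPU indices a process would be able to see.

    Assumption: entries are plain integer indices (UUIDs are not handled).
    """
    if cuda_visible_devices is None:
        # Variable unset: every physical device is visible.
        return list(range(physical_count))
    visible = []
    for entry in cuda_visible_devices.split(","):
        entry = entry.strip()
        if not entry.isdigit() or int(entry) >= physical_count:
            # CUDA ignores everything from the first invalid entry onward.
            break
        visible.append(int(entry))
    return visible


def local_rank_is_valid(local_rank: int,
                        cuda_visible_devices: Optional[str],
                        physical_count: int) -> bool:
    """A local rank is usable only if it indexes into the visible devices."""
    visible = visible_device_indices(cuda_visible_devices, physical_count)
    return 0 <= local_rank < len(visible)


if __name__ == "__main__":
    # Two GPUs installed, but something exported CUDA_VISIBLE_DEVICES=0:
    # rank 1 has no device to bind to, which would surface as
    # "invalid device ordinal".
    print(local_rank_is_valid(1, "0", physical_count=2))
    # With both devices visible, rank 1 is fine.
    print(local_rank_is_valid(1, "0,1", physical_count=2))
```

If the first check fails for your environment, look for a place (shell profile, job scheduler, or the training code itself) that overrides CUDA_VISIBLE_DEVICES after the launcher has set it.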
Thanks, it solves my problem.