DPO training error `RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!`
Describe the bug
I get the following error simply by changing the model from llava1_6-mistral-7b-instruct to llava-onevision-qwen2-0_5b-ov in the first DPO example here.
Command:
CUDA_VISIBLE_DEVICES=0,1,2 \
swift rlhf \
--rlhf_type dpo \
--model_type llava-onevision-qwen2-0_5b-ov \
--beta 0.1 \
--rpo_alpha 0.1 \
--sft_type lora \
--dataset rlaif-v#1000 \
--num_train_epochs 2 \
--lora_target_modules DEFAULT \
--gradient_checkpointing true \
--batch_size 1 \
--learning_rate 5e-5 \
--gradient_accumulation_steps 16 \
--warmup_ratio 0.03 \
--save_total_limit 2
Error:
Train: 0%| | 0/122 [00:00<?, ?it/s]/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/utils/checkpoint.py:92: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn(
Traceback (most recent call last):
File "/VDIL_COREML/m.banerjee/ms-swift/swift/cli/rlhf.py", line 5, in <module>
rlhf_main()
File "/VDIL_COREML/m.banerjee/ms-swift/swift/utils/run_utils.py", line 32, in x_main
result = llm_x(args, **kwargs)
File "/VDIL_COREML/m.banerjee/ms-swift/swift/llm/rlhf.py", line 25, in llm_rlhf
return trainer_train(
File "/VDIL_COREML/m.banerjee/ms-swift/swift/llm/sft.py", line 455, in trainer_train
trainer.train(training_args.resume_from_checkpoint)
File "/VDIL_COREML/m.banerjee/ms-swift/swift/trainers/mixin.py", line 424, in train
res = super().train(resume_from_checkpoint, *args, **kwargs)
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/transformers/trainer.py", line 2022, in train
return inner_training_loop(
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/transformers/trainer.py", line 2358, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/transformers/trainer.py", line 3453, in training_step
loss = self.compute_loss(model, inputs)
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/trl/trainer/dpo_trainer.py", line 1520, in compute_loss
loss, metrics = self.get_batch_loss_metrics(model, inputs, train_eval="train")
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/trl/trainer/dpo_trainer.py", line 1438, in get_batch_loss_metrics
forward_output = self.concatenated_forward(model, batch)
File "/VDIL_COREML/m.banerjee/ms-swift/swift/trainers/mixin.py", line 716, in concatenated_forward
outputs = model(**model_kwargs, use_cache=False)
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1603, in _call_impl
result = forward_call(*args, **kwargs)
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/accelerate/utils/operations.py", line 820, in forward
return model_forward(*args, **kwargs)
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/accelerate/utils/operations.py", line 808, in __call__
return convert_to_fp32(self.model_forward(*args, **kwargs))
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/amp/autocast_mode.py", line 43, in decorate_autocast
return func(*args, **kwargs)
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/peft/peft_model.py", line 1577, in forward
return self.base_model(
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/peft/tuners/tuners_utils.py", line 188, in forward
return self.model.forward(*args, **kwargs)
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/accelerate/hooks.py", line 170, in new_forward
output = module._old_forward(*args, **kwargs)
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/transformers/models/llava_onevision/modeling_llava_onevision.py", line 652, in forward
outputs = self.language_model(
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1160, in forward
outputs = self.model(
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 938, in forward
causal_mask = self._update_causal_mask(
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1050, in _update_causal_mask
causal_mask = _prepare_4d_causal_attention_mask_with_cache_position(
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 109, in _prepare_4d_causal_attention_mask_with_cache_position
padding_mask = causal_mask[:, :, :, :mask_length] + attention_mask[:, None, None, :]
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!
Train: 0%| | 0/122 [00:01<?, ?it/s]
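For context, this looks like the classic cross-device tensor operation that device_map-style model parallelism can produce: part of the Qwen2 decoder (and the causal mask it builds) lands on cuda:1 while the attention_mask from the batch stays on cuda:0. A minimal sketch of the failing addition, assuming at least two visible GPUs (the shapes here are illustrative, not taken from the run):

```python
import torch

# Illustrative shapes only; the point is the device mismatch, not the values.
causal_mask = torch.zeros(1, 1, 8, 8, device="cuda:1")    # built on the layer's device
attention_mask = torch.ones(1, 8, device="cuda:0")        # batch input left on cuda:0

# Same pattern as _prepare_4d_causal_attention_mask_with_cache_position:
padding_mask = causal_mask[:, :, :, :8] + attention_mask[:, None, None, :]
# RuntimeError: Expected all tensors to be on the same device,
# but found at least two devices, cuda:1 and cuda:0!
```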
Your hardware and system info
CUDA Version: 12.4
System: Ubuntu 22.04.3 LTS
GPU
torch==2.4.0
transformers==4.45.0.dev0
trl==0.10.1
peft==0.12.0
NPROC_PER_NODE=3 \
CUDA_VISIBLE_DEVICES=0,1,2 \
swift rlhf \
--rlhf_type dpo \
--model_type llava-onevision-qwen2-0_5b-ov \
--beta 0.1 \
--rpo_alpha 0.1 \
--sft_type lora \
--dataset rlaif-v#1000 \
--num_train_epochs 2 \
--lora_target_modules DEFAULT \
--gradient_checkpointing true \
--batch_size 1 \
--learning_rate 5e-5 \
--gradient_accumulation_steps 16 \
--warmup_ratio 0.03 \
--save_total_limit 2 \
--deepspeed default-zero2
Hello @Jintao-Huang, sorry for the delayed response. Unfortunately, the above solution did not resolve the issue. The updated error with the above command is:
(swift) m.banerjee@PHYVDGPU03PRMV:/VDIL_COREML/m.banerjee/ms-swift$ NPROC_PER_NODE=3 \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
swift rlhf \
--rlhf_type dpo \
--model_type llava-onevision-qwen2-0_5b-ov \
--beta 0.1 \
--rpo_alpha 0.1 \
--sft_type lora \
--dataset rlaif-v#1000 \
--num_train_epochs 2 \
--lora_target_modules DEFAULT \
--gradient_checkpointing true \
--batch_size 1 \
--learning_rate 5e-5 \
--gradient_accumulation_steps 16 \
--warmup_ratio 0.03 \
--save_total_limit 2 \
--deepspeed default-zero2
run sh: `/VDIL_COREML/m.banerjee/anaconda3/envs/swift/bin/python -m torch.distributed.run --nproc_per_node 3 /VDIL_COREML/m.banerjee/ms-swift/swift/cli/rlhf.py --rlhf_type dpo --model_type llava-onevision-qwen2-0_5b-ov --beta 0.1 --rpo_alpha 0.1 --sft_type lora --dataset rlaif-v#1000 --num_train_epochs 2 --lora_target_modules DEFAULT --gradient_checkpointing true --batch_size 1 --learning_rate 5e-5 --gradient_accumulation_steps 16 --warmup_ratio 0.03 --save_total_limit 2 --deepspeed default-zero2`
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Traceback (most recent call last):
File "/VDIL_COREML/m.banerjee/ms-swift/swift/cli/rlhf.py", line 2, in <module>
from swift.llm import rlhf_main
File "/VDIL_COREML/m.banerjee/ms-swift/swift/llm/__init__.py", line 5, in <module>
from .utils import *
File "/VDIL_COREML/m.banerjee/ms-swift/swift/llm/utils/__init__.py", line 3, in <module>
from .argument import (AppUIArguments, DeployArguments, EvalArguments, ExportArguments, InferArguments, PtArguments,
File "/VDIL_COREML/m.banerjee/ms-swift/swift/llm/utils/argument.py", line 27, in <module>
from .client_utils import get_model_list_client
File "/VDIL_COREML/m.banerjee/ms-swift/swift/llm/utils/client_utils.py", line 18, in <module>
from .utils import Messages, history_to_messages
File "/VDIL_COREML/m.banerjee/ms-swift/swift/llm/utils/utils.py", line 1087, in <module>
if is_ddp_plus_mp():
File "/VDIL_COREML/m.banerjee/ms-swift/swift/utils/torch_utils.py", line 137, in is_ddp_plus_mp
if not is_mp():
File "/VDIL_COREML/m.banerjee/ms-swift/swift/utils/torch_utils.py", line 128, in is_mp
assert n_gpu % local_world_size == 0, f'n_gpu: {n_gpu}, local_world_size: {local_world_size}'
AssertionError: n_gpu: 4, local_world_size: 3
Traceback (most recent call last):
File "/VDIL_COREML/m.banerjee/ms-swift/swift/cli/rlhf.py", line 2, in <module>
from swift.llm import rlhf_main
File "/VDIL_COREML/m.banerjee/ms-swift/swift/llm/__init__.py", line 5, in <module>
from .utils import *
File "/VDIL_COREML/m.banerjee/ms-swift/swift/llm/utils/__init__.py", line 3, in <module>
from .argument import (AppUIArguments, DeployArguments, EvalArguments, ExportArguments, InferArguments, PtArguments,
File "/VDIL_COREML/m.banerjee/ms-swift/swift/llm/utils/argument.py", line 27, in <module>
from .client_utils import get_model_list_client
File "/VDIL_COREML/m.banerjee/ms-swift/swift/llm/utils/client_utils.py", line 18, in <module>
from .utils import Messages, history_to_messages
File "/VDIL_COREML/m.banerjee/ms-swift/swift/llm/utils/utils.py", line 1087, in <module>
if is_ddp_plus_mp():
File "/VDIL_COREML/m.banerjee/ms-swift/swift/utils/torch_utils.py", line 137, in is_ddp_plus_mp
if not is_mp():
File "/VDIL_COREML/m.banerjee/ms-swift/swift/utils/torch_utils.py", line 128, in is_mp
assert n_gpu % local_world_size == 0, f'n_gpu: {n_gpu}, local_world_size: {local_world_size}'
AssertionError: n_gpu: 4, local_world_size: 3
Traceback (most recent call last):
File "/VDIL_COREML/m.banerjee/ms-swift/swift/cli/rlhf.py", line 2, in <module>
from swift.llm import rlhf_main
File "/VDIL_COREML/m.banerjee/ms-swift/swift/llm/__init__.py", line 5, in <module>
from .utils import *
File "/VDIL_COREML/m.banerjee/ms-swift/swift/llm/utils/__init__.py", line 3, in <module>
from .argument import (AppUIArguments, DeployArguments, EvalArguments, ExportArguments, InferArguments, PtArguments,
File "/VDIL_COREML/m.banerjee/ms-swift/swift/llm/utils/argument.py", line 27, in <module>
from .client_utils import get_model_list_client
File "/VDIL_COREML/m.banerjee/ms-swift/swift/llm/utils/client_utils.py", line 18, in <module>
from .utils import Messages, history_to_messages
File "/VDIL_COREML/m.banerjee/ms-swift/swift/llm/utils/utils.py", line 1087, in <module>
if is_ddp_plus_mp():
File "/VDIL_COREML/m.banerjee/ms-swift/swift/utils/torch_utils.py", line 137, in is_ddp_plus_mp
if not is_mp():
File "/VDIL_COREML/m.banerjee/ms-swift/swift/utils/torch_utils.py", line 128, in is_mp
assert n_gpu % local_world_size == 0, f'n_gpu: {n_gpu}, local_world_size: {local_world_size}'
AssertionError: n_gpu: 4, local_world_size: 3
W0920 13:05:13.217745 140295491486848 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3628011 closing signal SIGTERM
E0920 13:05:13.225395 140295491486848 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 3628009) of binary: /VDIL_COREML/m.banerjee/anaconda3/envs/swift/bin/python
Traceback (most recent call last):
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/distributed/run.py", line 905, in <module>
main()
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/VDIL_COREML/m.banerjee/ms-swift/swift/cli/rlhf.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-09-20_13:05:13
host : PHYVDGPU03PRMV.na.corp.samsungelectronics.net
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 3628010)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-09-20_13:05:13
host : PHYVDGPU03PRMV.na.corp.samsungelectronics.net
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 3628009)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Please re-open this issue until it is resolved.
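The assertion comes from swift's DDP+MP check: the visible GPUs must split evenly across the processes on the node, and 4 visible devices with NPROC_PER_NODE=3 cannot satisfy that. A simplified sketch of the check, using the numbers from the run above (not a copy of swift's actual code):

```python
# Simplified sketch of the check that fires in swift/utils/torch_utils.py (is_mp),
# with the values from the failing run above.
n_gpu = 4              # CUDA_VISIBLE_DEVICES=0,1,2,3 -> 4 visible devices
local_world_size = 3   # NPROC_PER_NODE=3 -> 3 processes on this node

# DDP+MP only works if the visible GPUs divide evenly among the processes.
assert n_gpu % local_world_size == 0, f'n_gpu: {n_gpu}, local_world_size: {local_world_size}'
# AssertionError: n_gpu: 4, local_world_size: 3
```

Matching NPROC_PER_NODE to the number of visible devices clears the assertion, so the next attempt was: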
NPROC_PER_NODE=4 \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
swift rlhf \
--rlhf_type dpo \
--model_type llava-onevision-qwen2-0_5b-ov \
--beta 0.1 \
--rpo_alpha 0.1 \
--sft_type lora \
--dataset rlaif-v#1000 \
--num_train_epochs 2 \
--lora_target_modules DEFAULT \
--gradient_checkpointing true \
--batch_size 1 \
--learning_rate 5e-5 \
--gradient_accumulation_steps 16 \
--warmup_ratio 0.03 \
--save_total_limit 2 \
--deepspeed default-zero2
Current command and error:
Command:
NPROC_PER_NODE=8 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift rlhf \
--rlhf_type dpo \
--model_type llava-onevision-qwen2-0_5b-ov \
--beta 0.1 \
--rpo_alpha 0.1 \
--sft_type lora \
--dataset rlaif-v#1000 \
--num_train_epochs 2 \
--lora_target_modules DEFAULT \
--gradient_checkpointing true \
--batch_size 1 \
--learning_rate 5e-5 \
--gradient_accumulation_steps 16 \
--warmup_ratio 0.03 \
--save_total_limit 2 \
--deepspeed default-zero2
Error:
Train: 0%| | 0/14 [00:00<?, ?it/s]/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/utils/checkpoint.py:92: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn(
/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/utils/checkpoint.py:92: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn(
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined]
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined]
/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined]
/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/utils/checkpoint.py:92: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn(
/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined]
/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined]
/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined]
/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined]
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined]
[rank1]: Traceback (most recent call last):
[rank1]: File "/VDIL_COREML/m.banerjee/ms-swift/swift/cli/rlhf.py", line 5, in <module>
[rank1]: rlhf_main()
[rank1]: File "/VDIL_COREML/m.banerjee/ms-swift/swift/utils/run_utils.py", line 32, in x_main
[rank1]: result = llm_x(args, **kwargs)
[rank1]: File "/VDIL_COREML/m.banerjee/ms-swift/swift/llm/rlhf.py", line 25, in llm_rlhf
[rank1]: return trainer_train(
[rank1]: File "/VDIL_COREML/m.banerjee/ms-swift/swift/llm/sft.py", line 456, in trainer_train
[rank1]: trainer.train(training_args.resume_from_checkpoint)
[rank1]: File "/VDIL_COREML/m.banerjee/ms-swift/swift/trainers/mixin.py", line 424, in train
[rank1]: res = super().train(resume_from_checkpoint, *args, **kwargs)
[rank1]: File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/transformers/trainer.py", line 2022, in train
[rank1]: return inner_training_loop(
[rank1]: File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/transformers/trainer.py", line 2358, in _inner_training_loop
[rank1]: tr_loss_step = self.training_step(model, inputs)
[rank1]: File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/transformers/trainer.py", line 3453, in training_step
[rank1]: loss = self.compute_loss(model, inputs)
[rank1]: File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/trl/trainer/dpo_trainer.py", line 1520, in compute_loss
[rank1]: loss, metrics = self.get_batch_loss_metrics(model, inputs, train_eval="train")
[rank1]: File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/trl/trainer/dpo_trainer.py", line 1467, in get_batch_loss_metrics
[rank1]: ) = self.concatenated_forward(self.model, batch)
[rank1]: File "/VDIL_COREML/m.banerjee/ms-swift/swift/trainers/mixin.py", line 739, in concatenated_forward
[rank1]: return super().concatenated_forward(model, model_kwargs)
[rank1]: File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/trl/trainer/dpo_trainer.py", line 1390, in concatenated_forward
[rank1]: all_logps, size_completion = self.get_batch_logps(
[rank1]: File "/VDIL_COREML/m.banerjee/ms-swift/swift/trainers/mixin.py", line 744, in get_batch_logps
[rank1]: return super().get_batch_logps(logits, labels, *args, **kwargs)
[rank1]: File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/trl/trainer/dpo_trainer.py", line 1342, in get_batch_logps
[rank1]: per_token_logps = torch.gather(logits.log_softmax(-1), dim=2, index=labels.unsqueeze(2)).squeeze(2)
[rank1]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 8.36 GiB. GPU 1 has a total capacity of 47.50 GiB of which 1.52 GiB is free. Including non-PyTorch memory, this process has 45.97 GiB memory in use. Of the allocated memory 40.30 GiB is allocated by PyTorch, and 4.91 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank6]: Traceback (most recent call last):
[rank6]: File "/VDIL_COREML/m.banerjee/ms-swift/swift/cli/rlhf.py", line 5, in <module>
[rank6]: rlhf_main()
[rank6]: File "/VDIL_COREML/m.banerjee/ms-swift/swift/utils/run_utils.py", line 32, in x_main
[rank6]: result = llm_x(args, **kwargs)
[rank6]: File "/VDIL_COREML/m.banerjee/ms-swift/swift/llm/rlhf.py", line 25, in llm_rlhf
[rank6]: return trainer_train(
[rank6]: File "/VDIL_COREML/m.banerjee/ms-swift/swift/llm/sft.py", line 456, in trainer_train
[rank6]: trainer.train(training_args.resume_from_checkpoint)
[rank6]: File "/VDIL_COREML/m.banerjee/ms-swift/swift/trainers/mixin.py", line 424, in train
[rank6]: res = super().train(resume_from_checkpoint, *args, **kwargs)
[rank6]: File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/transformers/trainer.py", line 2022, in train
[rank6]: return inner_training_loop(
[rank6]: File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/transformers/trainer.py", line 2358, in _inner_training_loop
[rank6]: tr_loss_step = self.training_step(model, inputs)
[rank6]: File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/transformers/trainer.py", line 3453, in training_step
[rank6]: loss = self.compute_loss(model, inputs)
[rank6]: File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/trl/trainer/dpo_trainer.py", line 1520, in compute_loss
[rank6]: loss, metrics = self.get_batch_loss_metrics(model, inputs, train_eval="train")
[rank6]: File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/trl/trainer/dpo_trainer.py", line 1467, in get_batch_loss_metrics
[rank6]: ) = self.concatenated_forward(self.model, batch)
[rank6]: File "/VDIL_COREML/m.banerjee/ms-swift/swift/trainers/mixin.py", line 739, in concatenated_forward
[rank6]: return super().concatenated_forward(model, model_kwargs)
[rank6]: File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/trl/trainer/dpo_trainer.py", line 1390, in concatenated_forward
[rank6]: all_logps, size_completion = self.get_batch_logps(
[rank6]: File "/VDIL_COREML/m.banerjee/ms-swift/swift/trainers/mixin.py", line 744, in get_batch_logps
[rank6]: return super().get_batch_logps(logits, labels, *args, **kwargs)
[rank6]: File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/trl/trainer/dpo_trainer.py", line 1342, in get_batch_logps
[rank6]: per_token_logps = torch.gather(logits.log_softmax(-1), dim=2, index=labels.unsqueeze(2)).squeeze(2)
[rank6]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 8.39 GiB. GPU 6 has a total capacity of 47.50 GiB of which 670.31 MiB is free. Including non-PyTorch memory, this process has 46.84 GiB memory in use. Of the allocated memory 40.49 GiB is allocated by PyTorch, and 5.59 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
W0922 11:49:04.357401 139816835773568 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3968359 closing signal SIGTERM
W0922 11:49:04.363104 139816835773568 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3968369 closing signal SIGTERM
W0922 11:49:04.365078 139816835773568 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3968370 closing signal SIGTERM
W0922 11:49:04.368553 139816835773568 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3968372 closing signal SIGTERM
W0922 11:49:04.370596 139816835773568 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3968375 closing signal SIGTERM
W0922 11:49:04.376917 139816835773568 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3968376 closing signal SIGTERM
W0922 11:49:04.380629 139816835773568 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3968382 closing signal SIGTERM
E0922 11:49:05.362424 139816835773568 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 1 (pid: 3968366) of binary: /VDIL_COREML/m.banerjee/anaconda3/envs/swift/bin/python
Traceback (most recent call last):
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/distributed/run.py", line 905, in <module>
main()
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/VDIL_COREML/m.banerjee/ms-swift/swift/cli/rlhf.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-09-22_11:49:04
host : PHYVDGPU03PRMV.na.corp.samsungelectronics.net
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 3968366)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
I am using a single node with 8 x NVIDIA RTX 6000 Ada GPUs. The model llava-onevision-qwen2-0_5b-ov has a 0.5B-parameter language model with the siglip-so400m-patch14-384 vision tower, so it should not run out of memory on 8 NVIDIA RTX 6000 Ada GPUs.
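For a rough sense of where the ~8 GiB allocation comes from: get_batch_logps takes a log_softmax over the full vocabulary in FP32 for the concatenated chosen+rejected sequences, so that buffer scales with sequence length x vocabulary size rather than with the 0.5B parameters. A back-of-the-envelope estimate, assuming Qwen2's ~152k vocabulary and a sequence length dominated by llava-onevision image tokens (both numbers are my assumptions, not taken from the log):

```python
# Back-of-the-envelope estimate of the log_softmax buffer in get_batch_logps.
vocab_size = 151_936      # assumption: Qwen2 tokenizer vocabulary
bytes_per_value = 4       # FP32 (logits are upcast at train time)
seq_len = 7_400           # assumption: prompt plus several thousand image tokens
batch = 2                 # chosen + rejected, concatenated for one preference pair

gib = batch * seq_len * vocab_size * bytes_per_value / 2**30
print(f"log_softmax buffer ~ {gib:.1f} GiB")   # ~8.4 GiB, matching the OOM message
```

So the OOM is driven by long multimodal sequences and the full-vocabulary log-probabilities, not by the model size; the PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True hint in the error message may also help with fragmentation, but it does not change this per-step peak.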
Hello, I also encountered a similar problem here. The model I trained is InternVL2-8B, on 8 x A100 40G GPUs. I have tried various methods for DPO training; here are some of my experiences. First I tried DeepSpeed; unfortunately, even ZeRO-3 could not train. Then I tried the DDP+MP method from the best practices, which is the method you used, and I hit the same OOM problem after some training steps (the same as you). In the end I chose the plain MP method, and it worked; compared with the first two methods it is more time-consuming, but at least it works.

From my analysis, my OOM was caused by the data: DPO concatenates the chosen and rejected responses, which is effectively batch_size=2 during training, and my samples plus their image tokens are simply too long to fit for DPO in this environment. As I understand it, DPO with swift behaves much like PEFT fine-tuning in this respect; when I use swift for LoRA training, the batch_size can also only be set to 1 at most. Hope my answer can help you.