DPO training error `RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!`
Describe the bug
I get the following error simply by changing the model from llava1_6-mistral-7b-instruct to llava-onevision-qwen2-0_5b-ov in the first DPO example here.
Command:
CUDA_VISIBLE_DEVICES=0,1,2 \
swift rlhf \
--rlhf_type dpo \
--model_type llava-onevision-qwen2-0_5b-ov \
--beta 0.1 \
--rpo_alpha 0.1 \
--sft_type lora \
--dataset rlaif-v#1000 \
--num_train_epochs 2 \
--lora_target_modules DEFAULT \
--gradient_checkpointing true \
--batch_size 1 \
--learning_rate 5e-5 \
--gradient_accumulation_steps 16 \
--warmup_ratio 0.03 \
--save_total_limit 2
Error:
Train: 0%| | 0/122 [00:00<?, ?it/s]/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/utils/checkpoint.py:92: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn(
Traceback (most recent call last):
File "/VDIL_COREML/m.banerjee/ms-swift/swift/cli/rlhf.py", line 5, in <module>
rlhf_main()
File "/VDIL_COREML/m.banerjee/ms-swift/swift/utils/run_utils.py", line 32, in x_main
result = llm_x(args, **kwargs)
File "/VDIL_COREML/m.banerjee/ms-swift/swift/llm/rlhf.py", line 25, in llm_rlhf
return trainer_train(
File "/VDIL_COREML/m.banerjee/ms-swift/swift/llm/sft.py", line 455, in trainer_train
trainer.train(training_args.resume_from_checkpoint)
File "/VDIL_COREML/m.banerjee/ms-swift/swift/trainers/mixin.py", line 424, in train
res = super().train(resume_from_checkpoint, *args, **kwargs)
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/transformers/trainer.py", line 2022, in train
return inner_training_loop(
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/transformers/trainer.py", line 2358, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/transformers/trainer.py", line 3453, in training_step
loss = self.compute_loss(model, inputs)
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/trl/trainer/dpo_trainer.py", line 1520, in compute_loss
loss, metrics = self.get_batch_loss_metrics(model, inputs, train_eval="train")
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/trl/trainer/dpo_trainer.py", line 1438, in get_batch_loss_metrics
forward_output = self.concatenated_forward(model, batch)
File "/VDIL_COREML/m.banerjee/ms-swift/swift/trainers/mixin.py", line 716, in concatenated_forward
outputs = model(**model_kwargs, use_cache=False)
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1603, in _call_impl
result = forward_call(*args, **kwargs)
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/accelerate/utils/operations.py", line 820, in forward
return model_forward(*args, **kwargs)
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/accelerate/utils/operations.py", line 808, in __call__
return convert_to_fp32(self.model_forward(*args, **kwargs))
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/amp/autocast_mode.py", line 43, in decorate_autocast
return func(*args, **kwargs)
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/peft/peft_model.py", line 1577, in forward
return self.base_model(
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/peft/tuners/tuners_utils.py", line 188, in forward
return self.model.forward(*args, **kwargs)
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/accelerate/hooks.py", line 170, in new_forward
output = module._old_forward(*args, **kwargs)
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/transformers/models/llava_onevision/modeling_llava_onevision.py", line 652, in forward
outputs = self.language_model(
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1160, in forward
outputs = self.model(
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 938, in forward
causal_mask = self._update_causal_mask(
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1050, in _update_causal_mask
causal_mask = _prepare_4d_causal_attention_mask_with_cache_position(
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 109, in _prepare_4d_causal_attention_mask_with_cache_position
padding_mask = causal_mask[:, :, :, :mask_length] + attention_mask[:, None, None, :]
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!
Train: 0%| | 0/122 [00:01<?, ?it/s]
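For context, this looks like the classic cross-device tensor operation that device_map-style model parallelism can produce: part of the Qwen2 decoder (and the causal mask it builds) lands on cuda:1 while the attention_mask from the batch stays on cuda:0. A minimal sketch of the failing addition, assuming at least two visible GPUs (the shapes here are illustrative, not taken from the run):

```python
import torch

# Illustrative shapes only; the point is the device mismatch, not the values.
causal_mask = torch.zeros(1, 1, 8, 8, device="cuda:1")    # built on the layer's device
attention_mask = torch.ones(1, 8, device="cuda:0")        # batch input left on cuda:0

# Same pattern as _prepare_4d_causal_attention_mask_with_cache_position:
padding_mask = causal_mask[:, :, :, :8] + attention_mask[:, None, None, :]
# RuntimeError: Expected all tensors to be on the same device,
# but found at least two devices, cuda:1 and cuda:0!
```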
Your hardware and system info
CUDA Version: 12.4
System: Ubuntu 22.04.3 LTS
GPU
torch==2.4.0
transformers==4.45.0.dev0
trl==0.10.1
peft==0.12.0
NPROC_PER_NODE=3 \
CUDA_VISIBLE_DEVICES=0,1,2 \
swift rlhf \
--rlhf_type dpo \
--model_type llava-onevision-qwen2-0_5b-ov \
--beta 0.1 \
--rpo_alpha 0.1 \
--sft_type lora \
--dataset rlaif-v#1000 \
--num_train_epochs 2 \
--lora_target_modules DEFAULT \
--gradient_checkpointing true \
--batch_size 1 \
--learning_rate 5e-5 \
--gradient_accumulation_steps 16 \
--warmup_ratio 0.03 \
--save_total_limit 2 \
--deepspeed default-zero2
Hello @Jintao-Huang, sorry for the delayed response. Unfortunately, the above solution did not resolve the issue. The updated error with the above command is:
(swift) m.banerjee@PHYVDGPU03PRMV:/VDIL_COREML/m.banerjee/ms-swift$ NPROC_PER_NODE=3 \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
swift rlhf \
--rlhf_type dpo \
--model_type llava-onevision-qwen2-0_5b-ov \
--beta 0.1 \
--rpo_alpha 0.1 \
--sft_type lora \
--dataset rlaif-v#1000 \
--num_train_epochs 2 \
--lora_target_modules DEFAULT \
--gradient_checkpointing true \
--batch_size 1 \
--learning_rate 5e-5 \
--gradient_accumulation_steps 16 \
--warmup_ratio 0.03 \
--save_total_limit 2 \
--deepspeed default-zero2
run sh: `/VDIL_COREML/m.banerjee/anaconda3/envs/swift/bin/python -m torch.distributed.run --nproc_per_node 3 /VDIL_COREML/m.banerjee/ms-swift/swift/cli/rlhf.py --rlhf_type dpo --model_type llava-onevision-qwen2-0_5b-ov --beta 0.1 --rpo_alpha 0.1 --sft_type lora --dataset rlaif-v#1000 --num_train_epochs 2 --lora_target_modules DEFAULT --gradient_checkpointing true --batch_size 1 --learning_rate 5e-5 --gradient_accumulation_steps 16 --warmup_ratio 0.03 --save_total_limit 2 --deepspeed default-zero2`
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Traceback (most recent call last):
File "/VDIL_COREML/m.banerjee/ms-swift/swift/cli/rlhf.py", line 2, in <module>
from swift.llm import rlhf_main
File "/VDIL_COREML/m.banerjee/ms-swift/swift/llm/__init__.py", line 5, in <module>
from .utils import *
File "/VDIL_COREML/m.banerjee/ms-swift/swift/llm/utils/__init__.py", line 3, in <module>
from .argument import (AppUIArguments, DeployArguments, EvalArguments, ExportArguments, InferArguments, PtArguments,
File "/VDIL_COREML/m.banerjee/ms-swift/swift/llm/utils/argument.py", line 27, in <module>
from .client_utils import get_model_list_client
File "/VDIL_COREML/m.banerjee/ms-swift/swift/llm/utils/client_utils.py", line 18, in <module>
from .utils import Messages, history_to_messages
File "/VDIL_COREML/m.banerjee/ms-swift/swift/llm/utils/utils.py", line 1087, in <module>
if is_ddp_plus_mp():
File "/VDIL_COREML/m.banerjee/ms-swift/swift/utils/torch_utils.py", line 137, in is_ddp_plus_mp
if not is_mp():
File "/VDIL_COREML/m.banerjee/ms-swift/swift/utils/torch_utils.py", line 128, in is_mp
assert n_gpu % local_world_size == 0, f'n_gpu: {n_gpu}, local_world_size: {local_world_size}'
AssertionError: n_gpu: 4, local_world_size: 3
Traceback (most recent call last):
File "/VDIL_COREML/m.banerjee/ms-swift/swift/cli/rlhf.py", line 2, in <module>
from swift.llm import rlhf_main
File "/VDIL_COREML/m.banerjee/ms-swift/swift/llm/__init__.py", line 5, in <module>
from .utils import *
File "/VDIL_COREML/m.banerjee/ms-swift/swift/llm/utils/__init__.py", line 3, in <module>
from .argument import (AppUIArguments, DeployArguments, EvalArguments, ExportArguments, InferArguments, PtArguments,
File "/VDIL_COREML/m.banerjee/ms-swift/swift/llm/utils/argument.py", line 27, in <module>
from .client_utils import get_model_list_client
File "/VDIL_COREML/m.banerjee/ms-swift/swift/llm/utils/client_utils.py", line 18, in <module>
from .utils import Messages, history_to_messages
File "/VDIL_COREML/m.banerjee/ms-swift/swift/llm/utils/utils.py", line 1087, in <module>
if is_ddp_plus_mp():
File "/VDIL_COREML/m.banerjee/ms-swift/swift/utils/torch_utils.py", line 137, in is_ddp_plus_mp
if not is_mp():
File "/VDIL_COREML/m.banerjee/ms-swift/swift/utils/torch_utils.py", line 128, in is_mp
assert n_gpu % local_world_size == 0, f'n_gpu: {n_gpu}, local_world_size: {local_world_size}'
AssertionError: n_gpu: 4, local_world_size: 3
Traceback (most recent call last):
File "/VDIL_COREML/m.banerjee/ms-swift/swift/cli/rlhf.py", line 2, in <module>
from swift.llm import rlhf_main
File "/VDIL_COREML/m.banerjee/ms-swift/swift/llm/__init__.py", line 5, in <module>
from .utils import *
File "/VDIL_COREML/m.banerjee/ms-swift/swift/llm/utils/__init__.py", line 3, in <module>
from .argument import (AppUIArguments, DeployArguments, EvalArguments, ExportArguments, InferArguments, PtArguments,
File "/VDIL_COREML/m.banerjee/ms-swift/swift/llm/utils/argument.py", line 27, in <module>
from .client_utils import get_model_list_client
File "/VDIL_COREML/m.banerjee/ms-swift/swift/llm/utils/client_utils.py", line 18, in <module>
from .utils import Messages, history_to_messages
File "/VDIL_COREML/m.banerjee/ms-swift/swift/llm/utils/utils.py", line 1087, in <module>
if is_ddp_plus_mp():
File "/VDIL_COREML/m.banerjee/ms-swift/swift/utils/torch_utils.py", line 137, in is_ddp_plus_mp
if not is_mp():
File "/VDIL_COREML/m.banerjee/ms-swift/swift/utils/torch_utils.py", line 128, in is_mp
assert n_gpu % local_world_size == 0, f'n_gpu: {n_gpu}, local_world_size: {local_world_size}'
AssertionError: n_gpu: 4, local_world_size: 3
W0920 13:05:13.217745 140295491486848 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3628011 closing signal SIGTERM
E0920 13:05:13.225395 140295491486848 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 3628009) of binary: /VDIL_COREML/m.banerjee/anaconda3/envs/swift/bin/python
Traceback (most recent call last):
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/distributed/run.py", line 905, in <module>
main()
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/VDIL_COREML/m.banerjee/ms-swift/swift/cli/rlhf.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-09-20_13:05:13
host : PHYVDGPU03PRMV.na.corp.samsungelectronics.net
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 3628010)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-09-20_13:05:13
host : PHYVDGPU03PRMV.na.corp.samsungelectronics.net
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 3628009)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Please re-open this issue until it is resolved.
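The assertion comes from swift's DDP+MP check: the visible GPUs must split evenly across the processes on the node, and 4 visible devices with NPROC_PER_NODE=3 cannot satisfy that. A simplified sketch of the check, using the numbers from the run above (not a copy of swift's actual code):

```python
# Simplified sketch of the check that fires in swift/utils/torch_utils.py (is_mp),
# with the values from the failing run above.
n_gpu = 4              # CUDA_VISIBLE_DEVICES=0,1,2,3 -> 4 visible devices
local_world_size = 3   # NPROC_PER_NODE=3 -> 3 processes on this node

# DDP+MP only works if the visible GPUs divide evenly among the processes.
assert n_gpu % local_world_size == 0, f'n_gpu: {n_gpu}, local_world_size: {local_world_size}'
# AssertionError: n_gpu: 4, local_world_size: 3
```

Matching NPROC_PER_NODE to the number of visible devices clears the assertion, so the next attempt was: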
NPROC_PER_NODE=4 \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
swift rlhf \
--rlhf_type dpo \
--model_type llava-onevision-qwen2-0_5b-ov \
--beta 0.1 \
--rpo_alpha 0.1 \
--sft_type lora \
--dataset rlaif-v#1000 \
--num_train_epochs 2 \
--lora_target_modules DEFAULT \
--gradient_checkpointing true \
--batch_size 1 \
--learning_rate 5e-5 \
--gradient_accumulation_steps 16 \
--warmup_ratio 0.03 \
--save_total_limit 2 \
--deepspeed default-zero2
Current command and error:
Command:
NPROC_PER_NODE=8 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift rlhf \
--rlhf_type dpo \
--model_type llava-onevision-qwen2-0_5b-ov \
--beta 0.1 \
--rpo_alpha 0.1 \
--sft_type lora \
--dataset rlaif-v#1000 \
--num_train_epochs 2 \
--lora_target_modules DEFAULT \
--gradient_checkpointing true \
--batch_size 1 \
--learning_rate 5e-5 \
--gradient_accumulation_steps 16 \
--warmup_ratio 0.03 \
--save_total_limit 2 \
--deepspeed default-zero2
Error:
Train: 0%| | 0/14 [00:00<?, ?it/s]/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/utils/checkpoint.py:92: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn(
/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/utils/checkpoint.py:92: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn(
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined]
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined]
/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined]
/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/utils/checkpoint.py:92: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn(
/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined]
/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined]
/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined]
/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined]
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined]
[rank1]: Traceback (most recent call last):
[rank1]: File "/VDIL_COREML/m.banerjee/ms-swift/swift/cli/rlhf.py", line 5, in <module>
[rank1]: rlhf_main()
[rank1]: File "/VDIL_COREML/m.banerjee/ms-swift/swift/utils/run_utils.py", line 32, in x_main
[rank1]: result = llm_x(args, **kwargs)
[rank1]: File "/VDIL_COREML/m.banerjee/ms-swift/swift/llm/rlhf.py", line 25, in llm_rlhf
[rank1]: return trainer_train(
[rank1]: File "/VDIL_COREML/m.banerjee/ms-swift/swift/llm/sft.py", line 456, in trainer_train
[rank1]: trainer.train(training_args.resume_from_checkpoint)
[rank1]: File "/VDIL_COREML/m.banerjee/ms-swift/swift/trainers/mixin.py", line 424, in train
[rank1]: res = super().train(resume_from_checkpoint, *args, **kwargs)
[rank1]: File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/transformers/trainer.py", line 2022, in train
[rank1]: return inner_training_loop(
[rank1]: File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/transformers/trainer.py", line 2358, in _inner_training_loop
[rank1]: tr_loss_step = self.training_step(model, inputs)
[rank1]: File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/transformers/trainer.py", line 3453, in training_step
[rank1]: loss = self.compute_loss(model, inputs)
[rank1]: File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/trl/trainer/dpo_trainer.py", line 1520, in compute_loss
[rank1]: loss, metrics = self.get_batch_loss_metrics(model, inputs, train_eval="train")
[rank1]: File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/trl/trainer/dpo_trainer.py", line 1467, in get_batch_loss_metrics
[rank1]: ) = self.concatenated_forward(self.model, batch)
[rank1]: File "/VDIL_COREML/m.banerjee/ms-swift/swift/trainers/mixin.py", line 739, in concatenated_forward
[rank1]: return super().concatenated_forward(model, model_kwargs)
[rank1]: File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/trl/trainer/dpo_trainer.py", line 1390, in concatenated_forward
[rank1]: all_logps, size_completion = self.get_batch_logps(
[rank1]: File "/VDIL_COREML/m.banerjee/ms-swift/swift/trainers/mixin.py", line 744, in get_batch_logps
[rank1]: return super().get_batch_logps(logits, labels, *args, **kwargs)
[rank1]: File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/trl/trainer/dpo_trainer.py", line 1342, in get_batch_logps
[rank1]: per_token_logps = torch.gather(logits.log_softmax(-1), dim=2, index=labels.unsqueeze(2)).squeeze(2)
[rank1]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 8.36 GiB. GPU 1 has a total capacity of 47.50 GiB of which 1.52 GiB is free. Including non-PyTorch memory, this process has 45.97 GiB memory in use. Of the allocated memory 40.30 GiB is allocated by PyTorch, and 4.91 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank6]: Traceback (most recent call last):
[rank6]: File "/VDIL_COREML/m.banerjee/ms-swift/swift/cli/rlhf.py", line 5, in <module>
[rank6]: rlhf_main()
[rank6]: File "/VDIL_COREML/m.banerjee/ms-swift/swift/utils/run_utils.py", line 32, in x_main
[rank6]: result = llm_x(args, **kwargs)
[rank6]: File "/VDIL_COREML/m.banerjee/ms-swift/swift/llm/rlhf.py", line 25, in llm_rlhf
[rank6]: return trainer_train(
[rank6]: File "/VDIL_COREML/m.banerjee/ms-swift/swift/llm/sft.py", line 456, in trainer_train
[rank6]: trainer.train(training_args.resume_from_checkpoint)
[rank6]: File "/VDIL_COREML/m.banerjee/ms-swift/swift/trainers/mixin.py", line 424, in train
[rank6]: res = super().train(resume_from_checkpoint, *args, **kwargs)
[rank6]: File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/transformers/trainer.py", line 2022, in train
[rank6]: return inner_training_loop(
[rank6]: File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/transformers/trainer.py", line 2358, in _inner_training_loop
[rank6]: tr_loss_step = self.training_step(model, inputs)
[rank6]: File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/transformers/trainer.py", line 3453, in training_step
[rank6]: loss = self.compute_loss(model, inputs)
[rank6]: File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/trl/trainer/dpo_trainer.py", line 1520, in compute_loss
[rank6]: loss, metrics = self.get_batch_loss_metrics(model, inputs, train_eval="train")
[rank6]: File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/trl/trainer/dpo_trainer.py", line 1467, in get_batch_loss_metrics
[rank6]: ) = self.concatenated_forward(self.model, batch)
[rank6]: File "/VDIL_COREML/m.banerjee/ms-swift/swift/trainers/mixin.py", line 739, in concatenated_forward
[rank6]: return super().concatenated_forward(model, model_kwargs)
[rank6]: File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/trl/trainer/dpo_trainer.py", line 1390, in concatenated_forward
[rank6]: all_logps, size_completion = self.get_batch_logps(
[rank6]: File "/VDIL_COREML/m.banerjee/ms-swift/swift/trainers/mixin.py", line 744, in get_batch_logps
[rank6]: return super().get_batch_logps(logits, labels, *args, **kwargs)
[rank6]: File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/trl/trainer/dpo_trainer.py", line 1342, in get_batch_logps
[rank6]: per_token_logps = torch.gather(logits.log_softmax(-1), dim=2, index=labels.unsqueeze(2)).squeeze(2)
[rank6]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 8.39 GiB. GPU 6 has a total capacity of 47.50 GiB of which 670.31 MiB is free. Including non-PyTorch memory, this process has 46.84 GiB memory in use. Of the allocated memory 40.49 GiB is allocated by PyTorch, and 5.59 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
W0922 11:49:04.357401 139816835773568 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3968359 closing signal SIGTERM
W0922 11:49:04.363104 139816835773568 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3968369 closing signal SIGTERM
W0922 11:49:04.365078 139816835773568 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3968370 closing signal SIGTERM
W0922 11:49:04.368553 139816835773568 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3968372 closing signal SIGTERM
W0922 11:49:04.370596 139816835773568 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3968375 closing signal SIGTERM
W0922 11:49:04.376917 139816835773568 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3968376 closing signal SIGTERM
W0922 11:49:04.380629 139816835773568 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3968382 closing signal SIGTERM
E0922 11:49:05.362424 139816835773568 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 1 (pid: 3968366) of binary: /VDIL_COREML/m.banerjee/anaconda3/envs/swift/bin/python
Traceback (most recent call last):
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/distributed/run.py", line 905, in <module>
main()
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/VDIL_COREML/m.banerjee/anaconda3/envs/swift/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/VDIL_COREML/m.banerjee/ms-swift/swift/cli/rlhf.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-09-22_11:49:04
host : PHYVDGPU03PRMV.na.corp.samsungelectronics.net
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 3968366)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
I am using a single node with 8 x NVIDIA RTX 6000 Ada GPUs. The model llava-onevision-qwen2-0_5b-ov has a 0.5B-parameter language model with the siglip-so400m-patch14-384 vision tower, so it should not run out of memory on 8 NVIDIA RTX 6000 Ada GPUs.
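For a rough sense of where the ~8 GiB allocation comes from: get_batch_logps takes a log_softmax over the full vocabulary in FP32 for the concatenated chosen+rejected sequences, so that buffer scales with sequence length x vocabulary size rather than with the 0.5B parameters. A back-of-the-envelope estimate, assuming Qwen2's ~152k vocabulary and a sequence length dominated by llava-onevision image tokens (both numbers are my assumptions, not taken from the log):

```python
# Back-of-the-envelope estimate of the log_softmax buffer in get_batch_logps.
vocab_size = 151_936      # assumption: Qwen2 tokenizer vocabulary
bytes_per_value = 4       # FP32 (logits are upcast at train time)
seq_len = 7_400           # assumption: prompt plus several thousand image tokens
batch = 2                 # chosen + rejected, concatenated for one preference pair

gib = batch * seq_len * vocab_size * bytes_per_value / 2**30
print(f"log_softmax buffer ~ {gib:.1f} GiB")   # ~8.4 GiB, matching the OOM message
```

So the OOM is driven by long multimodal sequences and the full-vocabulary log-probabilities, not by the model size; the PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True hint in the error message may also help with fragmentation, but it does not change this per-step peak.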
Hello, I also encountered a similar problem here. The model I trained is InternVL2-8B, on 8 x A100 40G GPUs. I have tried various methods for DPO training; here are some of my experiences. First I tried DeepSpeed; unfortunately, even ZeRO-3 could not train. Then I tried the DDP+MP method from the best practices, which is the method you used, and I hit the same OOM problem after some training steps (the same as you). In the end I chose the plain MP method, and it worked; compared with the first two methods it is more time-consuming, but at least it works.

From my analysis, my OOM was caused by the data: DPO concatenates the chosen and rejected responses, which is effectively batch_size=2 during training, and my samples plus their image tokens are simply too long to fit for DPO in this environment. As I understand it, DPO with swift behaves much like PEFT fine-tuning in this respect; when I use swift for LoRA training, the batch_size can also only be set to 1 at most. Hope my answer can help you.