
An error occurred during DPO on NVIDIA GPU

Open yoyo20010808 opened this issue 1 year ago • 1 comments

I have changed some parameters in the training code as instructed, but when I run DPO on 8×A6000 GPUs, I get the errors below. If I understand correctly, Habana is only needed for HPU training.

Details

Traceback (most recent call last):
  File "/data1/yoyo/intel-extension-for-transformers/intel_extension_for_transformers/neural_chat/examples/finetuning/dpo_pipeline/dpo_clm.py", line 219, in <module>
    model_args, data_args, training_args, finetune_args = parser.parse_args_into_dataclasses()
  File "/root/anaconda3/envs/intel_eft/lib/python3.10/site-packages/transformers/hf_argparser.py", line 338, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 132, in __init__
  File "/root/anaconda3/envs/intel_eft/lib/python3.10/site-packages/optimum/habana/transformers/training_args.py", line 522, in __post_init__
    device_is_hpu = self.device.type == "hpu"
  File "/root/anaconda3/envs/intel_eft/lib/python3.10/site-packages/transformers/training_args.py", line 1901, in device
    return self._setup_devices
  File "/root/anaconda3/envs/intel_eft/lib/python3.10/site-packages/transformers/utils/generic.py", line 54, in __get__
    cached = self.fget(obj)
  File "/root/anaconda3/envs/intel_eft/lib/python3.10/site-packages/optimum/habana/transformers/training_args.py", line 679, in _setup_devices
    self.distributed_state = GaudiPartialState(cpu=False, backend=self.ddp_backend)
  File "/root/anaconda3/envs/intel_eft/lib/python3.10/site-packages/optimum/habana/accelerate/state.py", line 83, in __init__
    self.device = torch.device("cpu") if cpu else self.default_device
  File "/root/anaconda3/envs/intel_eft/lib/python3.10/site-packages/optimum/habana/accelerate/state.py", line 123, in default_device
    import habana_frameworks.torch.hpu as hthpu
ModuleNotFoundError: No module named 'habana_frameworks'

This is the training script (I don't know how to set --device, so I just added that parameter):
Details

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python dpo_clm.py \
    --model_name_or_path "/data1/yoyo/intel-extension-for-transformers/data/Mistral-7B-v0.1" \
    --output_dir "/data1/yoyo/intel-extension-for-transformers/out/dpo_test" \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --learning_rate 5e-4 \
    --max_steps 1000 \
    --save_steps 10 \
    --lora_alpha 16 \
    --lora_rank 16 \
    --lora_dropout 0.05 \
    --dataset_name Intel/orca_dpo_pairs \
    --bf16 \
    --use_auth_token True \
    --use_habana False \
    --use_lazy_mode False \
    --device "auto"

Also, when I run SFT (finetune_neuralchat_v3.py), accelerate is automatically set to CPU:

Details

[INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cpu (auto detect)

"No device has been set. Use either --use_habana to run on HPU or --no_cuda to run on CPU."

Operating system: CentOS 7
Python: 3.10
torch: 2.1.0
CUDA: 12.2
optimum-habana: 1.9.0
transformers: 4.34.1
accelerate: 0.25.0

yoyo20010808 avatar Dec 09 '23 10:12 yoyo20010808

hi,

  1. On an NVIDIA GPU you don't need to install optimum-habana: the code calls 'is_optimum_habana_available()' to check for a Habana device. You can uninstall that package and drop the "--use_habana" and "--use_lazy_mode" flags.
  2. "DPOTrainer" inherits from the huggingface/transformers "Trainer", so device selection works the same way: if the environment has a GPU, the code detects and uses it. If you pass "--use_cpu", it will run on CPU.
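The device-selection order described in point 2 can be sketched roughly as follows. This is a simplified, dependency-free illustration of the Trainer's behavior, not the actual transformers code (the real logic also handles MPS, XPU, and distributed setups):

```python
def resolve_device(use_cpu: bool, cuda_available: bool) -> str:
    """Simplified sketch of how a transformers-style Trainer picks a device:
    an explicit use_cpu flag always wins; otherwise CUDA is used when it is
    available, with CPU as the final fallback."""
    if use_cpu:
        return "cpu"
    return "cuda:0" if cuda_available else "cpu"

# On a GPU box with no flags, the GPU is picked up automatically:
print(resolve_device(use_cpu=False, cuda_available=True))   # cuda:0
# Passing --use_cpu forces CPU even when a GPU is present:
print(resolve_device(use_cpu=True, cuda_available=True))    # cpu
```

So no --device flag should be needed on an NVIDIA machine: once optimum-habana is out of the way, the GPU is detected automatically.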

Thanks~

lkk12014402 avatar Dec 13 '23 03:12 lkk12014402

Hi, I will close this issue if there are no further concerns.

kevinintel avatar Jun 05 '24 07:06 kevinintel