RuntimeError: Expected all tensors to be on the same device using Deepspeed with QLoRA and DPOTrainer
I couldn't find any similar issues in accelerate, peft, or trl, so I'm opening one here. When using the DPOTrainer on a single GPU with QLoRA I have no issues, but when I try to run the script with accelerate + DeepSpeed I keep getting "RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:2 and cuda:0!".
main.py
import torch
from transformers import AutoTokenizer, TrainingArguments, AutoModelForCausalLM
from transformers import BitsAndBytesConfig
from datasets import load_dataset
from peft import LoraConfig, PeftModel
from trl import DPOTrainer
import bitsandbytes as bnb
model_name = "lmsys/vicuna-7b-v1.5"
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=False,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    quantization_config=quantization_config,
)
model.enable_input_require_grads()
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.padding_side = "left"
dataset = load_dataset("argilla/distilabel-intel-orca-dpo-pairs")["train"]
# dataset prep
def process(example):
    # Format system
    if len(example['system']) > 0:
        message = {"role": "system", "content": example['system']}
        system = tokenizer.apply_chat_template([message], tokenize=False)
    else:
        system = ""
    # Format instruction
    message = {"role": "user", "content": example['input']}
    prompt = tokenizer.apply_chat_template([message], tokenize=False, add_generation_prompt=True)
    # Format chosen answer
    chosen = example['chosen'] + tokenizer.eos_token
    # Format rejected answer
    rejected = example['rejected'] + tokenizer.eos_token
    return {
        "prompt": system + prompt,
        "chosen": chosen,
        "rejected": rejected,
    }
dataset = dataset.map(process, remove_columns=dataset.column_names, batched=False)
# LoRA configuration
peft_config = LoraConfig(
    r=48,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['k_proj', 'v_proj', 'q_proj'],
)
# Training arguments
training_args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},  # https://github.com/huggingface/trl/issues/1136
    fp16=True,
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    max_steps=400,
    save_strategy="steps",
    save_steps=400,
    save_total_limit=1,
    logging_steps=1,
    output_dir="./new_model",
    warmup_ratio=0.03,
    report_to="none",
    deepspeed="./zero2.json",
)
# Create DPO trainer
dpo_trainer = DPOTrainer(
    model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
    beta=0.1,
    max_prompt_length=512,
    max_length=1024,
    dataset_num_proc=4,
)
# Fine-tune model with DPO
dpo_trainer.train()
zero2.json
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": "auto"
    },
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto"
    }
}
accelerate config.yaml
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_config_file: ./zero2.json
  zero3_init_flag: true
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
When I comment out `deepspeed = "./zero2.json"` in the TrainingArguments and run the command below, I have no issues:
CUDA_VISIBLE_DEVICES=0 python main.py
But if I run the script above with either the accelerate CLI or the deepspeed CLI, I get the same error:
accelerate launch --config_file ./config.yaml main.py
or
deepspeed main.py
Both give me the following stack trace:
Stack Trace
Traceback (most recent call last):
File "/home/nathaniel/llava/dpo-slerp/./vicuna_dpo.py", line 107, in <module>
dpo_trainer.train()
File "/home/nathaniel/.local/lib/python3.9/site-packages/transformers/trainer.py", line 1624, in train
return inner_training_loop(
File "/home/nathaniel/.local/lib/python3.9/site-packages/transformers/trainer.py", line 1961, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/nathaniel/.local/lib/python3.9/site-packages/transformers/trainer.py", line 2902, in training_step
loss = self.compute_loss(model, inputs)
File "/home/nathaniel/.local/lib/python3.9/site-packages/trl/trainer/dpo_trainer.py", line 1077, in compute_loss
loss, metrics = self.get_batch_loss_metrics(model, inputs, train_eval="train")
File "/home/nathaniel/.local/lib/python3.9/site-packages/trl/trainer/dpo_trainer.py", line 1018, in get_batch_loss_metrics
) = self.concatenated_forward(model, batch)
File "/home/nathaniel/.local/lib/python3.9/site-packages/trl/trainer/dpo_trainer.py", line 981, in concatenated_forward
all_logits = model(
File "/home/nathaniel/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/nathaniel/.local/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/nathaniel/.local/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1842, in forward
loss = self.module(*inputs, **kwargs)
File "/home/nathaniel/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/nathaniel/.local/lib/python3.9/site-packages/peft/peft_model.py", line 1083, in forward
return self.base_model(
File "/home/nathaniel/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/nathaniel/.local/lib/python3.9/site-packages/peft/tuners/tuners_utils.py", line 161, in forward
return self.model.forward(*args, **kwargs)
File "/home/nathaniel/.local/lib/python3.9/site-packages/accelerate/hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
File "/home/nathaniel/.local/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 1168, in forward
outputs = self.model(
File "/home/nathaniel/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/nathaniel/.local/lib/python3.9/site-packages/accelerate/hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
File "/home/nathaniel/.local/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 966, in forward
inputs_embeds = self.embed_tokens(input_ids)
File "/home/nathaniel/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
result = forward_call(*args, **kwargs)
File "/home/nathaniel/.local/lib/python3.9/site-packages/accelerate/hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
File "/home/nathaniel/.local/lib/python3.9/site-packages/torch/nn/modules/sparse.py", line 162, in forward
return F.embedding(
File "/home/nathaniel/.local/lib/python3.9/site-packages/torch/nn/functional.py", line 2210, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
Based on the Accelerate DeepSpeed integration guides and other tutorials I've seen, I was expecting the switch to DeepSpeed above to run without this error.
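The accelerate/hooks.py frames in the trace suggest the 4-bit base model ended up dispatched across several GPUs within each process, which is exactly what the embedding lookup then trips over. Below is a minimal diagnostic sketch, assuming the `model` object from main.py above, to check the placement right after loading:

# Hypothetical diagnostic, not part of the original script: collect the devices
# the quantized base model's weights landed on after from_pretrained().
# More than one cuda device per process would explain the cross-device error
# once the embedding layer receives input_ids that live on cuda:0.
param_devices = {p.device for p in model.parameters()}
buffer_devices = {b.device for b in model.buffers()}
print(f"parameter devices: {param_devices}, buffer devices: {buffer_devices}")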
I should also add my environment info:
- `Accelerate` version: 0.27.2
- Platform: Linux-5.10.0-28-cloud-amd64-x86_64-with-glibc2.31
- Python version: 3.9.2
- Numpy version: 1.24.4
- PyTorch version (GPU?): 2.0.1+cu118 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 188.71 GB
- GPU type: NVIDIA L4
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: DEEPSPEED
- use_cpu: False
- debug: False
- num_processes: 4
- machine_rank: 0
- num_machines: 1
- rdzv_backend: static
- same_network: True
- main_training_function: main
- deepspeed_config: {'deepspeed_config_file': '/home/nathaniel/llava/dpo-slerp/zero2.json', 'zero3_init_flag': True}
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
Other package versions:
transformers==4.38.1
peft==0.8.2
trl==0.7.11
I don't have experience with DeepSpeed, so I can't really help you here. But I wanted to mention that we're currently adding a PEFT + DS guide to the PEFT docs; maybe you can find something useful in there.
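In the meantime, a workaround that comes up in related trl/peft device-mismatch threads (a hedged sketch only, I haven't verified it against DeepSpeed ZeRO-2) is to pin the entire quantized base model to the GPU of the launching process instead of letting it get dispatched across devices:

# Hedged sketch of a commonly suggested workaround; model_name and
# quantization_config are reused from main.py above. The device_map entry
# forces the whole 4-bit model onto the GPU owned by the current rank.
import torch
from accelerate import PartialState
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    quantization_config=quantization_config,
    device_map={"": PartialState().process_index},
)

Whether that plays nicely with the ZeRO-2 engine is something you'd have to confirm; it at least rules out the base model being split across GPUs inside a single process.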
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Was this issue solved?
Same issue here. Why was this closed?