Training with Accelerator Fails. RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:7! (when checking argument for argument index in method wrapper__index_select)
I am trying to train a BLOOM-3B model on a setup with 8 GPUS of 20GB each.
The training code is similar to the tutorial here: Distributed training with Accelerate. There is no "main" function used in my code.
The model is loaded with the device map "balanced_low_0":
if get_world_size() > 1:
    kwargs["device_map"] = "balanced_low_0"
model = AutoModelForCausalLM.from_pretrained(model_name, **kwargs)
Some of the layers are frozen using param.requires_grad = False
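Concretely, the freezing is roughly the following (the same loop appears in the full script further down in this thread; model is the BLOOM model loaded just above):

# Freeze every transformer block except the last one; only the last block and
# the layers outside model.transformer.h keep requires_grad=True.
for layer in model.transformer.h[:-1]:
    for param in layer.parameters():
        param.requires_grad = False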
The accelerate config file I'm using has the following parameters:
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
gpu_ids : 0,1,2,3,4,5,6,7
downcast_bf16: 'no'
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 2
use_cpu: false
On launching the code with accelerate and the above config I get the following error:
File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/mqa_new/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 158, in forward
output = old_forward(*args, **kwargs)
File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/mqa_new/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 158, in forward
return F.embedding(return F.embedding(
File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/mqa_new/lib/python3.8/site-packages/torch/nn/functional.py", line 2199, in embedding
File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/mqa_new/lib/python3.8/site-packages/torch/nn/functional.py", line 2199, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:7! (when checking argument for argument index in method wrapper__index_select)
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:7! (when checking argument for argument index in method wrapper__index_select)
I have tried with both accelerate versions 0.15.0 and 0.16.0 and the problem persists. Please help me understand what I am missing.
Could you share a minimal sample of code reproducing the error please?
I run the following code (bloom-accelerate-trainer-minimal.py) on a setup of 8 A4500 GPUs with 20GB of VRAM each:
import argparse
import os
from transformers import AdamW, get_linear_schedule_with_warmup
from datasets import load_dataset
from torch.utils.data import DataLoader
import torch
from tqdm import tqdm
import torch.distributed as dist
from transformers import AutoModelForCausalLM, AutoTokenizer
from accelerate import Accelerator
def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", required=False, type=int, help="used by dist launchers")
    parser.add_argument("--model_name", type=str, help="Model name or path", required=True)
    parser.add_argument("--cache_dir", type=str, default="/home/Models/", help="Path to the Cache")
    parser.add_argument("--data_path", type=str, help="Path to datasets folder")
    parser.add_argument("--per_device_batch_size", default=1, type=int, help="batch size")
    parser.add_argument("--num_train_epochs", type=int, default=5, help="Number of training epochs")
    parser.add_argument("--learning_rate", type=float, default=0.00001, help="Learning rate")
    parser.add_argument("--save_steps", type=int, default=5000, help="Checkpoint saving frequency")
    parser.add_argument("--eval_steps", type=int, default=1000, help="Evaluation frequency")
    parser.add_argument("--dtype", type=str, help="float16 or int8", choices=["int8", "float16"], default="float16")
    return parser.parse_args()
args = get_args()
local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = torch.cuda.device_count()
rank = local_rank
accelerator = Accelerator()
def preprocess_train_examples(examples):
    inputs = examples['review']
    model_inputs = tokenizer(inputs, max_length=512, padding="max_length", truncation=True)
    model_inputs["labels"] = model_inputs["input_ids"].copy()
    return model_inputs

def print_rank0(*msg):
    if rank != 0:
        return
    print(*msg)

def get_world_size() -> int:
    if dist.is_initialized():
        return dist.get_world_size()
    else:
        return 1
print_rank0(f"Using {world_size} gpus")
model_name = args.model_name
print_rank0(f"Loading model {model_name}")
tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=args.cache_dir)
# XXX: can't automatically derive dtype via config's `from_pretrained`
dtype = torch.bfloat16 if model_name in ["bigscience/bloom", "bigscience/bigscience-small-testing"] else torch.float16
# print(get_max_memory_per_gpu_dict())
infer_dtype = args.dtype
if infer_dtype == "int8":
    dtype = torch.int8

kwargs = dict(
    device_map="auto",
)
# balanced_low_0 - because it allows a larger batch size with multiple GPUs
if get_world_size() > 1:
    kwargs["device_map"] = "balanced_low_0"

if infer_dtype == "int8":
    print_rank0("Using `load_in_8bit=True` to use quantized model")
    kwargs["load_in_8bit"] = True
else:
    kwargs["torch_dtype"] = dtype

if args.cache_dir is not None:
    kwargs["cache_dir"] = args.cache_dir
model = AutoModelForCausalLM.from_pretrained(model_name, **kwargs)
# Freezing all but last layer
for layer in model.transformer.h[:-1]:
    for param in layer.parameters():
        param.requires_grad = False

trainables = []
for p in model.parameters():
    if p.requires_grad:
        trainables.append(p)
print_rank0(f"\n\n Total Parameters:\t {sum(tp.numel() for tp in model.parameters())}\n Trainable Parameters:\t {sum(tp.numel() for tp in trainables)}\n\n")
### Train
# Load data
dataset = load_dataset("lhoestq/demo1")
encoded_train_ds = dataset["train"].map(preprocess_train_examples,
                                        batched=True,
                                        remove_columns=dataset["train"].column_names)
encoded_train_ds.set_format("torch", columns=['input_ids', 'attention_mask', 'labels'])
train_dataloader = DataLoader(encoded_train_ds, shuffle=True, batch_size=8*args.per_device_batch_size)
# Instantiate optimizer
optimizer = AdamW(model.parameters(), lr=args.learning_rate)
num_training_steps = args.num_train_epochs * len(train_dataloader) # 250000
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=num_training_steps,
)
model = accelerator.prepare(model)
train_dataloader, optimizer, scheduler = accelerator.prepare(train_dataloader, optimizer, scheduler)
progress_bar = tqdm(range(num_training_steps))
model.train()
for epoch in range(args.num_train_epochs):
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
Launched as
accelerate launch --config_file /data/home/z004nuxn/aac_mgpu_config.yml \
bloom-accelerate-trainer-minimal.py \
--model_name bigscience/bloomz-3b
where aac_mgpu_config.yml contains the parameters as listed in my previous comment.
This leads to a RuntimeError. The two processes print their tracebacks interleaved; deduplicated, the full traceback is:
Traceback (most recent call last):
File "bloom-inference-scripts/bloom-accelerate-trainer-minimal.py", line 137, in <module>
outputs = model(**batch)
File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/MQA/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/MQA/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1008, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/MQA/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 969, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0])
File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/MQA/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/MQA/lib/python3.8/site-packages/accelerate/hooks.py", line 158, in new_forward
output = old_forward(*args, **kwargs)
File "/home/z004nuxn/.local/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 903, in forward
transformer_outputs = self.transformer(
File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/MQA/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/MQA/lib/python3.8/site-packages/accelerate/hooks.py", line 158, in new_forward
output = old_forward(*args, **kwargs)
File "/home/z004nuxn/.local/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 729, in forward
inputs_embeds = self.word_embeddings(input_ids)
File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/MQA/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/MQA/lib/python3.8/site-packages/accelerate/hooks.py", line 158, in new_forward
output = old_forward(*args, **kwargs)
File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/MQA/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 158, in forward
return F.embedding(
File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/MQA/lib/python3.8/site-packages/torch/nn/functional.py", line 2199, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:7! (when checking argument for argument index in method wrapper__index_select)
The other process reports the same error with cuda:1 and cuda:7:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:7! (when checking argument for argument index in method wrapper__index_select)
The error is the same with both accelerate versions 0.15 and 0.16.
Big model inference is only for inference, not training at this time.
Oops: I'm wrong!
@muellerzr the problem is in the forward though ;-) And it should work for training as long as there is no offload.
There isn't CPU offload as far as I'm aware. Also, the tensor device mismatch is between the GPUs and not a CPU and a GPU.
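As a quick sanity check (a minimal sketch using only plain PyTorch; model is the object created in the script above), every parameter should live on a CUDA device, and anything on cpu (or meta) would point to offloading:

# Collect the set of devices the weights actually live on.
devices = {p.device for p in model.parameters()}
print("parameter devices:", devices)  # expecting only cuda:* entries here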
@ananda1996ai
First note that you cannot use data parallelism in conjunction with model parallelism, so num_processes in your config needs to be 1. I cannot reproduce the error; could you copy and paste here the result of model._hf_device_map so we can debug further? Note that for training, device_map="balanced" is more recommended than device_map="balanced_low_0".
Could you also try the just-released v0.17.0 to make sure your bug has not already been fixed?
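As a rough illustration of that suggestion (a hedged sketch, not verified on this setup; kwargs and model_name are reused from the script above):

# Use a "balanced" map instead of "balanced_low_0", and launch a single process
# (num_processes: 1 in the accelerate config), since DDP-style data parallelism
# cannot be combined with a multi-GPU device_map.
kwargs["device_map"] = "balanced"
model = AutoModelForCausalLM.from_pretrained(model_name, **kwargs)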
Did you mean model.hf_device_map? There is no attribute _hf_device_map.
The output of print_rank0(model.hf_device_map) is simply {'': 7}, which is perhaps not correct.
I set device_map="balanced" and num_processes = 1 and still got a similar error:
Traceback (most recent call last):
File "bloom-inference-scripts/bloom-accelerate-trainer-minimal.py", line 139, in <module>
outputs = model(**batch)
File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/MQA/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/MQA/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1008, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/MQA/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 969, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0])
File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/MQA/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/MQA/lib/python3.8/site-packages/accelerate/hooks.py", line 158, in new_forward
output = old_forward(*args, **kwargs)
File "/home/z004nuxn/.local/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 903, in forward
transformer_outputs = self.transformer(
File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/MQA/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/MQA/lib/python3.8/site-packages/accelerate/hooks.py", line 158, in new_forward
output = old_forward(*args, **kwargs)
File "/home/z004nuxn/.local/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 729, in forward
inputs_embeds = self.word_embeddings(input_ids)
File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/MQA/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/MQA/lib/python3.8/site-packages/accelerate/hooks.py", line 158, in new_forward
output = old_forward(*args, **kwargs)
File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/MQA/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 158, in forward
return F.embedding(
File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/MQA/lib/python3.8/site-packages/torch/nn/functional.py", line 2199, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:7! (when checking argument for argument index in method wrapper__index_select)
Oh the problem is quite clear then, the process only sees GPU 7. I think it all stems from the fact that you use num_processes=2 in your accelerate config.
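One way to check that from inside the script is a minimal sketch like this, which makes each launched process print what it can actually see:

import os
import torch

# Each process started by `accelerate launch` reports its local rank and the
# CUDA devices visible to it.
print(f"LOCAL_RANK={os.getenv('LOCAL_RANK', '0')} "
      f"CUDA_VISIBLE_DEVICES={os.environ.get('CUDA_VISIBLE_DEVICES')} "
      f"device_count={torch.cuda.device_count()}")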
The last outputs I shared were with num_processes=1, so it didn't make a difference.
With higher num_processes like 4 or 8, the tensor device mismatch error isn't there, but it becomes a CUDA OOM error.
@sgugger It seems that num_processes is entangled with the number of GPUs to use. If it is set to 1, only one GPU is used even when there are multiple GPUs and I try to apply model parallelism with DeepSpeed ZeRO-3.
Besides, the example config file from the blog also seems to set num_processes=2 while using model parallelism.
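For what it's worth, a minimal sketch of how num_processes maps onto processes and devices (using only standard Accelerator attributes):

from accelerate import Accelerator

accelerator = Accelerator()
# With `accelerate launch --num_processes N script.py`, N copies of the script
# run, and each one reports its own index and default device.
print(f"process {accelerator.process_index}/{accelerator.num_processes} on {accelerator.device}")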
@TingchenFu Have you solved the problem yet? I run into the same problem!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Same problem @sgugger
same issue
same issue but with deepspeed and lora
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.