
Training with Accelerator Fails. RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:7! (when checking argument for argument index in method wrapper__index_select)

Open ananda1996ai opened this issue 1 year ago • 14 comments

I am trying to train a BLOOM-3B model on a setup with 8 GPUS of 20GB each.

The training code is similar to the tutorial here: Distributed training with Accelerate. There is no "main" function used in my code.

The model is loaded with the device map "balanced_low_0"

if get_world_size() > 1:
    kwargs["device_map"] = "balanced_low_0"

model = AutoModelForCausalLM.from_pretrained(model_name, **kwargs)

Some of the layers are frozen using param.requires_grad = False

The accelerate config file I'm using has the following parameters:

compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
gpu_ids : 0,1,2,3,4,5,6,7
downcast_bf16: 'no'
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 2
use_cpu: false

On launching the code with accelerate and the above config, both processes fail with the following error:

  File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/mqa_new/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 158, in forward
    output = old_forward(*args, **kwargs)
  File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/mqa_new/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 158, in forward
        return F.embedding(return F.embedding(

  File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/mqa_new/lib/python3.8/site-packages/torch/nn/functional.py", line 2199, in embedding
  File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/mqa_new/lib/python3.8/site-packages/torch/nn/functional.py", line 2199, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:7! (when checking argument for argument index in method wrapper__index_select)
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:7! (when checking argument for argument index in method wrapper__index_select)

I have tried with both Accelerate versions 0.15.0 and 0.16.0 and the problem persists. Please help me understand what I am missing.

ananda1996ai avatar Mar 09 '23 12:03 ananda1996ai

Could you share a minimal sample of code reproducing the error please?

sgugger avatar Mar 09 '23 12:03 sgugger

I run the following code (bloom-accelerate-trainer-minimal.py) on a setup of 8 A4500 GPUs with 20GB of VRAM each.

import argparse
import os
from transformers import AdamW, get_linear_schedule_with_warmup
from datasets import load_dataset
from torch.utils.data import DataLoader

import torch
from tqdm import tqdm
import torch.distributed as dist

from transformers import AutoModelForCausalLM, AutoTokenizer

from accelerate import Accelerator


def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", required=False, type=int, help="used by dist launchers")
    parser.add_argument("--model_name", type=str, help="Model name or path", required=True)
    parser.add_argument("--cache_dir", type=str, default="/home/Models/", help="Path to the Cache")
    parser.add_argument("--data_path", type=str, help="Path to datasets folder")
    parser.add_argument("--per_device_batch_size", default=1, type=int, help="batch size")
    parser.add_argument("--num_train_epochs", type=int, default=5, help="Number of training epochs")
    parser.add_argument("--learning_rate", type=float, default=0.00001, help="Learning rate")
    parser.add_argument("--save_steps", type=int, default=5000, help="Checkpoint saving frequency")
    parser.add_argument("--eval_steps", type=int, default=1000, help="Evaluation frequency")
    parser.add_argument("--dtype", type=str, help="float16 or int8", choices=["int8", "float16"], default="float16")

    return parser.parse_args()


args = get_args()

local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = torch.cuda.device_count()

rank = local_rank

accelerator = Accelerator()


def preprocess_train_examples(examples):
    inputs = examples['review']
    model_inputs = tokenizer(inputs, max_length=512, padding="max_length", truncation=True)
    model_inputs["labels"] = model_inputs["input_ids"].copy()
    return model_inputs


def print_rank0(*msg):
    if rank != 0:
        return
    print(*msg)


def get_world_size() -> int:
    if dist.is_initialized():
        return dist.get_world_size()
    else:
        return 1


print_rank0(f"Using {world_size} gpus")
model_name = args.model_name
print_rank0(f"Loading model {model_name}")

tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=args.cache_dir)

# XXX: can't automatically derive dtype via config's `from_pretrained`
dtype = torch.bfloat16 if model_name in ["bigscience/bloom", "bigscience/bigscience-small-testing"] else torch.float16

# print(get_max_memory_per_gpu_dict())

infer_dtype = args.dtype
if infer_dtype == "int8":
    dtype = torch.int8

kwargs = dict(
    device_map="auto",
)

# balanced_low_0 - because it allows a larger batch size with multiple GPUs
if get_world_size() > 1:
    kwargs["device_map"] = "balanced_low_0"
if infer_dtype == "int8":
    print_rank0("Using `load_in_8bit=True` to use quanitized model")
    kwargs["load_in_8bit"] = True
else:
    kwargs["torch_dtype"] = dtype
if args.cache_dir is not None:
    kwargs["cache_dir"] = args.cache_dir


model = AutoModelForCausalLM.from_pretrained(model_name, **kwargs)

# Freezing all but last layer
for layer in model.transformer.h[:-1]:
    for param in layer.parameters():
        param.requires_grad = False

trainables = []
for p in model.parameters():
    if p.requires_grad:
        trainables.append(p)

print_rank0(f"\n\n Total Parameters:\t {sum(tp.numel() for tp in model.parameters())}\n Trainable Parameters:\t {sum(tp.numel() for tp in trainables)}\n\n")

### Train

# Load data
dataset = load_dataset("lhoestq/demo1")

encoded_train_ds = dataset["train"].map(preprocess_train_examples,
                                        batched=True,
                                        remove_columns=dataset["train"].column_names)

encoded_train_ds.set_format("torch", columns=['input_ids', 'attention_mask', 'labels'])

train_dataloader = DataLoader(encoded_train_ds, shuffle=True, batch_size=8*args.per_device_batch_size)

# Instantiate optimizer
optimizer = AdamW(model.parameters(), lr=args.learning_rate)

num_training_steps = args.num_train_epochs * len(train_dataloader)  # 250000
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=num_training_steps,
    )

model = accelerator.prepare(model)

train_dataloader, optimizer, scheduler = accelerator.prepare(train_dataloader, optimizer, scheduler)

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(args.num_train_epochs):
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

Launched as

accelerate launch --config_file /data/home/z004nuxn/aac_mgpu_config.yml \
bloom-accelerate-trainer-minimal.py \
--model_name bigscience/bloomz-3b

where aac_mgpu_config.yml contains the parameters as listed in my previous comment.

This leads to a RuntimeError in each process (full traceback):

Traceback (most recent call last):
  File "bloom-inference-scripts/bloom-accelerate-trainer-minimal.py", line 137, in <module>
    outputs = model(**batch)
  File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/MQA/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/MQA/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1008, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/MQA/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 969, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/MQA/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/MQA/lib/python3.8/site-packages/accelerate/hooks.py", line 158, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/z004nuxn/.local/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 903, in forward
    transformer_outputs = self.transformer(
  File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/MQA/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/MQA/lib/python3.8/site-packages/accelerate/hooks.py", line 158, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/z004nuxn/.local/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 729, in forward
    inputs_embeds = self.word_embeddings(input_ids)
  File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/MQA/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/MQA/lib/python3.8/site-packages/accelerate/hooks.py", line 158, in new_forward
    output = old_forward(*args, **kwargs)
  File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/MQA/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 158, in forward
    return F.embedding(
  File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/MQA/lib/python3.8/site-packages/torch/nn/functional.py", line 2199, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:7! (when checking argument for argument index in method wrapper__index_select)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:7! (when checking argument for argument index in method wrapper__index_select)

The same error occurs with both Accelerate versions 0.15 and 0.16.

ananda1996ai avatar Mar 09 '23 17:03 ananda1996ai

Big model inference is only for inference, not training at this time.

Oops: I'm wrong!

muellerzr avatar Mar 09 '23 17:03 muellerzr

@muellerzr the problem is in the forward though ;-) And it should work for training as long as there is no offload.

sgugger avatar Mar 09 '23 17:03 sgugger

@muellerzr the problem is in the forward though ;-) And it should work for training as long as there is no offload.

There is no CPU offload as far as I'm aware. Also, the tensor device mismatch is between two GPUs, not between a CPU and a GPU.

ananda1996ai avatar Mar 09 '23 17:03 ananda1996ai

@ananda1996ai First note that you cannot use data parallelism in conjunction with model parallelism, so num_processes in your config needs to be 1. I cannot reproduce the error; could you copy and paste the result of model._hf_device_map here so we can debug further? Note that for training, device_map="balanced" is recommended over device_map="balanced_low_0".

Could you also try the just released v0.17.0 to make sure your bug has not been already fixed?
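
Roughly, the single-process setup being suggested looks like this (a sketch only, reusing the model name and dtype from this thread):

import torch
from transformers import AutoModelForCausalLM

# Sketch: a single process drives the whole sharded model, so the accelerate
# config should have num_processes: 1 (no DDP on top of the device map).
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloomz-3b",
    device_map="balanced",    # spread the layers evenly over all visible GPUs
    torch_dtype=torch.float16,
)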

sgugger avatar Mar 09 '23 18:03 sgugger

Did you mean model.hf_device_map? There is no attribute _hf_device_map. The output of print_rank0(model.hf_device_map) is simply {'': 7}, which is perhaps not correct. I set device_map="balanced" and num_processes = 1 and still got a similar error:

Traceback (most recent call last):
  File "bloom-inference-scripts/bloom-accelerate-trainer-minimal.py", line 139, in <module>
    outputs = model(**batch)
  File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/MQA/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/MQA/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1008, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/MQA/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 969, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/MQA/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/MQA/lib/python3.8/site-packages/accelerate/hooks.py", line 158, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/z004nuxn/.local/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 903, in forward
    transformer_outputs = self.transformer(
  File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/MQA/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/MQA/lib/python3.8/site-packages/accelerate/hooks.py", line 158, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/z004nuxn/.local/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 729, in forward
    inputs_embeds = self.word_embeddings(input_ids)
  File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/MQA/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/MQA/lib/python3.8/site-packages/accelerate/hooks.py", line 158, in new_forward
    output = old_forward(*args, **kwargs)
  File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/MQA/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 158, in forward
    return F.embedding(
  File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/MQA/lib/python3.8/site-packages/torch/nn/functional.py", line 2199, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:7! (when checking argument for argument index in method wrapper__index_select)

ananda1996ai avatar Mar 10 '23 07:03 ananda1996ai

Oh, the problem is quite clear then: the process only sees GPU 7. I think it all stems from the fact that you use num_processes=2 in your accelerate config.
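
A quick sanity check (sketch only) is to print what each process can actually see before loading the model:

import os
import torch

# Sketch: if the launcher has narrowed CUDA_VISIBLE_DEVICES down to a single
# GPU, device_map has nothing to balance across and the whole model can end
# up on that one device (e.g. a device map of {'': 7}).
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("visible GPU count    =", torch.cuda.device_count())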

sgugger avatar Mar 10 '23 12:03 sgugger

Oh, the problem is quite clear then: the process only sees GPU 7. I think it all stems from the fact that you use num_processes=2 in your accelerate config.

The last output I shared was with num_processes=1, so that didn't make a difference. With a higher num_processes such as 4 or 8, the tensor device mismatch error goes away, but I get a CUDA OOM error instead.

ananda1996ai avatar Mar 10 '23 16:03 ananda1996ai

@sgugger It seems that num_processes is entangled with the number of GPUs to use. If it is set to 1, only one GPU is used even when there are multiple GPUs and I try to apply model parallelism with DeepSpeed ZeRO-3.

Besides, the example config file from the blog also seems to set num_processes=2 while using model parallelism.

TingchenFu avatar Mar 22 '23 09:03 TingchenFu

@TingchenFu Have you solved the problem yet? I run into the same problem!

michael-wzhu avatar Mar 30 '23 14:03 michael-wzhu

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Apr 23 '23 15:04 github-actions[bot]

Same problem @sgugger

jacklanda avatar Apr 30 '23 10:04 jacklanda

same issue

AaronZLT avatar May 21 '23 09:05 AaronZLT

same issue but with deepspeed and lora

2018211801 avatar Jun 08 '23 14:06 2018211801

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jul 02 '23 15:07 github-actions[bot]