
Incorrect hook mounting causes tensors to end up on different devices.

rangehow opened this issue 1 year ago • 4 comments

System Info

- `Accelerate` version: 0.33.0
- Platform: Linux-6.8.0-41-generic-x86_64-with-glibc2.39
- `accelerate` bash location: /mnt/rangehow/miniconda3/bin/accelerate
- Python version: 3.12.2
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.4.0+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 1007.72 GB
- GPU type: NVIDIA GeForce RTX 3090
- `Accelerate` default config:
        Not found

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • [X] My own task or dataset (give details below)

Reproduction

I manually set the device_map for the Qwen2.5-7B model as follows:

Device Map:

  • Device 0: 7 elements
    Modules: model.norm, model.layers.0, model.layers.1, model.layers.2, model.layers.3, model.layers.4, model.layers.5

  • Device 1: 9 elements
    Modules: model.layers.6, model.layers.7, model.layers.8, model.layers.9, model.layers.10, model.layers.11, model.layers.12, model.layers.13, model.layers.14

  • Device 2: 9 elements
    Modules: model.layers.15, model.layers.16, model.layers.17, model.layers.18, model.layers.19, model.layers.20, model.layers.21, model.layers.22, model.layers.23

  • Device 3: 6 elements
    Modules: model.layers.24, model.layers.25, model.layers.26, model.layers.27, lm_head, model.embed_tokens
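
Written out as a plain dictionary, the mapping above corresponds roughly to the following sketch (reconstructed from the list; not the exact dict I used):

device_map = {"model.norm": 0, "lm_head": 3, "model.embed_tokens": 3}
device_map.update({f"model.layers.{i}": 0 for i in range(0, 6)})    # layers 0-5  -> GPU 0
device_map.update({f"model.layers.{i}": 1 for i in range(6, 15)})   # layers 6-14 -> GPU 1
device_map.update({f"model.layers.{i}": 2 for i in range(15, 24)})  # layers 15-23 -> GPU 2
device_map.update({f"model.layers.{i}": 3 for i in range(24, 28)})  # layers 24-27 -> GPU 3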

using

model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    torch_dtype="auto",
    device_map=device_map,
    low_cpu_mem_usage=True,  # boolean, not the string "True"
    trust_remote_code=True,
)

When I directly feed a tensor into the model, everything works fine:

result = model(input_ids=input_ids, attention_mask=attention_mask) # tensors on cpu
result = model(input_ids=input_ids.to(model.device), attention_mask=attention_mask.to(model.device)) # tensors on gpu

However, once I bring transformers' trainer.train() into the picture, a bug appears. Even though model.device is clearly cuda:3, after passing through https://github.com/huggingface/accelerate/blob/4305033f8035defad0a87cd38e5c918e78510ba5/src/accelerate/hooks.py#L165 the inputs are forcibly moved back to cuda:0, for the exact same two forward calls shown above.

In https://github.com/huggingface/accelerate/blob/4305033f8035defad0a87cd38e5c918e78510ba5/src/accelerate/hooks.py#L242 the hook identifies the wrong execution device and silently moves the tensor, which breaks everything downstream. Is there any way I can intervene here? Why is execution_device not the same as model.device?
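
For context, the wrapper that Accelerate installs around each module's forward looks roughly like this (a simplified sketch of hooks.py, keeping the library's names but omitting details):

def new_forward(module, *args, **kwargs):
    # pre_forward moves the positional and keyword arguments to the hook's
    # execution_device before the real forward runs; this is the step that
    # silently sends my inputs to cuda:0.
    args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
    output = module._old_forward(*args, **kwargs)
    return module._hf_hook.post_forward(module, output)

# Inside AlignDevicesHook.pre_forward the move is essentially
# send_to_device(args, self.execution_device), so whatever execution_device
# the hook recorded wins over model.device.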

~~At the same time, why does this issue not occur when performing a direct forward pass after loading the model, but only after passing the model to the Trainer?~~ Edit: when not training, the causal-mask generation path that triggers this is skipped.

Expected behavior

Inputs should be moved to the correct execution device (the one the module actually lives on) instead of being forced onto cuda:0.

rangehow (Sep 24 '24)

Hey @rangehow, thanks for the report!

However, once I bring transformers' trainer.train() into the picture, a bug appears. Even though model.device is clearly cuda:3, after passing through

What do you mean by introducing trainer.train()? Does the bug appear after training a model with Trainer?

Even though model.device is clearly cuda:3, after passing through ..., the inputs are forcibly moved back to cuda:0, for the exact same two forward calls shown above.

Do you mean module.device, since the model is spread across multiple devices? If so, it is indeed strange that the input is moved to cuda:0. The execution device is set in the dispatch_model function, so it should stay the same before and after training. The only case where we set an execution device different from the module's device is when the module is offloaded to the CPU.
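
To narrow it down, could you print the execution device recorded on each hook? Something like this (assuming the standard _hf_hook attribute that dispatch_model attaches to every dispatched module):

for name, module in model.named_modules():
    hook = getattr(module, "_hf_hook", None)
    if hook is not None and getattr(hook, "execution_device", None) is not None:
        print(name, hook.execution_device)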

Could you share a reproducer of the issue you are facing, plus the full traceback?

SunMarc (Sep 30 '24)

Sorry for the late reply. I will try to provide a minimal reproducible example soon.

rangehow (Oct 03 '24)

Hey @SunMarc, thanks again for your help. I'm back; to reproduce this you'll need a machine with 4 GPUs. Running the script below should produce the traceback that follows.

import torch
from transformers import Trainer, TrainingArguments, AutoTokenizer, AutoModelForCausalLM

# Fake dataset size and input configuration
batch_size = 2  # Batch size for each training step
seq_length = 8  # Input sequence length
vocab_size = 30522  # Used to generate a random tensor within the vocabulary index

# Model and tokenizer path
model_name = "Qwen/Qwen2.5-7B-Instruct"
device_map = {
    'model.norm': 0,
    'model.layers.0': 0, 'model.layers.1': 0, 'model.layers.2': 0, 'model.layers.3': 0, 'model.layers.4': 0,
    'model.layers.5': 0, 'model.layers.6': 0, 'model.layers.7': 0, 'model.layers.8': 0,
    'model.layers.9': 1, 'model.layers.10': 1, 'model.layers.11': 1, 'model.layers.12': 1,
    'model.layers.13': 1, 'model.layers.14': 1, 'model.layers.15': 1, 'model.layers.16': 1,
    'model.layers.17': 2, 'model.layers.18': 2, 'model.layers.19': 2, 'model.layers.20': 2,
    'model.layers.21': 2, 'model.layers.22': 2, 'model.layers.23': 2, 'model.layers.24': 2,
    'model.layers.25': 3, 'model.layers.26': 3, 'model.layers.27': 3,
    'lm_head': 3, 'model.embed_tokens': 3,
}

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map=device_map,
    trust_remote_code=True,
)


# Fake training data
def generate_fake_dataset_as_list(num_samples, seq_length, vocab_size):
    dataset = []
    for _ in range(num_samples):
        input_ids = torch.randint(0, vocab_size, (seq_length,)).tolist()  # Randomly generate input_ids
        labels = input_ids.copy()  # labels are usually identical to input_ids unless there's a specific task requirement
        
        # Randomly generate some 0s in the mask to simulate the case where some tokens are masked
        attention_mask = torch.randint(0, 2, (seq_length,)).tolist()  # Random mask of 0s or 1s
        
        dataset.append({
            "input_ids": input_ids,
            "labels": labels,
            "attention_mask": attention_mask
        })
    return dataset


# Create a fake dataset
fake_train_dataset = generate_fake_dataset_as_list(10, seq_length, vocab_size)  # 10 samples

# Define training parameters
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    save_strategy="no",
    eval_strategy="no",
    bf16=True,
)

# Use Trainer to conduct training
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=fake_train_dataset,
)


trainer.train()

Traceback

Traceback (most recent call last):
  File "/mnt/rangehow/1/11.py", line 88, in <module>
    trainer.train()
  File "/mnt/rangehow/miniconda3/lib/python3.12/site-packages/transformers/trainer.py", line 1938, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/mnt/rangehow/miniconda3/lib/python3.12/site-packages/transformers/trainer.py", line 2279, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/rangehow/miniconda3/lib/python3.12/site-packages/transformers/trainer.py", line 3318, in training_step
    loss = self.compute_loss(model, inputs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/rangehow/1/11.py", line 75, in compute_loss
    result = model(
             ^^^^^^
  File "/mnt/rangehow/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1554, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/rangehow/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1564, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/rangehow/miniconda3/lib/python3.12/site-packages/accelerate/utils/operations.py", line 820, in forward
    return model_forward(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/rangehow/miniconda3/lib/python3.12/site-packages/accelerate/utils/operations.py", line 808, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/rangehow/miniconda3/lib/python3.12/site-packages/torch/amp/autocast_mode.py", line 43, in decorate_autocast
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/rangehow/miniconda3/lib/python3.12/site-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/rangehow/miniconda3/lib/python3.12/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1115, in forward
    outputs = self.model(
              ^^^^^^^^^^^
  File "/mnt/rangehow/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1554, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/rangehow/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1564, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/rangehow/miniconda3/lib/python3.12/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 892, in forward
    causal_mask = self._update_causal_mask(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/rangehow/miniconda3/lib/python3.12/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1011, in _update_causal_mask
    causal_mask = _prepare_4d_causal_attention_mask_with_cache_position(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/rangehow/miniconda3/lib/python3.12/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 109, in _prepare_4d_causal_attention_mask_with_cache_position
    padding_mask = causal_mask[:, :, :, :mask_length] + attention_mask[:, None, None, :]
                   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cuda:0!
  0%|          | 0/15 [00:03<?, ?it/s] 

rangehow (Oct 03 '24)

If there's anything else you need me to clarify or help test, feel free to reply at any time. I hope this bug can be resolved soon; right now the only way I can avoid it is to make model.device the 0th device (i.e., assign lm_head to device 0).
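
Concretely, the only kind of mapping that works for me right now looks like this sketch (the layer split is hypothetical; the point is just that lm_head and embed_tokens sit on device 0):

# Workaround sketch: keep lm_head and the embedding on cuda:0 so that
# model.device matches the device the hook moves the inputs to.
device_map = {"model.norm": 3, "lm_head": 0, "model.embed_tokens": 0}
device_map.update({f"model.layers.{i}": i // 7 for i in range(28)})  # 7 layers per GPU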

rangehow (Oct 03 '24)

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] (Oct 28 '24)