transformers
device_map='auto' gives bad results
System Info

- transformers version: 4.25.1
- Platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.17
- Python version: 3.8.15
- Huggingface_hub version: 0.11.1
- PyTorch version (GPU?): 1.11.0 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: no
- GPUs: two A100
Who can help?
No response
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
Minimal test example:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = 'EleutherAI/gpt-neo-125M'
model = AutoModelForCausalLM.from_pretrained(model_name, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(model_name)

sentence = 'Hello, nice to meet you. How are'
with torch.no_grad():
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    gen_tokens = model.generate(tensor_input, max_length=32)
    generated = tokenizer.batch_decode(gen_tokens)[0]
    print(generated)
```
Results:
Hello, nice to meet you. How are noise retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy
The above result is not expected behavior. Without device_map='auto', i.e. with the loading line changed to model = AutoModelForCausalLM.from_pretrained(model_name), it works correctly.
Results:
Hello, nice to meet you. How are you?
I’m a bit of a newbie to the world of web development, but I
My machine has two A100 (80 GB) GPUs, and I confirmed that the model is loaded on both GPUs when I use device_map='auto'.
Expected behavior
Explained above
Hi @youngwoo-yoon
Thanks for the issue!
What is your version of accelerate? With the latest version (0.15.0) and the same PyTorch version, on an NVIDIA T4 I get this for the minimal test example shared above that uses device_map='auto':
Hello, nice to meet you. How are you?
I’m a bit of a newbie to the world of web development, but I
Hello @younesbelkada, I'm using the same accelerate version, 0.15.0.
I also got the correct result when I ran with export CUDA_VISIBLE_DEVICES=0, but still wrong results with two GPUs (export CUDA_VISIBLE_DEVICES=0,1).
Thanks for the details! I still did not manage to reproduce; can you try this snippet instead:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = 'EleutherAI/gpt-neo-125M'
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map={"transformer.wte": 0, "transformer.wpe": 0, "transformer.h": 1, "transformer.ln_f": 1, "lm_head": 1},
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

sentence = 'Hello, nice to meet you. How are'
with torch.no_grad():
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    gen_tokens = model.generate(tensor_input, max_length=32)
    generated = tokenizer.batch_decode(gen_tokens)[0]
    print(generated)
```
and let me know if the problem still persists?
We're using the same PyTorch, transformers, and accelerate versions. The only difference is the hardware (I am using 2x NVIDIA T4).
Can you also try your script with export CUDA_VISIBLE_DEVICES=1 instead of export CUDA_VISIBLE_DEVICES=0?
Thanks for the quick replies. This is the result, and it still doesn't look good:
Hello, nice to meet you. How are!!!!!!!!!!!!!!!!!!!!!!!
My original test code with export CUDA_VISIBLE_DEVICES=1 gives the same correct result as with export CUDA_VISIBLE_DEVICES=0:
Hello, nice to meet you. How are you?
I’m a bit of a newbie to the world of web development, but I
I am slightly unsure about what could be causing the issue, but I suspect it's highly correlated with the fact that you're running your script on two A100s.
@sgugger do you think the problem could be related to accelerate and the fact that the script is running on two A100s rather than other hardware (i.e., have you seen similar discrepancy errors in the past)?
@youngwoo-yoon could you also try the script with the latest PyTorch version (1.13.1)?
@younesbelkada, I got the same wrong result with PyTorch 1.13.1.
Hello, nice to meet you. How are noise retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy
Mmmm, there is no reason for the script to give different results on different GPUs, especially since removing device_map="auto" gives the same results.
I also can't reproduce on my side. Are you absolutely certain your script is launched in the same Python environment you are reporting? E.g. can you print the versions of Accelerate/Transformers/Pytorch in the same script?
I put the test scripts using the CPU, GPU 0, GPU 1, and device_map='auto' into a single Python file to be sure.
```python
from importlib.metadata import version

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

print('torch', version('torch'))
print('transformers', version('transformers'))
print('accelerate', version('accelerate'))
print('# of gpus: ', torch.cuda.device_count())

# on the cpu
model_name = 'EleutherAI/gpt-neo-125M'
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
sentence = 'Hello, nice to meet you. How are'
with torch.no_grad():
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    gen_tokens = model.generate(tensor_input, max_length=32)
    generated = tokenizer.batch_decode(gen_tokens)[0]
    print(generated)
print('-------------------------------------------')

# on gpu 0
model = AutoModelForCausalLM.from_pretrained(model_name)
model = model.to('cuda:0')
with torch.no_grad():
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    tensor_input = tensor_input.to('cuda:0')
    gen_tokens = model.generate(tensor_input, max_length=32)
    generated = tokenizer.batch_decode(gen_tokens)[0]
    print(generated)
print('-------------------------------------------')

# on gpu 1
model = AutoModelForCausalLM.from_pretrained(model_name)
model = model.to('cuda:1')
with torch.no_grad():
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    tensor_input = tensor_input.to('cuda:1')
    gen_tokens = model.generate(tensor_input, max_length=32)
    generated = tokenizer.batch_decode(gen_tokens)[0]
    print(generated)
print('-------------------------------------------')

# with device_map='auto'
model = AutoModelForCausalLM.from_pretrained(model_name, device_map='auto')
with torch.no_grad():
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    gen_tokens = model.generate(tensor_input, max_length=32)
    generated = tokenizer.batch_decode(gen_tokens)[0]
    print(generated)
```
And this is the result:
```
torch 1.13.1
transformers 4.25.1
accelerate 0.15.0
# of gpus: 2
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Hello, nice to meet you. How are you?
I’m a bit of a newbie to the world of web development, but I
-------------------------------------------
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Hello, nice to meet you. How are you?
I’m a bit of a newbie to the world of web development, but I
-------------------------------------------
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Hello, nice to meet you. How are you?
I’m a bit of a newbie to the world of web development, but I
-------------------------------------------
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
/home/user/anaconda3/envs/task_temp/lib/python3.10/site-packages/transformers/generation/utils.py:1470: UserWarning: You are calling .generate() with the `input_ids` being on a device type different than your model's device. `input_ids` is on cpu, whereas the model is on cuda. You may experience unexpected behaviors or slower generation. Please make sure that you have put `input_ids` to the correct device by calling for example input_ids = input_ids.to('cuda') before running `.generate()`.
  warnings.warn(
Hello, nice to meet you. How are noise retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy
```
And this is the nvidia-smi output:
```
Tue Dec 27 16:57:48 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.106.00 Driver Version: 460.106.00 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100 80GB PCIe Off | 00000000:4F:00.0 Off | 0 |
| N/A 36C P0 47W / 300W | 9MiB / 81251MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 A100 80GB PCIe Off | 00000000:52:00.0 Off | 0 |
| N/A 37C P0 45W / 300W | 9MiB / 81251MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2915 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 119486 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 2915 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 119486 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------+
```
There is a warning:

```
/home/user/anaconda3/envs/task_temp/lib/python3.10/site-packages/transformers/generation/utils.py:1470: UserWarning: You are calling .generate() with the `input_ids` being on a device type different than your model's device. `input_ids` is on cpu, whereas the model is on cuda. You may experience unexpected behaviors or slower generation. Please make sure that you have put `input_ids` to the correct device by calling for example input_ids = input_ids.to('cuda') before running `.generate()`.
```
You did move the inputs when processing on one of the two GPUs; it might be necessary here too. Could you print the hf_device_map attribute of the model and try moving the inputs to cuda devices 0 and 1?
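For concreteness, a minimal sketch of this suggestion (reusing model, tokenizer, and sentence from the script above; the cuda:0 target assumes the first shard was placed on GPU 0):

```python
# Inspect where accelerate placed each submodule.
print(model.hf_device_map)

# Tokenize and move the inputs to the device of the first shard; calling
# the tokenizer directly also produces the attention_mask the warning asks for.
inputs = tokenizer(sentence, return_tensors='pt').to('cuda:0')
gen_tokens = model.generate(**inputs, max_length=32)
print(tokenizer.batch_decode(gen_tokens)[0])
```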
I moved the inputs to cuda:0 and cuda:1, but both gave the same wrong result. Below is the output with the inputs on cuda:0.
```
torch 1.13.1
transformers 4.25.1
accelerate 0.15.0
# of gpus: 2
hf_device_map output: {'transformer.wte': 0, 'lm_head': 0, 'transformer.wpe': 0, 'transformer.drop': 0, 'transformer.h.0': 0, 'transformer.h.1': 0, 'transformer.h.2': 0, 'transformer.h.3': 0, 'transformer.h.4': 0, 'transformer.h.5': 0, 'transformer.h.6': 1, 'transformer.h.7': 1, 'transformer.h.8': 1, 'transformer.h.9': 1, 'transformer.h.10': 1, 'transformer.h.11': 1, 'transformer.ln_f': 1}
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Hello, nice to meet you. How are noiseleanor pressuring retaliate incarcer boundousy]= incarcer incarcer high * Karin�� Annotationsousyousyousy pressuring retaliateousyousyousy
```
I will try to reproduce this issue on another machine with two GPUs.
It works well on another machine with two Quadro 6000 GPUs.
I've tried different device_map strategies, 'sequential' and 'balanced_low_0' (sketched below), but it still fails when the two A100 GPUs are used.
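For reference, a minimal sketch of those variants (reusing model_name from the scripts above; 'sequential' and 'balanced_low_0' are the documented alternatives to 'auto'):

```python
# Fill GPU 0 first, then spill over to GPU 1.
model = AutoModelForCausalLM.from_pretrained(model_name, device_map='sequential')
# Balance layers across GPUs while keeping GPU 0 as free as possible.
model = AutoModelForCausalLM.from_pretrained(model_name, device_map='balanced_low_0')
```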
I ran the accelerate test command, which tests the accelerate library, and it also failed. It seems like a problem in the accelerate library.
I found that some other people also had problems with A100 GPUs.
Related issue: https://github.com/huggingface/accelerate/issues/934
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hi @younesbelkada, I got the same error with two V100s, with accelerate version 0.18.0.
prompt = 'Q: What is the largest animal?\nA:'
output:

```
A: The blue whale.
Q: What is the largest animal?
A: The blue whale. It is the largest animal on Earth. It is also the largest mammal. It is the largest creature that has ever lived.
Q: What is the largest animal?
A: The blue whale is the largest animal on Earth. It is also the largest mammal. It is the largest creature that has ever lived.
Q: What is the largest animal?
A: The blue whale is the largest animal on Earth. It is also the largest mammal. It is the largest creature that has ever lived.
Q: What is the largest animal?
A: The blue whale is the largest animal on Earth. It is also the largest mammal. It is the largest creature that has ever lived.
Q: What is the largest animal?
A: The blue whale is the largest animal on Earth. It is also the largest mammal. It is the largest creature that has ever lived.
Q: What is the largest animal?
A: The blue whale is the largest animal on Earth. It is also the largest mammal. It is the largest creature that has ever lived.
Q
```
code:
```python
import torch
from transformers import LlamaTokenizer, LlamaForCausalLM

model_path = 'openlm-research/open_llama_3b'
tokenizer = LlamaTokenizer.from_pretrained(model_path)
model = LlamaForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map='auto'
)

prompt = 'Q: What is the largest animal?\nA:'
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_ids = input_ids.to('cuda')

generation_output = model.generate(
    input_ids=input_ids, max_length=400
)
print(tokenizer.decode(generation_output[0]))
```
Have you found a solution?
I think you should use the same prompt format as the one used in training. Also, pay attention to the special tokens that you add. For example, in training I tokenize:
`f"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n ### Input: <s>{input}</s>. \n### Response: <s>{output}</s>"`
Afterward, I used the model:
text = f"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n ### Input: {input}. \n### Response: "
batch = tokenizer(text, return_tensors='pt', padding=True, return_token_type_ids=False)
with torch.cuda.amp.autocast():
output_tokens = model.generate(**batch, max_new_tokens=500)
decode = tokenizer.decode(output_tokens[0], skip_special_tokens=True)
decode_text = decode[len(text):]
print(decode_text)
Hope this helps!
Hi @youngwoo-yoon, have you solved this problem? I have the same problem on A100s.
I'm also running into a similar issue, except with A6000s. With 1 A6000 and the rest of the weights on cpu, I get coherent text. With multiple A6000s, I get garbage outputs.
I solved this problem by disabling ACS in the BIOS. This document might be helpful to some of you: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html
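For anyone who wants to check whether ACS is enabled on their machine before rebooting into the BIOS, here is a minimal sketch (assuming a Linux host with lspci installed and root access; the SrcValid+ criterion comes from the NCCL troubleshooting guide linked above):

```python
# Print the PCI ACS control lines; per the NCCL troubleshooting guide,
# "SrcValid+" in an ACSCtl line means ACS is enabled for that device.
import subprocess

out = subprocess.run(["sudo", "lspci", "-vvv"], capture_output=True, text=True).stdout
for line in out.splitlines():
    if "ACSCtl" in line:
        print(line.strip())
```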
Amazing!!! It works for me.