transformers
device_map='auto' gives bad results
System Info

- transformers version: 4.25.1
- Platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.17
- Python version: 3.8.15
- Huggingface_hub version: 0.11.1
- PyTorch version (GPU?): 1.11.0 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: no
- GPUs: two A100
Who can help?
No response
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
Minimal test example:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = 'EleutherAI/gpt-neo-125M'
model = AutoModelForCausalLM.from_pretrained(model_name, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(model_name)

sentence = 'Hello, nice to meet you. How are'
with torch.no_grad():
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    gen_tokens = model.generate(tensor_input, max_length=32)
    generated = tokenizer.batch_decode(gen_tokens)[0]
    print(generated)
```
Results:
Hello, nice to meet you. How are noise retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy
The above result is not expected behavior. Without device_map='auto', i.e. with the loading line changed to model = AutoModelForCausalLM.from_pretrained(model_name), it works correctly.
Results:
Hello, nice to meet you. How are you?
I’m a bit of a newbie to the world of web development, but I
My machine has two A100 (80 GB) GPUs, and I confirmed that the model is loaded on both GPUs when I use device_map='auto'.
Expected behavior
Explained above
Hi @youngwoo-yoon
Thanks for the issue!
What is your version of accelerate? With the latest version (0.15.0) and the same PyTorch version, on an NVIDIA T4 I get this for the minimal test example shared above that uses device_map='auto':
Hello, nice to meet you. How are you?
I’m a bit of a newbie to the world of web development, but I
Hello @younesbelkada, I'm using the same accelerate version, 0.15.0.
I also got the correct result when I ran with export CUDA_VISIBLE_DEVICES=0, but still wrong results with two GPUs (export CUDA_VISIBLE_DEVICES=0,1).
Thanks for the details! I still did not manage to reproduce; can you try this snippet instead:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = 'EleutherAI/gpt-neo-125M'
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map={"transformer.wte": 0, "transformer.wpe": 0, "transformer.h": 1, "transformer.ln_f": 1, "lm_head": 1},
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

sentence = 'Hello, nice to meet you. How are'
with torch.no_grad():
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    gen_tokens = model.generate(tensor_input, max_length=32)
    generated = tokenizer.batch_decode(gen_tokens)[0]
    print(generated)
```
and let me know if the problem still persists?
We're using the same PyTorch, transformers, and accelerate versions. The only difference is the hardware (I am using 2x NVIDIA T4).
Can you also try your script with export CUDA_VISIBLE_DEVICES=1 instead of export CUDA_VISIBLE_DEVICES=0?
Thanks for the quick replies. This is the result, and it still doesn't look good:
Hello, nice to meet you. How are!!!!!!!!!!!!!!!!!!!!!!!
My original test code with export CUDA_VISIBLE_DEVICES=1 gives the same correct result as with export CUDA_VISIBLE_DEVICES=0:
Hello, nice to meet you. How are you?
I’m a bit of a newbie to the world of web development, but I
I am slightly unsure about what could be causing the issue, but I suspect it's highly correlated with the fact that you're running your script on two A100s.
@sgugger do you think the problem could be related to accelerate and the fact that the script is running on two A100s rather than other hardware (i.e., have you seen similar discrepancy errors in the past)?
@youngwoo-yoon could you also try the script with the latest PyTorch version (1.13.1)?
@younesbelkada, I got the same wrong result with PyTorch 1.13.1.
Hello, nice to meet you. How are noise retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy
Mmmm, there is no reason for the script to give different results on different GPUs, especially since removing device_map="auto" gives the same results.
I also can't reproduce on my side. Are you absolutely certain your script is launched in the same Python environment you are reporting? E.g. can you print the versions of Accelerate/Transformers/Pytorch in the same script?
I put the test scripts using the CPU, GPU 0, GPU 1, and device_map='auto' into a single Python file to be sure.
```python
from importlib.metadata import version

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

print('torch', version('torch'))
print('transformers', version('transformers'))
print('accelerate', version('accelerate'))
print('# of gpus: ', torch.cuda.device_count())

# on the cpu
model_name = 'EleutherAI/gpt-neo-125M'
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
sentence = 'Hello, nice to meet you. How are'
with torch.no_grad():
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    gen_tokens = model.generate(tensor_input, max_length=32)
    generated = tokenizer.batch_decode(gen_tokens)[0]
    print(generated)
print('-------------------------------------------')

# on gpu 0
model = AutoModelForCausalLM.from_pretrained(model_name)
model = model.to('cuda:0')
with torch.no_grad():
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    tensor_input = tensor_input.to('cuda:0')
    gen_tokens = model.generate(tensor_input, max_length=32)
    generated = tokenizer.batch_decode(gen_tokens)[0]
    print(generated)
print('-------------------------------------------')

# on gpu 1
model = AutoModelForCausalLM.from_pretrained(model_name)
model = model.to('cuda:1')
with torch.no_grad():
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    tensor_input = tensor_input.to('cuda:1')
    gen_tokens = model.generate(tensor_input, max_length=32)
    generated = tokenizer.batch_decode(gen_tokens)[0]
    print(generated)
print('-------------------------------------------')

# with device_map='auto'
model = AutoModelForCausalLM.from_pretrained(model_name, device_map='auto')
with torch.no_grad():
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    gen_tokens = model.generate(tensor_input, max_length=32)
    generated = tokenizer.batch_decode(gen_tokens)[0]
    print(generated)
```
And this is the result:
```
torch 1.13.1
transformers 4.25.1
accelerate 0.15.0
# of gpus: 2
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Hello, nice to meet you. How are you?
I’m a bit of a newbie to the world of web development, but I
-------------------------------------------
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Hello, nice to meet you. How are you?
I’m a bit of a newbie to the world of web development, but I
-------------------------------------------
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Hello, nice to meet you. How are you?
I’m a bit of a newbie to the world of web development, but I
-------------------------------------------
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
/home/user/anaconda3/envs/task_temp/lib/python3.10/site-packages/transformers/generation/utils.py:1470: UserWarning: You are calling .generate() with the `input_ids` being on a device type different than your model's device. `input_ids` is on cpu, whereas the model is on cuda. You may experience unexpected behaviors or slower generation. Please make sure that you have put `input_ids` to the correct device by calling for example input_ids = input_ids.to('cuda') before running `.generate()`.
  warnings.warn(
Hello, nice to meet you. How are noise retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy
```
And this is the nvidia-smi output:
```
Tue Dec 27 16:57:48 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.106.00 Driver Version: 460.106.00 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100 80GB PCIe Off | 00000000:4F:00.0 Off | 0 |
| N/A 36C P0 47W / 300W | 9MiB / 81251MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 A100 80GB PCIe Off | 00000000:52:00.0 Off | 0 |
| N/A 37C P0 45W / 300W | 9MiB / 81251MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2915 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 119486 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 2915 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 119486 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------+
```
There is a warning:

```
/home/user/anaconda3/envs/task_temp/lib/python3.10/site-packages/transformers/generation/utils.py:1470: UserWarning: You are calling .generate() with the `input_ids` being on a device type different than your model's device. `input_ids` is on cpu, whereas the model is on cuda. You may experience unexpected behaviors or slower generation. Please make sure that you have put `input_ids` to the correct device by calling for example input_ids = input_ids.to('cuda') before running `.generate()`.
```
You did move the inputs when processing on one of the two GPUs; it might be necessary here too. Could you print the hf_device_map attribute of the model and try moving the inputs to cuda devices 0 and 1?
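For concreteness, a minimal sketch of this suggestion (reusing model, tokenizer, and sentence from the script above; the cuda:0 target assumes the first shard was placed on GPU 0):

```python
# Inspect where accelerate placed each submodule.
print(model.hf_device_map)

# Tokenize and move the inputs to the device of the first shard; calling
# the tokenizer directly also produces the attention_mask the warning asks for.
inputs = tokenizer(sentence, return_tensors='pt').to('cuda:0')
gen_tokens = model.generate(**inputs, max_length=32)
print(tokenizer.batch_decode(gen_tokens)[0])
```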
I moved the inputs to cuda:0 and cuda:1, but both gave the same wrong result. Below is the output with the inputs on cuda:0.
```
torch 1.13.1
transformers 4.25.1
accelerate 0.15.0
# of gpus: 2
hf_device_map output: {'transformer.wte': 0, 'lm_head': 0, 'transformer.wpe': 0, 'transformer.drop': 0, 'transformer.h.0': 0, 'transformer.h.1': 0, 'transformer.h.2': 0, 'transformer.h.3': 0, 'transformer.h.4': 0, 'transformer.h.5': 0, 'transformer.h.6': 1, 'transformer.h.7': 1, 'transformer.h.8': 1, 'transformer.h.9': 1, 'transformer.h.10': 1, 'transformer.h.11': 1, 'transformer.ln_f': 1}
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Hello, nice to meet you. How are noiseleanor pressuring retaliate incarcer boundousy]= incarcer incarcer high * Karin�� Annotationsousyousyousy pressuring retaliateousyousyousy
```
I will try to reproduce this issue on another machine with two GPUs.
It works well on another machine with two Quadro 6000 GPUs.
I've tried different device_map strategies, 'sequential' and 'balanced_low_0' (sketched below), but it still fails when the two A100 GPUs are used.
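For reference, a minimal sketch of those variants (reusing model_name from the scripts above; 'sequential' and 'balanced_low_0' are the documented alternatives to 'auto'):

```python
# Fill GPU 0 first, then spill over to GPU 1.
model = AutoModelForCausalLM.from_pretrained(model_name, device_map='sequential')
# Balance layers across GPUs while keeping GPU 0 as free as possible.
model = AutoModelForCausalLM.from_pretrained(model_name, device_map='balanced_low_0')
```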
I ran the accelerate test command, which tests the accelerate library, and it also failed. It seems like a problem in the accelerate library.
I found that some other people also had problems with A100 GPUs.
Related issue: https://github.com/huggingface/accelerate/issues/934
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hi @younesbelkada, I got the same error with two V100s, with accelerate version 0.18.0.
prompt = 'Q: What is the largest animal?\nA:'
output:

```
A: The blue whale.
Q: What is the largest animal?
A: The blue whale. It is the largest animal on Earth. It is also the largest mammal. It is the largest creature that has ever lived.
Q: What is the largest animal?
A: The blue whale is the largest animal on Earth. It is also the largest mammal. It is the largest creature that has ever lived.
Q: What is the largest animal?
A: The blue whale is the largest animal on Earth. It is also the largest mammal. It is the largest creature that has ever lived.
Q: What is the largest animal?
A: The blue whale is the largest animal on Earth. It is also the largest mammal. It is the largest creature that has ever lived.
Q: What is the largest animal?
A: The blue whale is the largest animal on Earth. It is also the largest mammal. It is the largest creature that has ever lived.
Q: What is the largest animal?
A: The blue whale is the largest animal on Earth. It is also the largest mammal. It is the largest creature that has ever lived.
Q
```
code:
```python
import torch
from transformers import LlamaTokenizer, LlamaForCausalLM

model_path = 'openlm-research/open_llama_3b'
tokenizer = LlamaTokenizer.from_pretrained(model_path)
model = LlamaForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map='auto'
)

prompt = 'Q: What is the largest animal?\nA:'
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_ids = input_ids.to('cuda')

generation_output = model.generate(
    input_ids=input_ids, max_length=400
)
print(tokenizer.decode(generation_output[0]))
```
Have you found a solution?
I think you should use the same prompt format as the one used in training. Also, pay attention to the special tokens that you add. For example, in training I tokenize:
`f"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n ### Input: <s>{input}</s>. \n### Response: <s>{output}</s>"`
Afterward, I used the model:
text = f"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n ### Input: {input}. \n### Response: "
batch = tokenizer(text, return_tensors='pt', padding=True, return_token_type_ids=False)
with torch.cuda.amp.autocast():
output_tokens = model.generate(**batch, max_new_tokens=500)
decode = tokenizer.decode(output_tokens[0], skip_special_tokens=True)
decode_text = decode[len(text):]
print(decode_text)
Hope this helps!
Hi @youngwoo-yoon, have you solved this problem? I have the same problem on A100s.
I'm also running into a similar issue, except with A6000s. With 1 A6000 and the rest of the weights on cpu, I get coherent text. With multiple A6000s, I get garbage outputs.
I solved this problem by disabling ACS in the BIOS. This document might be helpful to some of you: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html
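For anyone who wants to check whether ACS is enabled on their machine before rebooting into the BIOS, here is a minimal sketch (assuming a Linux host with lspci installed and root access; the SrcValid+ criterion comes from the NCCL troubleshooting guide linked above):

```python
# Print the PCI ACS control lines; per the NCCL troubleshooting guide,
# "SrcValid+" in an ACSCtl line means ACS is enabled for that device.
import subprocess

out = subprocess.run(["sudo", "lspci", "-vvv"], capture_output=True, text=True).stdout
for line in out.splitlines():
    if "ACSCtl" in line:
        print(line.strip())
```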
Amazing!!! It works for me.