transformers OverflowError with device="mps" using dedicated GPU

System Info

2019 Mac Pro
AMD Radeon Pro W5700X 16 GB
macOS Ventura 13.3

transformers-cli env:

transformers version: 4.27.4
Platform: macOS-10.16-x86_64-i386-64bit
Python version: 3.9.16
Huggingface_hub version: 0.13.3
PyTorch version (GPU?): 2.1.0.dev20230403 (False)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?: yes
Using distributed or parallel set-up in script?: no

Who can help?

No response

Information

[ ] The official example scripts
[X] My own modified scripts

Tasks

[ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[ ] My own task or dataset (give details below)

Reproduction

Shell:

conda create -n transformerstest
conda activate transformerstest
conda install -c huggingface transformers
conda install pytorch torchvision torchaudio -c pytorch-nightly

Python:

from transformers import pipeline

generator = pipeline("text-generation", device="mps")
generator("In this course, we will teach you how to")

The system then compiles Metal shaders and does some work on the GPU, but the result is:

Traceback (most recent call last):
  File "/Users/fabian/devel/transformers-course/test.py", line 4, in <module>
    generator("In this course, we will teach you how to")
  File "/usr/local/Caskroom/miniconda/base/envs/transformerstest/lib/python3.9/site-packages/transformers/pipelines/text_generation.py", line 209, in __call__
    return super().__call__(text_inputs, **kwargs)
  File "/usr/local/Caskroom/miniconda/base/envs/transformerstest/lib/python3.9/site-packages/transformers/pipelines/base.py", line 1109, in __call__
    return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
  File "/usr/local/Caskroom/miniconda/base/envs/transformerstest/lib/python3.9/site-packages/transformers/pipelines/base.py", line 1117, in run_single
    outputs = self.postprocess(model_outputs, **postprocess_params)
  File "/usr/local/Caskroom/miniconda/base/envs/transformerstest/lib/python3.9/site-packages/transformers/pipelines/text_generation.py", line 270, in postprocess
    text = self.tokenizer.decode(
  File "/usr/local/Caskroom/miniconda/base/envs/transformerstest/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 3476, in decode
    return self._decode(
  File "/usr/local/Caskroom/miniconda/base/envs/transformerstest/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 549, in _decode
    text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
OverflowError: out of range integral type conversion attempted

Expected behavior

Generating output. This works on a MacBook Pro M1 with device="mps" (utilizing the GPU AFAICT) or on the Mac Pro without it (not utilizing GPU).

Thanks for your support!

Apr 04 '23 06:04 cmdrf

This looks similar to #22529. It is a bug in PyTorch rather than in Transformers, so you will have to wait for the PyTorch team to release a fix.

Apr 04 '23 13:04 sgugger

Thanks for the quick answer!

Not holding my breath for a fix though. It's one out of 10K+ open issues in pytorch...

Apr 05 '23 10:04 cmdrf

> Thanks for the quick answer!
>
> Not holding my breath for a fix though. It's one out of 10K+ open issues in pytorch...

Yeah that's the same issue. It just got marked high priority a few minutes ago so they're definitely looking at it.

In the meantime you can get it working if you make some manual fixes to your local copy of transformers. Not pretty, but it works.

In brief, I worked around it locally by searching <python-install>/lib/python3.X/site-packages/transformers for all references to argmax and changing each relevant call from X.argmax(...) to X.max(...).indices. I think I changed it in 5 or 6 files total. Which references are relevant depends on what you're doing. There are a ton of references under models/, but you only need to change the ones you actually use. I'm currently only looking at Llama models, and there were no calls to argmax under models/llama, so I didn't change any files under models/.
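The search step described above can be scripted. What follows is only a rough sketch under my own assumptions, not the exact procedure used in this thread: the helper name find_argmax_calls and the regular expression are made up for illustration, and it only lists candidate lines. Deciding which ones are relevant and editing them is still manual.

```python
import re
from pathlib import Path

# Matches method-style and function-style calls such as `x.argmax(` or
# `torch.argmax(`. The pattern is an illustrative assumption; it also matches
# occurrences inside comments and strings.
ARGMAX_CALL = re.compile(r"\bargmax\s*\(")


def find_argmax_calls(root):
    """Return (path, line_number, stripped_line) for each argmax call under root."""
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        try:
            text = path.read_text(encoding="utf-8")
        except (OSError, UnicodeDecodeError):
            continue  # skip unreadable files rather than abort the scan
        for number, line in enumerate(text.splitlines(), start=1):
            if ARGMAX_CALL.search(line):
                hits.append((str(path), number, line.strip()))
    return hits
```

Point it at the transformers directory inside site-packages, then at your own client code, and print the returned tuples. One thing to keep in mind when rewriting the hits: torch.max only returns a (values, indices) pair when it is given a dim argument, so a call like X.argmax() with no dim cannot be swapped for X.max().indices as-is.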

If you want to try that, I can send you a list of the files I had to change, relative to 4.28.0.dev0.

Then you'd also need to check your client code to see if it's making any of its own calls to torch.argmax, and change those too.

Finally, if you're using an Intel system with an AMD GPU, then due to a separate issue, https://github.com/pytorch/pytorch/issues/92752 , you also need to check for calls to torch.multinomial and rewrite those. There weren't any in transformers that affected me, but there was one in the client code I was using. I described how I changed that here: https://github.com/jankais3r/LLaMA_MPS/issues/14#issuecomment-1494959026 . Apparently Apple Silicon systems aren't affected by this bug.

It's a bit of a mess at the moment due to those MPS bugs - but it is possible to get it working if you're willing to hack transformers and check your client code.

Apr 05 '23 17:04 TheBloke

> It just got marked high priority a few minutes ago so they're definitely looking at it.

I pinged the PyTorch team on it ;-)

Apr 05 '23 17:04 sgugger

Much appreciated!

Apr 05 '23 17:04 TheBloke

Running LLaMA was actually my goal; I was just trying something simpler first.

Now I tried LLaMA using the following:

from transformers import AutoTokenizer, LlamaForCausalLM, pipeline

model = LlamaForCausalLM.from_pretrained("/path/to/models/llama-7b/")
tokenizer = AutoTokenizer.from_pretrained("/path/to/models/llama-7b/")
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, device="mps")
pipe("In this course, we will teach you how to")

Result:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Caskroom/miniconda/base/envs/textgen/lib/python3.10/site-packages/transformers/pipelines/text_generation.py", line 209, in __call__
    return super().__call__(text_inputs, **kwargs)
  File "/usr/local/Caskroom/miniconda/base/envs/textgen/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1109, in __call__
    return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
  File "/usr/local/Caskroom/miniconda/base/envs/textgen/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1117, in run_single
    outputs = self.postprocess(model_outputs, **postprocess_params)
  File "/usr/local/Caskroom/miniconda/base/envs/textgen/lib/python3.10/site-packages/transformers/pipelines/text_generation.py", line 270, in postprocess
    text = self.tokenizer.decode(
  File "/usr/local/Caskroom/miniconda/base/envs/textgen/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3485, in decode
    return self._decode(
  File "/usr/local/Caskroom/miniconda/base/envs/textgen/lib/python3.10/site-packages/transformers/tokenization_utils.py", line 931, in _decode
    filtered_tokens = self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens)
  File "/usr/local/Caskroom/miniconda/base/envs/textgen/lib/python3.10/site-packages/transformers/tokenization_utils.py", line 912, in convert_ids_to_tokens
    tokens.append(self._convert_id_to_token(index))
  File "/usr/local/Caskroom/miniconda/base/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama.py", line 119, in _convert_id_to_token
    token = self.sp_model.IdToPiece(index)
  File "/usr/local/Caskroom/miniconda/base/envs/textgen/lib/python3.10/site-packages/sentencepiece/__init__.py", line 1045, in _batched_func
    return _func(self, arg)
  File "/usr/local/Caskroom/miniconda/base/envs/textgen/lib/python3.10/site-packages/sentencepiece/__init__.py", line 1038, in _func
    raise IndexError('piece id is out of range.')
IndexError: piece id is out of range.

Which sounds like "minus nine trillion something" indices happening somewhere again. I didn't find "multinomial" or "argmax" under models/llama, but it's possible of course that those functions are called somewhere else.
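To confirm that suspicion, one option is to inspect the generated token ids before they reach the tokenizer. The snippet below is a minimal sketch with made-up values: find_invalid_token_ids is a hypothetical helper, the sample ids are invented, and 32000 is assumed as the vocabulary size (in practice you could take it from len(tokenizer)).

```python
def find_invalid_token_ids(token_ids, vocab_size):
    """Return (position, token_id) pairs for ids outside [0, vocab_size)."""
    return [
        (position, token_id)
        for position, token_id in enumerate(token_ids)
        if not 0 <= token_id < vocab_size
    ]


# Hypothetical ids: two plausible ones plus the int64 minimum, standing in for
# the kind of garbage index suspected above.
INT64_MIN = -(2 ** 63)
sample_ids = [818, 428, INT64_MIN]

print(find_invalid_token_ids(sample_ids, vocab_size=32000))
# prints [(2, -9223372036854775808)]
```

If this reports huge negative ids for real model output, the problem is upstream of the tokenizer, which matches both the OverflowError from the fast tokenizer and the IndexError from sentencepiece.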

Apr 07 '23 16:04 cmdrf

> Which sounds like "minus nine trillion something" indices happening somewhere again. I didn't find "multinomial" or "argmax" under models/llama, but it's possible of course that those functions are called somewhere else.

Yes, argmax is not referenced anywhere under models/llama, but it is referenced in multiple other places throughout transformers. In my earlier reply I described the process I followed to change those.

That test code works for me with my locally hacked copy of transformers.

Code:

from transformers import LlamaTokenizer, LlamaForCausalLM, pipeline

model = LlamaForCausalLM.from_pretrained("/Users/tomj/src/llama.cpp/models/llama-7b-HF")
tokenizer = LlamaTokenizer.from_pretrained("/Users/tomj/src/llama.cpp/models/llama-7b-HF")
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, device="mps")
print(pipe("In this course, we will teach you how to"))

Output:

tomj@Eddie ~/src $ ~/anaconda3/envs/torch21/bin/python ./test_llama.py
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 33/33 [00:20<00:00,  1.61it/s]
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'.
The class this function is called from is 'LlamaTokenizer'.
/Users/tomj/anaconda3/envs/torch21/lib/python3.10/site-packages/transformers/generation/utils.py:1219: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)
  warnings.warn(
/Users/tomj/anaconda3/envs/torch21/lib/python3.10/site-packages/transformers/generation/utils.py:1313: UserWarning: Using `max_length`'s default (20) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(
[{'generated_text': 'In this course, we will teach you how to use the most popular and powerful tools in the industry'}]

Apr 07 '23 16:04 TheBloke

I get the same error with torch nightly version 2.1.0.dev20230428 and 'MPS' on a 2020 27" iMac with an AMD Radeon 5700 XT GPU when running:

https://github.com/andreamad8/FSB

Apr 30 '23 01:04 dbl001

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

May 24 '23 15:05 github-actions[bot]
