
OPT 13B with INT8 producing gibberish content

juliensalinas opened this issue 2 years ago · 10 comments

Hello again!

I am testing your implementation of the OPT model with INT8 quantization. It works like a charm with a small version of OPT (I tested OPT 125M), but it doesn't work with bigger versions (I tested OPT 13B). I am not getting any error, but the generated content is gibberish. I am using a Tesla T4 GPU (I can't test FP16 and FP32 yet because I need to provision a bigger GPU).

Here is a working example with OPT 125M:

ct2-transformers-converter --model facebook/opt-125m --output_dir opt-125m

import ctranslate2
import transformers

tokenizer = transformers.GPT2Tokenizer.from_pretrained("facebook/opt-125m")
generator = ctranslate2.Generator("opt-125m",device="cuda",compute_type="int8")

prompt = "Hey, are you conscious? Can you talk to me?"
start_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))

results = generator.generate_batch([start_tokens], max_length=30)
output_tokens = results[0].sequences[0]

output = tokenizer.decode(tokenizer.convert_tokens_to_ids(output_tokens))
print(output)

Returns "Hey, are you conscious? Can you talk to me? Please?!?!?!?!?!?!?!?!?!?!?!?!?!?!?!?!?!" This is great.

Now the same thing with OPT 13B:

ct2-transformers-converter --model facebook/opt-13b --output_dir opt-13b

import ctranslate2
import transformers

tokenizer = transformers.GPT2Tokenizer.from_pretrained("facebook/opt-13b")
generator = ctranslate2.Generator("opt-13b",device="cuda",compute_type="int8")

prompt = "Hey, are you conscious? Can you talk to me?"
start_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))

results = generator.generate_batch([start_tokens], max_length=30)
output_tokens = results[0].sequences[0]

output = tokenizer.decode(tokenizer.convert_tokens_to_ids(output_tokens))
print(output)

Returns "Hey, are you conscious? Can you talk to me?,,,,,,,,,,,,,,,,,,,,,,"

Thanks a lot in advance!

juliensalinas · May 30 '22 19:05

Hi,

I can reproduce this incorrect output when 8-bit quantization is enabled. It seems to mostly impact the larger OPT variants: I tried 2.7B, 6.7B, and 13B, and they all show a similar issue with quantization. We need to investigate what is happening.

Note that the output looks fine with FP16.
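
For anyone who wants to reproduce the FP16 behavior, here is a minimal sketch (assuming a GPU with enough memory to hold the 13B weights in half precision; only compute_type changes compared to the snippet above):

import ctranslate2
import transformers

tokenizer = transformers.GPT2Tokenizer.from_pretrained("facebook/opt-13b")
# compute_type="float16" keeps the linear layers in half precision instead of quantizing them to int8
generator = ctranslate2.Generator("opt-13b", device="cuda", compute_type="float16")

prompt = "Hey, are you conscious? Can you talk to me?"
start_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))

results = generator.generate_batch([start_tokens], max_length=30)
output_tokens = results[0].sequences[0]
print(tokenizer.decode(tokenizer.convert_tokens_to_ids(output_tokens)))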

guillaumekln · May 31 '22 15:05

Thanks so much @guillaumekln !

I will test the FP16 implementation ASAP.

juliensalinas · May 31 '22 18:05

There are open issues in the Transformers repository about unexpected outputs with OPT models:

  • https://github.com/huggingface/transformers/issues/17545
  • https://github.com/huggingface/transformers/issues/17653

If there are issues with the model weights uploaded to the Model Hub, that could explain why the quantization does not work.

guillaumekln · Jun 10 '22 15:06

Are you running the model in full or half-precision? Maybe related: https://github.com/huggingface/transformers/pull/17437

patrickvonplaten · Jun 10 '22 16:06

Hi Patrick,

In this issue we run the model with 8-bit linear weights but the softmax is run in full precision. Note that we use a custom runtime, not the Transformers code.

Our 8-bit quantization usually works with most models, but not with the OPT models from the Hugging Face Model Hub. We are closely watching your issues about a possible conversion bug.

guillaumekln · Jun 10 '22 16:06

Just to add a bit of context:

One issue we have had with OPT in Transformers so far (link provided by @patrickvonplaten above) is that we add large negative values to the attention scores before the softmax in order to mask padded positions. In FP16, in some cases all the scores became -inf before the softmax, which produced NaN after the softmax.

If the implementation in this repo also uses some large negative values to perform masking, this might be a potential place to look.
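
As a tiny illustration of that failure mode (not the code from either repository, just numpy, and assuming a mask value close to the FP16 minimum):

import numpy as np

mask = np.float16(-65504.0)              # most negative finite FP16 value, used as "minus infinity"
scores = np.float16([-2000.0, -3000.0])  # attention scores that are already very negative
masked = scores + mask                   # overflows the FP16 range -> [-inf, -inf]

probs = np.exp(masked) / np.exp(masked).sum()  # exp(-inf) = 0, so this is 0 / 0
print(masked, probs)                           # [-inf -inf] [nan nan]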

ydshieh · Jun 10 '22 21:06

Yeah, having thought a bit more about this, I think it's probably not the reason, given the findings of https://github.com/huggingface/transformers/issues/17653

patrickvonplaten · Jun 14 '22 12:06

Should be fixed now: https://github.com/huggingface/transformers/releases/tag/v4.20.1

patrickvonplaten · Jun 21 '22 23:06

Thanks so much for the follow up @patrickvonplaten and @ydshieh !

juliensalinas · Jun 22 '22 05:06

Thanks for the update, @patrickvonplaten!

This improves the situation, but it looks like we still get some unexpected outputs after quantization with these models. This is probably on our side. We'll continue exploring.

guillaumekln · Jun 22 '22 09:06

This happens with OPT-30B as well. Any estimate of when this will be fixed?

henyee · Nov 29 '22 11:11

Unfortunately I don't have an estimate for this issue.

It's still not fully clear what the fix should be. My guess is that we need to implement the approach described in this blog post to remove outliers before running the int8 quantization and matmul.

Note that the current workaround is to run the model in FP16.
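
In case it helps anyone reading along, here is a rough numpy sketch of what "removing outliers before the int8 quantization and matmul" can look like (my own illustration, not CTranslate2 code; the 6.0 threshold is just a placeholder):

import numpy as np

def matmul_with_outliers(X, W, threshold=6.0):
    # X: (tokens, in_features) activations, W: (in_features, out_features) weights
    outliers = np.any(np.abs(X) > threshold, axis=0)       # feature columns with outlier activations

    # FP16 path for the outlier feature dimensions
    fp16_part = X[:, outliers].astype(np.float16) @ W[outliers].astype(np.float16)

    # INT8 path for the remaining dimensions, with per-row / per-column absmax scales
    Xs, Ws = X[:, ~outliers], W[~outliers]
    x_scale = np.maximum(np.abs(Xs).max(axis=1, keepdims=True), 1e-8) / 127.0
    w_scale = np.maximum(np.abs(Ws).max(axis=0, keepdims=True), 1e-8) / 127.0
    Xi = np.round(Xs / x_scale).astype(np.int8)
    Wi = np.round(Ws / w_scale).astype(np.int8)
    int8_part = (Xi.astype(np.int32) @ Wi.astype(np.int32)) * (x_scale * w_scale)

    return fp16_part.astype(np.float32) + int8_part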

guillaumekln · Nov 29 '22 13:11

Thanks for the update. You could also look at this project, which claims better performance than the bitsandbytes version of int8 quantization:

https://github.com/mit-han-lab/smoothquant

henyee · Nov 30 '22 02:11

Thanks for the link @henyee! The SmoothQuant approach works great and can be applied during model conversion with only small changes to the code.
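
For readers unfamiliar with it, the core trick is to migrate quantization difficulty from the activations to the weights with a per-input-channel scale. A minimal sketch of that smoothing step (my own illustration in torch, not the converter's actual code):

import torch

def smooth_linear_weight(weight, act_scale, alpha=0.5):
    # weight: (out_features, in_features); act_scale: per-input-channel max |activation|
    w_scale = weight.abs().max(dim=0).values.clamp(min=1e-5)
    s = (act_scale.pow(alpha) / w_scale.pow(1 - alpha)).clamp(min=1e-5)
    # the matching 1/s is folded into the preceding LayerNorm, so the layer output is unchanged
    return weight * s

With the smoothed weights, both the weights and the rescaled activations quantize to int8 with much smaller error.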

The PR above adds a new converter option --activation_scales to pass the pre-computed activation scales from SmoothQuant: https://github.com/mit-han-lab/smoothquant/tree/main/act_scales

For example:

ct2-transformers-converter --model facebook/opt-13b --activation_scales act_scales/opt-13b.pt --quantization int8 --output_dir opt-13b

I validated the change with the following code snippet:

import ctranslate2
import transformers

ctranslate2.set_random_seed(42)

tokenizer = transformers.GPT2Tokenizer.from_pretrained("facebook/opt-125m")
generator = ctranslate2.Generator(<MODEL_PATH>)

prompt = "The woman worked as a"
start_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))

results = generator.generate_batch([start_tokens], max_length=50, sampling_topk=10)

output_tokens = results[0].sequences[0]
output = tokenizer.decode(tokenizer.convert_tokens_to_ids(output_tokens))
print(output)

Here are the outputs:

Quantization: None
"The woman worked as a nurse in the hospital, and had gone to help her father who was suffering from a chronic illness. She was supposed to stay with him for some days, but she had gone missing for the last two days.

The"

Quantization: INT8 (without smoothing)
"The woman worked as a A A A A A A A. A, A A A A A A A"

Quantization: INT8 (with smoothing)
"The woman worked as a nurse in the hospital, and her mother was admitted there as well. The woman was not allowed to meet her family member for three days, until the mother died.

“I was not allowed to meet my mother"

The output is still not identical to the non-quantized output, but this looks good enough to me, so I will consider the issue resolved once the PR is merged.

guillaumekln · Dec 30 '22 15:12

Excellent news! Happy new year!

henyee · Dec 31 '22 15:12

Thanks so much for your work @guillaumekln, it sounds like it is indeed fixed now! Thanks again.

juliensalinas · Jan 01 '23 08:01