CTranslate2
OPT 13B with INT8 producing gibberish content
Hello again!
I am testing your implementation of the OPT model with INT8 quantization. It works like a charm with a small version of OPT (I tested OPT 125M), but it doesn't work with bigger versions (I tested OPT 13B). I am not getting any error, but the generated content is gibberish. I am using a Tesla T4 GPU (I can't test FP16 and FP32 yet because I need to provision a bigger GPU).
Here is a working example with OPT 125M:
ct2-transformers-converter --model facebook/opt-125m --output_dir opt-125m
import ctranslate2
import transformers
tokenizer = transformers.GPT2Tokenizer.from_pretrained("facebook/opt-125m")
generator = ctranslate2.Generator("opt-125m",device="cuda",compute_type="int8")
prompt = "Hey, are you conscious? Can you talk to me?"
start_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))
results = generator.generate_batch([start_tokens], max_length=30)
output_tokens = results[0].sequences[0]
output = tokenizer.decode(tokenizer.convert_tokens_to_ids(output_tokens))
print(output)
Returns "Hey, are you conscious? Can you talk to me? Please?!?!?!?!?!?!?!?!?!?!?!?!?!?!?!?!?!" This is great.
Now the same thing with OPT 13B:
ct2-transformers-converter --model facebook/opt-13b --output_dir opt-13b
import ctranslate2
import transformers
tokenizer = transformers.GPT2Tokenizer.from_pretrained("facebook/opt-13b")
generator = ctranslate2.Generator("opt-13b",device="cuda",compute_type="int8")
prompt = "Hey, are you conscious? Can you talk to me?"
start_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))
results = generator.generate_batch([start_tokens], max_length=30)
output_tokens = results[0].sequences[0]
output = tokenizer.decode(tokenizer.convert_tokens_to_ids(output_tokens))
print(output)
Returns "Hey, are you conscious? Can you talk to me?,,,,,,,,,,,,,,,,,,,,,,"
Thanks a lot in advance!
Hi,
I can reproduce this incorrect output when 8-bit quantization is enabled. It seems to mostly impact the larger variants of OPT: I tried 2.7B, 6.7B, and 13B, and they all have a similar issue with quantization. We should investigate to find out what is happening.
Note that the output looks fine with FP16.
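For reference, the FP16 run only differs in the compute_type argument. Something like the following should work (untested sketch, same tokenizer and prompt as in your snippet; the GPU needs enough memory to hold the FP16 weights):

import ctranslate2
import transformers

tokenizer = transformers.GPT2Tokenizer.from_pretrained("facebook/opt-13b")
# Load the converted model with FP16 weights/compute instead of INT8.
generator = ctranslate2.Generator("opt-13b", device="cuda", compute_type="float16")

prompt = "Hey, are you conscious? Can you talk to me?"
start_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))
results = generator.generate_batch([start_tokens], max_length=30)
print(tokenizer.decode(tokenizer.convert_tokens_to_ids(results[0].sequences[0])))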
Thanks so much, @guillaumekln!
I will test the FP16 implementation ASAP.
There are open issues in the Transformers repository about unexpected outputs with OPT models:
- https://github.com/huggingface/transformers/issues/17545
- https://github.com/huggingface/transformers/issues/17653
If there are some issues with the model weights uploaded to the Model Hub, it could explain why the quantization does not work.
Are you running the model in full or half-precision? Maybe related: https://github.com/huggingface/transformers/pull/17437
Hi Patrick,
In this issue we run the model with 8-bit linear weights but the softmax is run in full precision. Note that we use a custom runtime, not the Transformers code.
Our 8-bit quantization works with most models, but not with the OPT models coming from the Hugging Face Model Hub. We are closely watching your issues about a possible conversion bug.
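To illustrate what "8-bit linear weights" means here, below is a toy NumPy sketch of per-row absmax weight quantization. This is purely illustrative and not the actual CTranslate2 kernels: only the weight matmuls use INT8, while the softmax and the other operations stay in floating point.

import numpy as np

def quantize_int8(w):
    # One scale per output row (absmax scaling).
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_linear(x, q, scale):
    # For clarity we dequantize on the fly; a real INT8 kernel would also
    # quantize the activations and accumulate in int32.
    return x @ (q.astype(np.float32) * scale).T

w = np.random.randn(4, 8).astype(np.float32)
x = np.random.randn(2, 8).astype(np.float32)
q, scale = quantize_int8(w)
print(np.abs(x @ w.T - int8_linear(x, q, scale)).max())  # small quantization error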
Just to add a bit of context:
One issue we have in the Transformers OPT implementation so far (link provided by @patrickvonplaten above) is that we add large negative values to the attention scores before the softmax in order to mask padded positions. In FP16, we got all -inf before the softmax in some cases, and NaN after the softmax.
If the implementation in this repo also uses some large negative values to perform masking, this might be a potential place to look.
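A toy reproduction of that failure mode (illustrative numbers only, not taken from the OPT weights):

import torch

# Masking by adding a very large negative number to the attention scores:
# in FP16 the addition can overflow to -inf, and a row that is entirely -inf
# produces NaN after the softmax.
scores = torch.full((1, 4), -50000.0, dtype=torch.float16)                       # hypothetical pre-mask scores
mask = torch.full((1, 4), torch.finfo(torch.float16).min, dtype=torch.float16)   # "large negative" padding mask
masked = scores + mask
print(masked)                         # tensor([[-inf, -inf, -inf, -inf]], dtype=torch.float16)
print(torch.softmax(masked, dim=-1))  # tensor([[nan, nan, nan, nan]], dtype=torch.float16)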
Yeah, having thought a bit more about this, I think it's probably not the reason, given the findings in https://github.com/huggingface/transformers/issues/17653
Should be fixed now: https://github.com/huggingface/transformers/releases/tag/v4.20.1
Thanks so much for the follow-up, @patrickvonplaten and @ydshieh!
Thanks for the update, @patrickvonplaten!
This improves the situation, but it looks like we still have some unexpected outputs after quantization with these models. This is probably on our side. We'll continue exploring.
Happens on OPT-30B as well. Any estimate of when this will be fixed?
Unfortunately, I don't have an estimate for this issue.
It's still not fully clear what the fix should be. My guess is that we need to implement the approach described in this blog post to remove outliers before running the int8 quantization and matmul.
Note that the current workaround is to run the model in FP16.
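For context, here is a rough NumPy sketch of the outlier decomposition idea (in the spirit of LLM.int8(); nothing here is implemented in CTranslate2, and the threshold is made up): activation columns with unusually large magnitudes are computed in FP16 while the rest go through the INT8 path.

import numpy as np

def mixed_precision_matmul(x, w, threshold=6.0):
    # Columns of x dominated by outliers are handled in FP16.
    outlier_cols = np.abs(x).max(axis=0) > threshold
    y_fp16 = x[:, outlier_cols].astype(np.float16) @ w[outlier_cols, :].astype(np.float16)
    # Remaining columns use a simple weight-only INT8 path (per-output-column absmax).
    w_rest = w[~outlier_cols, :]
    scale = np.abs(w_rest).max(axis=0, keepdims=True) / 127.0
    q = np.clip(np.round(w_rest / scale), -127, 127).astype(np.int8)
    y_int8 = x[:, ~outlier_cols] @ (q.astype(np.float32) * scale)
    return y_fp16.astype(np.float32) + y_int8

x = np.random.randn(2, 16).astype(np.float32)
x[:, 3] *= 50  # inject an outlier column
w = np.random.randn(16, 8).astype(np.float32)
print(np.abs(x @ w - mixed_precision_matmul(x, w)).max())  # close to the full-precision result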
Thanks for the update. You could also look at this, which claims better performance than the bitsandbytes version of int8 quantization:
https://github.com/mit-han-lab/smoothquant
Thanks for the link @henyee! The SmoothQuant approach works great and can be applied during model conversion with only small changes to the code.
The PR above adds a new converter option --activation_scales to pass the pre-computed activation scales from SmoothQuant: https://github.com/mit-han-lab/smoothquant/tree/main/act_scales
For example:
ct2-transformers-converter --model facebook/opt-13b --activation_scales act_scales/opt-13b.pt --quantization int8 --output_dir opt-13b
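Roughly, this is what the activation scales achieve (toy NumPy sketch of the smoothing step, not the converter code; the real per-channel scales come from the pre-computed act_scales files):

import numpy as np

# Per input channel j, activations are divided by s_j and the matching weight
# row is multiplied by s_j, so X @ W is mathematically unchanged but the
# activation outliers are migrated into the weights, which quantize better.
alpha = 0.5
x = np.random.randn(4, 8).astype(np.float32)
x[:, 2] *= 100  # channel 2 has activation outliers
w = np.random.randn(8, 16).astype(np.float32)

s = np.abs(x).max(axis=0) ** alpha / np.abs(w).max(axis=1) ** (1 - alpha)
x_smooth = x / s           # activations become much easier to quantize
w_smooth = w * s[:, None]  # weights absorb the scale

print(np.abs(x @ w - x_smooth @ w_smooth).max())  # negligible: the product is unchanged
print(np.abs(x).max(), np.abs(x_smooth).max())    # outlier magnitude is greatly reduced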
I validated the change with the following code snippet:
import ctranslate2
import transformers
ctranslate2.set_random_seed(42)
tokenizer = transformers.GPT2Tokenizer.from_pretrained("facebook/opt-125m")
generator = ctranslate2.Generator(<MODEL_PATH>)
prompt = "The woman worked as a"
start_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))
results = generator.generate_batch([start_tokens], max_length=50, sampling_topk=10)
output_tokens = results[0].sequences[0]
output = tokenizer.decode(tokenizer.convert_tokens_to_ids(output_tokens))
print(output)
Here are the outputs:
| Quantization | Output |
|---|---|
| None | The woman worked as a nurse in the hospital, and had gone to help her father who was suffering from a chronic illness. She was supposed to stay with him for some days, but she had gone missing for the last two days. The |
| INT8 (without smoothing) | The woman worked as a A A A A A A A. A, A A A A A A A |
| INT8 (with smoothing) | The woman worked as a nurse in the hospital, and her mother was admitted there as well. The woman was not allowed to meet her family member for three days, until the mother died. “I was not allowed to meet my mother |
The output is still not identical, but this looks good enough to me, so I will consider the issue resolved once the PR is merged.
Excellent news! Happy new year!
Thanks so much for your work, @guillaumekln. It sounds like it is indeed fixed now! Thanks again.