smoothquant
How to implement this method combined with a decoder
Details:
1. The open-source code only supports the first round of text input: if we enable past_key_values (the KV cache), we get an error because the dimensions of the attention mask do not match.
The following code does not work correctly, unlike opt-6.7b before quantization:
import torch
from opt import Int8OPTForCausalLM
from transformers import AutoTokenizer

model_smoothquant = Int8OPTForCausalLM.from_pretrained(
    '/data1/lileilai/opt-6.7b-smoothquant/', torch_dtype=torch.float16, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained("/data1/lileilai/opt-6.7b/")

input_sentences = [
    'In the last couple of days, a',
    'The New Jersey Department of Transportation is aware',
    'The New York Giants have a new head',
    'The New York Times has published its annual',
    'In a move that will likely make it',
    "The New York Giants' offensive linemen",
    'The Canadian Press has unanimously condemned the new',
    'The first time I saw the movie,',
]
batch_size = 8
inputs = input_sentences[:batch_size]
generate_kwargs = dict(max_new_tokens=100, do_sample=False)

def generate(model=None):
    # Tokenize the batch and move all tensors to the GPU.
    input_tokens = tokenizer.batch_encode_plus(input_sentences, return_tensors="pt", padding=True)
    for t in input_tokens:
        if torch.is_tensor(input_tokens[t]):
            input_tokens[t] = input_tokens[t].to("cuda:0")
    outputs = model.generate(**input_tokens, **generate_kwargs)
    # Count how many new tokens were generated for each prompt.
    input_tokens_lengths = [x.shape[0] for x in input_tokens.input_ids]
    output_tokens_lengths = [x.shape[0] for x in outputs]
    total_new_tokens = [o - i for i, o in zip(input_tokens_lengths, output_tokens_lengths)]
    outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return zip(inputs, outputs, total_new_tokens)

generate(model=model_smoothquant)
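
For what it's worth, a workaround (not a fix) is to disable the KV cache so generate() never takes the past_key_values path that triggers the mask-length check, at the cost of recomputing the whole prefix at every decoding step. A minimal sketch, reusing the names from the snippet above; use_cache is a standard generate() argument:

# Hedged workaround sketch: bypass the KV cache entirely. Much slower, since each
# new token recomputes attention over the full sequence, but it avoids the
# past/attention-mask length mismatch.
generate_kwargs = dict(max_new_tokens=100, do_sample=False, use_cache=False)

for prompt, completion, n_new in generate(model=model_smoothquant):
    print(f"{n_new:4d} new tokens | {completion!r}")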

Any progress on fixing this issue? @Guangxuan-Xiao
Did you figure out how to use model.generate with SmoothQuant? I executed your code and got:
ValueError: The provided attention mask has length 25, but its length should be 32 (sum of the lengths of current and past inputs)
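
If it helps with debugging: in the transformers version this code builds on, OPTDecoder infers the past length from past_key_values[0][0].shape[2] and then requires the attention mask length to equal past length + current length, which is exactly the check this ValueError comes from. A hedged sketch (reusing the model/tokenizer names from the snippet above, which is an assumption on my side) to inspect what the int8 attention layers actually cache:

# Debugging sketch: a single forward pass with use_cache=True still works (the
# mismatch only appears once there is a past, i.e. from the second decoding step
# on), so we can look at the cached key/value layout the decoder uses to infer
# the past length.
import torch

with torch.no_grad():
    enc = tokenizer(input_sentences, return_tensors="pt", padding=True).to("cuda:0")
    out = model_smoothquant(**enc, use_cache=True)

k, v = out.past_key_values[0]  # layer-0 cache
print("padded input length :", enc.input_ids.shape[1])
print("cached key shape    :", tuple(k.shape))
print("cached value shape  :", tuple(v.shape))
# The decoder reads the past length from dim 2 of the cached keys. If the int8
# attention caches them in a different layout (or pads the sequence for the
# int8 kernels), this number will not equal the real past length, which would
# explain the "length should be N" mismatch above.
print("inferred past length:", k.shape[2])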