smoothquant
How to implement this method combined with a decoder
Details:
1. The open-source code only supports the first round of text input: if we enable past_key_values (the KV cache), we get an error because the dimensions of the attention mask do not match.
The following code does not work correctly, unlike opt-6.7b before quantization:
import torch
from opt import Int8OPTForCausalLM
from transformers import AutoTokenizer

model_smoothquant = Int8OPTForCausalLM.from_pretrained(
    '/data1/lileilai/opt-6.7b-smoothquant/', torch_dtype=torch.float16, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained("/data1/lileilai/opt-6.7b/")

input_sentences = [
    'In the last couple of days, a',
    'The New Jersey Department of Transportation is aware',
    'The New York Giants have a new head',
    'The New York Times has published its annual',
    'In a move that will likely make it',
    "The New York Giants' offensive linemen",
    'The Canadian Press has unanimously condemned the new',
    'The first time I saw the movie,',
]
batch_size = 8
inputs = input_sentences[:batch_size]
generate_kwargs = dict(max_new_tokens=100, do_sample=False)

def generate(model=None):
    # Tokenize the batch and move all tensors to the GPU.
    input_tokens = tokenizer.batch_encode_plus(input_sentences, return_tensors="pt", padding=True)
    for t in input_tokens:
        if torch.is_tensor(input_tokens[t]):
            input_tokens[t] = input_tokens[t].to("cuda:0")
    outputs = model.generate(**input_tokens, **generate_kwargs)
    # Count how many new tokens were generated for each prompt.
    input_tokens_lengths = [x.shape[0] for x in input_tokens.input_ids]
    output_tokens_lengths = [x.shape[0] for x in outputs]
    total_new_tokens = [o - i for i, o in zip(input_tokens_lengths, output_tokens_lengths)]
    outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return zip(inputs, outputs, total_new_tokens)

generate(model=model_smoothquant)
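
For what it's worth, a workaround (not a fix) is to disable the KV cache so generate() never takes the past_key_values path that triggers the mask-length check, at the cost of recomputing the whole prefix at every decoding step. A minimal sketch, reusing the names from the snippet above; use_cache is a standard generate() argument:

# Hedged workaround sketch: bypass the KV cache entirely. Much slower, since each
# new token recomputes attention over the full sequence, but it avoids the
# past/attention-mask length mismatch.
generate_kwargs = dict(max_new_tokens=100, do_sample=False, use_cache=False)

for prompt, completion, n_new in generate(model=model_smoothquant):
    print(f"{n_new:4d} new tokens | {completion!r}")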

Any progress on fixing this issue? @Guangxuan-Xiao
Did you figure out how to use model.generate with SmoothQuant? I executed your code and got:
ValueError: The provided attention mask has length 25, but its length should be 32 (sum of the lengths of current and past inputs)
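
If it helps with debugging: in the transformers version this code builds on, OPTDecoder infers the past length from past_key_values[0][0].shape[2] and then requires the attention mask length to equal past length + current length, which is exactly the check this ValueError comes from. A hedged sketch (reusing the model/tokenizer names from the snippet above, which is an assumption on my side) to inspect what the int8 attention layers actually cache:

# Debugging sketch: a single forward pass with use_cache=True still works (the
# mismatch only appears once there is a past, i.e. from the second decoding step
# on), so we can look at the cached key/value layout the decoder uses to infer
# the past length.
import torch

with torch.no_grad():
    enc = tokenizer(input_sentences, return_tensors="pt", padding=True).to("cuda:0")
    out = model_smoothquant(**enc, use_cache=True)

k, v = out.past_key_values[0]  # layer-0 cache
print("padded input length :", enc.input_ids.shape[1])
print("cached key shape    :", tuple(k.shape))
print("cached value shape  :", tuple(v.shape))
# The decoder reads the past length from dim 2 of the cached keys. If the int8
# attention caches them in a different layout (or pads the sequence for the
# int8 kernels), this number will not equal the real past length, which would
# explain the "length should be N" mismatch above.
print("inferred past length:", k.shape[2])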