optimum-intel
Quantized `flan-t5-large` RuntimeError - empty_strided not supported on quantized tensors yet
I have applied dynamic quantization to a flan-t5-large model. However, when I try to evaluate the generated summaries, I get this error:
RuntimeError: empty_strided not supported on quantized tensors yet see https://github.com/pytorch/pytorch/issues/74540
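The quantization step itself is not shown below. For context, here is a minimal sketch of how dynamic quantization is typically applied with these library versions; the config values and the save_directory path are assumptions, not the script actually used for this report:
# Sketch only (assumed setup), not the original quantization script.
from transformers import AutoModelForSeq2SeqLM
from neural_compressor.config import PostTrainingQuantConfig
from optimum.intel.neural_compressor import INCQuantizer

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")
quantizer = INCQuantizer.from_pretrained(model)
quantizer.quantize(
    quantization_config=PostTrainingQuantConfig(approach="dynamic"),  # post-training dynamic quantization
    save_directory="flan-t5-large-int8-dynamic",  # assumed output path
)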
Code:
from tqdm import tqdm
from optimum.intel.neural_compressor import INCModelForSeq2SeqLM

# model_name, device, tokenizer, examples, batch_size, prefix, chunks and
# generate_kwargs are defined earlier in the evaluation script (not shown).
model = INCModelForSeq2SeqLM.from_pretrained(model_name).to(device)

for examples_chunk in tqdm(list(chunks(examples, batch_size))):
    examples_chunk = [prefix + text for text in examples_chunk]
    batch = tokenizer(examples_chunk, return_tensors="pt", truncation=True, padding="longest").to(device)
    summaries = model.generate(
        input_ids=batch.input_ids,
        attention_mask=batch.attention_mask,
        **generate_kwargs,
    )
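The snippet relies on a few names defined elsewhere in the evaluation script. Purely for illustration, hypothetical stand-ins (not taken from the original code) could look like this:
from transformers import AutoTokenizer

# Hypothetical definitions for the names used above.
model_name = "flan-t5-large-int8-dynamic"  # assumed path to the saved quantized model
tokenizer = AutoTokenizer.from_pretrained(model_name)
prefix = "summarize: "  # assumed task prefix for T5-style summarization
generate_kwargs = {"max_length": 128, "num_beams": 4}  # assumed generation settings

def chunks(lst, n):
    # Yield successive batches of size n from lst.
    for i in range(0, len(lst), n):
        yield lst[i : i + n]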
Dependencies:
transformers 4.26.1
neural-compressor 2.1
optimum-intel 1.7.3
torch 2.0.0
Traceback:
evaluation script (filename not captured in the rich traceback), line 64
    summaries = model.generate(
File "/home/mrshu/miniconda3/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
File "/home/mrshu/miniconda3/lib/python3.9/site-packages/transformers/generation/utils.py", line 1252, in generate
    model_kwargs = self._prepare_encoder_decoder_kwargs_for_generation(
File "/home/mrshu/miniconda3/lib/python3.9/site-packages/transformers/generation/utils.py", line 617, in _prepare_encoder_decoder_kwargs_for_generation
    model_kwargs["encoder_outputs"]: ModelOutput = encoder(**encoder_kwargs)
File "/home/mrshu/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
File "/home/mrshu/miniconda3/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 1055, in forward
    layer_outputs = layer_module(
File "/home/mrshu/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
File "/home/mrshu/miniconda3/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 739, in forward
    hidden_states = self.layer[-1](hidden_states)
File "/home/mrshu/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
File "/home/mrshu/miniconda3/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 336, in forward
    forwarded_states = self.DenseReluDense(forwarded_states)
File "/home/mrshu/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
File "/home/mrshu/miniconda3/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 317, in forward
    hidden_states = hidden_states.to(self.wo.weight.dtype)
RuntimeError: empty_strided not supported on quantized tensors yet see https://github.com/pytorch/pytorch/issues/74540
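The failing frame is the dtype cast in T5DenseGatedActDense.forward: after dynamic quantization, self.wo.weight is a quantized (presumably qint8) tensor, so the float activations are cast to a quantized dtype, and PyTorch cannot allocate the result because empty_strided has no quantized implementation. If that reading is right, a minimal, model-free reproduction would be:
import torch

# Assumption: the quantized wo.weight reports dtype torch.qint8, so
# modeling_t5.py line 317 effectively runs float_activations.to(torch.qint8).
hidden_states = torch.randn(2, 4)
hidden_states.to(torch.qint8)  # expected to raise: empty_strided not supported on quantized tensors yet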
@jmdu99 Could you share the code you used for quantization and generation, to give the maintainers more context for addressing this issue?
Updated