
Contrastive Search in .generate() function doesn't work with Half

Open sam-ulrich1 opened this issue 2 years ago • 26 comments

System Info

The CLI fails but this is irrelevant to the problem

Who can help?

@gante

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

  1. Load any model in float16 like so
from transformers import AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained(
    "<PATH>",
    torch_dtype=torch.float16,
)
  2. Perform generation using contrastive search (tokenized_input is the tokenizer output for your prompt)
gen_tokens = model.generate(
    tokenized_input.input_ids,
    top_k=4,
    penalty_alpha=0.6,
)

Expected behavior

Contrastive search probably should work with torch.float16 (if not, just let me know; I don't know whether there are stability issues).

This can be fixed by adding the following code to https://github.com/huggingface/transformers/blob/25ddd91b249014d818fb2ed3d4ba856ed9a5653e/src/transformers/generation/utils.py#L1873

# conditionally convert from float16
if context_hidden.dtype == torch.float16:
    context_hidden = context_hidden.to(dtype=torch.float32)
if next_hidden.dtype == torch.float16:
    next_hidden = next_hidden.to(dtype=torch.float32)
if top_k_probs.dtype == torch.float16:
    top_k_probs = top_k_probs.to(dtype=torch.float32)
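For reference, the ranking step this feeds into (the degeneration-penalty scoring from the contrastive search paper) looks roughly like the sketch below, with the fp16 upcast applied up front. This is an illustrative sketch only: ranking_with_upcast is a hypothetical name and the shapes are assumed, it is not the exact code in generation/utils.py.

import torch

def ranking_with_upcast(context_hidden, next_hidden, top_k_probs, penalty_alpha, top_k):
    # Hypothetical helper, for illustration only.
    # Assumed shapes:
    #   context_hidden: [batch * top_k, context_len, hidden]
    #   next_hidden:    [batch * top_k, 1, hidden]
    #   top_k_probs:    [batch, top_k]

    # Conditionally upcast fp16 inputs so the ops below run in fp32.
    if context_hidden.dtype == torch.float16:
        context_hidden = context_hidden.float()
    if next_hidden.dtype == torch.float16:
        next_hidden = next_hidden.float()
    if top_k_probs.dtype == torch.float16:
        top_k_probs = top_k_probs.float()

    # Cosine similarity between each candidate and its context tokens.
    norm_context = context_hidden / context_hidden.norm(dim=2, keepdim=True)
    norm_next = next_hidden / next_hidden.norm(dim=2, keepdim=True)
    cosine_sim = torch.matmul(norm_context, norm_next.transpose(1, 2)).squeeze(-1)
    degeneration_penalty, _ = torch.max(cosine_sim, dim=-1)

    # Contrastive score: model confidence minus the degeneration penalty.
    scores = (1.0 - penalty_alpha) * top_k_probs.view(-1) - penalty_alpha * degeneration_penalty
    return scores.view(-1, top_k).argmax(dim=-1)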

sam-ulrich1 avatar Jan 17 '23 14:01 sam-ulrich1

Hey @sam-ulrich1 👋

To be candid, fp16 was not a concern when writing contrastive search :) I've tried adding your suggested change and running the script below, but that was not enough to fix it

from transformers import GPT2Tokenizer, OPTForCausalLM
import torch

tokenizer = GPT2Tokenizer.from_pretrained("facebook/opt-350m", padding_side='left')
model = OPTForCausalLM.from_pretrained("facebook/opt-350m", torch_dtype=torch.float16)

inputs = tokenizer(["My cat is"], return_tensors="pt")

outputs = model.generate(**inputs, top_k=4, penalty_alpha=0.6)
print(tokenizer.batch_decode(outputs))

Would you be able to share a snippet of what you're trying to run? :)

gante avatar Jan 17 '23 15:01 gante

Odd! It works on my machine (pun intended)!

Let me get my version and other info, and I can make a PR if you'd like. That way we can work from code, not just snippets.

sam-ulrich1 avatar Jan 17 '23 15:01 sam-ulrich1

@gante Could you share your traceback? I'll take a look at this later today

sam-ulrich1 avatar Jan 17 '23 15:01 sam-ulrich1

@sam-ulrich1 haha roles reversed, usually I'm the one asking for tracebacks!

Traceback (most recent call last):
  File "/home/joao/transformers/../joao_scripts/dbg.py", line 17, in <module>
    outputs = model.generate(**inputs, top_k=4, penalty_alpha=0.6)
  File "/home/joao/hf/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/joao/transformers/src/transformers/generation/utils.py", line 1372, in generate
    return self.contrastive_search(
  File "/home/joao/hf/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/joao/transformers/src/transformers/generation/utils.py", line 1769, in contrastive_search
    outputs = self(
  File "/home/joao/hf/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/joao/transformers/src/transformers/models/opt/modeling_opt.py", line 934, in forward
    outputs = self.model.decoder(
  File "/home/joao/hf/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/joao/transformers/src/transformers/models/opt/modeling_opt.py", line 645, in forward
    inputs_embeds = self.project_in(inputs_embeds)
  File "/home/joao/hf/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/joao/hf/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'

gante avatar Jan 17 '23 15:01 gante

Ya I got a kick out of that too!

It actually looks like that is an OPT issue with Half. I'm playing around with CodeGen, so that would be my reference, but I know other models are affected as well. Currently the problem I'm targeting is "baddbmm_with_gemm" not implemented for 'Half'.

I'll take a look at the OPT thing as well but if it's out of scope I'll probably start another issue to keep the tracking simple.

sam-ulrich1 avatar Jan 17 '23 15:01 sam-ulrich1

@gante I'm not gonna get this done today, but I'll get it knocked out by the end of the week. I just have a busier week than I expected.

sam-ulrich1 avatar Jan 18 '23 01:01 sam-ulrich1

@sam-ulrich1 no worries :) and let me know if you need a hand!

gante avatar Jan 18 '23 10:01 gante

@gante How do I run the tests in the repo? I added the test below at the linked location so that I can validate my fix. I want to run it on the CodeGen model, but I've never worked with a testing setup like this: https://github.com/huggingface/transformers/blob/0359e2e15f4504513fd2995bdd6dd654c747b313/tests/generation/test_utils.py#L1432

    def test_contrastive_generate_fp16(self):
        # check `generate()` and `contrastive_search()` are equal
        for model_class in self.all_generative_model_classes:

            # won't fix: FSMT and Reformer have a different cache variable type (and format).
            if any(model_name in model_class.__name__.lower() for model_name in ["fsmt", "reformer"]):
                return

            config, input_ids, attention_mask, max_length = self._get_input_ids_and_config()

            # NOTE: contrastive search only works with cache on at the moment.
            if not hasattr(config, "use_cache"):
                return
            config.use_cache = True
            config.is_decoder = True
            config.torch_dtype = torch.float16

            # test old generation output for backwards compatibility
            model = model_class(config).to(torch_device).eval()
            output_contrastive, output_generate = self._contrastive_generate(
                model=model, input_ids=input_ids, attention_mask=attention_mask, max_length=max_length
            )
            self.assertListEqual(output_contrastive.tolist(), output_generate.tolist())

sam-ulrich1 avatar Jan 19 '23 14:01 sam-ulrich1

@sam-ulrich1 try py.test tests/ -k contrastive_generate_fp16 -vv, assuming you are in .../transformers/.

(tests/ is the folder containing the test files, -k filters tests by name, contrastive_generate_fp16 is the test name filter based on your test name)
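Since the generation tests are mixed into each model's test suite, you should also be able to restrict the run to CodeGen with something like py.test tests/models/codegen/test_modeling_codegen.py -k contrastive_generate_fp16 -vv (the path is assumed from the usual tests/models/<model> layout).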

gante avatar Jan 19 '23 14:01 gante

Thanks!

sam-ulrich1 avatar Jan 19 '23 14:01 sam-ulrich1

@gante Okay, it seems to be fixed, but there is one model that fails the test for (what appears to be) an unrelated problem. What's the procedure for this? Can y'all accept a PR if not all the tests pass?

Here's the failing model:

FAILED tests/models/git/test_modeling_git.py::GitModelTest::test_contrastive_generate_fp16 - RuntimeError: output with shape [10, 1, 1, 1] doesn't match the broadcast shape [10, 1, 1, 4]

And the pytest stack trace:

___________________________________________________________________________________________________________ GitModelTest.test_contrastive_generate_fp16 ____________________________________________________________________________________________________________

self = <tests.models.git.test_modeling_git.GitModelTest testMethod=test_contrastive_generate_fp16>

    def test_contrastive_generate_fp16(self):
        # check `generate()` and `contrastive_search()` are equal
        for model_class in self.all_generative_model_classes:
    
            # won't fix: FSMT and Reformer have a different cache variable type (and format).
            if any(model_name in model_class.__name__.lower() for model_name in ["fsmt", "reformer"]):
                return
    
            config, input_ids, attention_mask, max_length = self._get_input_ids_and_config()
    
            # NOTE: contrastive search only works with cache on at the moment.
            if not hasattr(config, "use_cache"):
                return
            config.use_cache = True
            config.is_decoder = True
            config.torch_dtype = torch.float16
            print(config)
    
            # test old generation output for backwards compatibility
            model = model_class(config).to(torch_device).eval()
>           output_contrastive, output_generate = self._contrastive_generate(
                model=model, input_ids=input_ids, attention_mask=attention_mask, max_length=max_length
            )

tests/generation/test_utils.py:1453: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/generation/test_utils.py:655: in _contrastive_generate
    output_generate = model.generate(
../../../anaconda3/envs/transformers/lib/python3.9/site-packages/torch/autograd/grad_mode.py:27: in decorate_context
    return func(*args, **kwargs)
src/transformers/generation/utils.py:1321: in generate
    return self.contrastive_search(
../../../anaconda3/envs/transformers/lib/python3.9/site-packages/torch/autograd/grad_mode.py:27: in decorate_context
    return func(*args, **kwargs)
src/transformers/generation/utils.py:1804: in contrastive_search
    outputs = self(
../../../anaconda3/envs/transformers/lib/python3.9/site-packages/torch/nn/modules/module.py:1194: in _call_impl
    return forward_call(*input, **kwargs)
src/transformers/models/git/modeling_git.py:1478: in forward
    outputs = self.git(
../../../anaconda3/envs/transformers/lib/python3.9/site-packages/torch/nn/modules/module.py:1194: in _call_impl
    return forward_call(*input, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = GitModel(
  (embeddings): GitEmbeddings(
    (word_embeddings): Embedding(99, 32, padding_idx=98)
    (position_embedd...n_features=768, out_features=32, bias=True)
      (1): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
    )
  )
)
input_ids = tensor([[36],
        [64],
        [41],
        [89],
        [58],
        [72],
        [41],
        [ 2],
        [36],
        [64]], device='cuda:0')
attention_mask = tensor([[1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1]], device='cuda:0')
position_ids = None, pixel_values = None, head_mask = [None, None, None, None, None], inputs_embeds = None, past_key_values = None, use_cache = True, output_attentions = False, output_hidden_states = True, return_dict = True

    @add_start_docstrings_to_model_forward(GIT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
    @replace_return_docstrings(output_type=BaseModelOutputWithPooling, config_class=_CONFIG_FOR_DOC)
    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.Tensor] = None,
        pixel_values: Optional[torch.Tensor] = None,
        head_mask: Optional[torch.Tensor] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
        past_key_values: Optional[List[torch.FloatTensor]] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple[torch.Tensor], BaseModelOutputWithPooling]:
        r"""
        past_key_values (`tuple(tuple(torch.FloatTensor))` of length `config.n_layers` with each tuple having 4 tensors of shape `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
            Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
    
            If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that
            don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
            `decoder_input_ids` of shape `(batch_size, sequence_length)`.
        use_cache (`bool`, *optional*):
            If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
            `past_key_values`).
    
        Returns:
    
        Examples:
    
        ```python
        >>> from transformers import AutoProcessor, AutoModel
        >>> import requests
        >>> from PIL import Image
    
        >>> processor = AutoProcessor.from_pretrained("microsoft/git-base")
        >>> model = AutoModel.from_pretrained("microsoft/git-base")
    
        >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
        >>> image = Image.open(requests.get(url, stream=True).raw)
    
        >>> text = "this is an image of two cats"
    
        >>> inputs = processor(text, images=image, return_tensors="pt")
    
        >>> outputs = model(**inputs)
        >>> last_hidden_state = outputs.last_hidden_state
        ```"""
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        use_cache = use_cache if use_cache is not None else self.config.use_cache
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
    
        if input_ids is not None and inputs_embeds is not None:
            raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
        elif input_ids is not None:
            input_shape = input_ids.size()
        elif inputs_embeds is not None:
            input_shape = inputs_embeds.size()[:-1]
        else:
            raise ValueError("You have to specify either input_ids or inputs_embeds")
    
        seq_length = input_shape[1]
    
        # past_key_values_length
        past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0
    
        # Prepare head mask if needed
        # 1.0 in head_mask indicate we keep the head
        # attention_probs has shape bsz x n_heads x N x N
        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
    
        projected_visual_features = None
        if pixel_values is not None:
            if pixel_values.ndim == 4:
                # here we assume pixel_values is of shape (batch_size, num_channels, height, width)
                visual_features = self.image_encoder(pixel_values).last_hidden_state
    
            elif pixel_values.ndim == 5:
                # here we assume pixel_values is of shape (batch_size, num_frames, num_channels, height, width)
                visual_features = []
                for frame_idx in range(pixel_values.shape[1]):
                    visual_features_frame = self.image_encoder(pixel_values[:, frame_idx, :, :]).last_hidden_state
                    visual_features_frame += self.img_temperal_embedding[frame_idx]
                    visual_features.append(visual_features_frame)
    
                # finally, concatenate all features along sequence dimension
                visual_features = torch.cat(visual_features, dim=1)
    
            else:
                raise ValueError("pixel_values must be of rank 4 or 5")
    
            projected_visual_features = self.visual_projection(visual_features)
    
        embedding_output = self.embeddings(
            input_ids=input_ids,
            position_ids=position_ids,
            inputs_embeds=inputs_embeds,
            past_key_values_length=past_key_values_length,
        )
    
        if projected_visual_features is None:
            projected_visual_features = torch.zeros(
                (embedding_output.shape[0], 0, embedding_output.shape[2]),
                dtype=embedding_output.dtype,
                device=embedding_output.device,
            )
    
        # Repeat visual features to match embedding batch size.
        projected_visual_features = projected_visual_features.repeat(
            embedding_output.size(0) // projected_visual_features.size(0), 1, 1
        )
    
        # concatenate patch token and text token embeddings
        hidden_states = torch.cat((projected_visual_features, embedding_output), dim=1)
    
        # By default, an additive causal mask is created
        # for masking the future (one direction).
        tgt_mask = self._generate_future_mask(seq_length, embedding_output.dtype, embedding_output.device)
    
        # Create an attention mask of shape (batch_size, 1, tgt_seq_len, src_seq_len)
        combined_attention_mask = self.create_attention_mask(
            tgt=embedding_output,
            memory=projected_visual_features,
            tgt_mask=tgt_mask,
            past_key_values_length=past_key_values_length,
        )
    
        if attention_mask is not None:
            # if the user provides an attention mask, we add it to the default one
            # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
            expanded_attn_mask = _expand_mask(attention_mask, embedding_output.dtype, tgt_len=input_shape[-1]).to(
                embedding_output.device
            )
            if past_key_values_length > 0:
                expanded_attn_mask = expanded_attn_mask[:, :, -past_key_values_length:, :]
            else:
>               combined_attention_mask[:, :, -input_shape[1] :, -input_shape[1] :] += expanded_attn_mask
E               RuntimeError: output with shape [10, 1, 1, 1] doesn't match the broadcast shape [10, 1, 1, 4]

sam-ulrich1 avatar Jan 19 '23 15:01 sam-ulrich1

Oh yeah, GIT is a bit different -- it's a multimodal model that requires careful manipulations at generate time. Open a PR with what you have now, I think I can figure out what's wrong with GIT after I have access to the changes :)

gante avatar Jan 20 '23 10:01 gante

Jumping in here: the error RuntimeError: "addmm_impl_cpu_" not implemented for 'Half' just means that Half only works on GPU and should not be used on CPU 😉
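For illustration, here is a minimal sketch of the earlier OPT repro with the model and inputs kept on the GPU (assuming a CUDA device is available), which sidesteps the missing CPU Half kernels:

from transformers import GPT2Tokenizer, OPTForCausalLM
import torch

tokenizer = GPT2Tokenizer.from_pretrained("facebook/opt-350m", padding_side='left')
model = OPTForCausalLM.from_pretrained("facebook/opt-350m", torch_dtype=torch.float16).to("cuda")

# Keeping the inputs on the same device as the fp16 weights avoids the addmm_impl_cpu_ error.
inputs = tokenizer(["My cat is"], return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, top_k=4, penalty_alpha=0.6)
print(tokenizer.batch_decode(outputs))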

ArthurZucker avatar Jan 20 '23 15:01 ArthurZucker

That would make a lot of sense! I didn't address that error in this fix; I focused on "baddbmm_with_gemm" not implemented for 'Half', but I can take a look at that error over the weekend if you'd like.

sam-ulrich1 avatar Jan 20 '23 15:01 sam-ulrich1

@gante The fix is here, rebased to the latest commit on main, but the PR guidelines are kinda long so I won't be able to create the PR until later: https://github.com/gage-technologies/transformers

sam-ulrich1 avatar Jan 20 '23 15:01 sam-ulrich1

I am having this issue as well. I tried 4.26 and 4.25.1. I am gonna try @sam-ulrich1's solution.

mallorbc avatar Feb 09 '23 06:02 mallorbc

The fix did not help, neither with DeepSpeed nor with vanilla Transformers. Using bfloat16 gives me the expected results (but I need float16 for DeepSpeed).

mallorbc avatar Feb 09 '23 07:02 mallorbc

I take back what I said. I am not having this issue at all. With or without @sam-ulrich1's fix, it is working fine. The issue is with DeepSpeed.

mallorbc avatar Feb 09 '23 07:02 mallorbc

I'm also facing a similar issue:

from transformers import pipeline

generator = pipeline("text2text-generation", model="philschmid/flan-t5-xxl-sharded-fp16", model_kwargs={"load_in_8bit": True, "device_map": "auto"})
output = generator(prompt, penalty_alpha=0.6, top_k=4, max_length=256)

Gives me the error:

RuntimeError: "softmax_lastdim_kernel_impl" not implemented for 'Half'

So contrastive search seems incompatible with loading the model in 8-bit. Is that expected, or a bug?

apolinario avatar Feb 13 '23 09:02 apolinario

@sam-ulrich1 do you have some updates on your end? I can open a PR from the changes in your fork, if you're interested :)

gante avatar Feb 13 '23 10:02 gante

@gante Shoot! Sorry man, this slipped my mind. Let me take a look at the PR guidelines again and see if I can get mine rebased and prepped; if not, I'm happy to let you.

Thanks man!

sam-ulrich1 avatar Feb 13 '23 12:02 sam-ulrich1

Just to flag, the error I faced here still exists with @sam-ulrich1's fix. Should I open a new Issue as this may be related specifically to 8-bit?

apolinario avatar Feb 13 '23 21:02 apolinario

@gante I'm gonna look at this today. Sorry man, I've been slammed with work the past month

sam-ulrich1 avatar Feb 18 '23 10:02 sam-ulrich1

@gante If you want to just snag my changes go ahead otherwise I will eventually get to this it's just been a really tough few weeks

sam-ulrich1 avatar Feb 21 '23 23:02 sam-ulrich1

BTW, I'm not sure if this fix is still needed; I am unable to reproduce the issue on main.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tok = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2", torch_dtype=torch.float16).to("cuda")

inputs = tok(["This cat is"], return_tensors="pt").to("cuda")
gen_out = model.generate(**inputs, top_k=4, penalty_alpha=0.6)

If someone else comes across this issue, please let me know 🙏

gante avatar Feb 22 '23 14:02 gante

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Mar 18 '23 15:03 github-actions[bot]