LlamaForSequenceClassification forward method shows different results with input_ids/inputs_embeds
System Info
transformers 4.44.0
Who can help?
@ArthurZucker
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
```python
import torch
from transformers import AutoTokenizer, LlamaForSequenceClassification

llama_tokenizer = AutoTokenizer.from_pretrained("../Meta-Llama-3.2-1B-Instruct", padding_side="right")
llama_tokenizer.pad_token = "<|finetune_right_pad_id|>"
llama_model = LlamaForSequenceClassification.from_pretrained(
    "../Meta-Llama-3.2-1B-Instruct",
    num_labels=1,
    torch_dtype=torch.bfloat16,
)
```
```python
from typing import Optional, Tuple

import torch.nn as nn


class CustomEmbeddingModel_input_embeds(nn.Module):
    def __init__(self, original_model, tokenizer):
        super(CustomEmbeddingModel_input_embeds, self).__init__()
        self.original_model = original_model

    def forward(
        self,
        input_ids: Optional[torch.LongTensor] = None,
        attention_mask: Optional[torch.FloatTensor] = None,
        past_key_values: Optional[Tuple[Tuple[torch.Tensor]]] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        labels: Optional[torch.LongTensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ):
        if inputs_embeds is None:
            # the embedding table lives on the wrapped model (self.original_model.model)
            inputs_embeds = self.original_model.model.embed_tokens(input_ids)
        return self.original_model(
            input_ids=None,
            attention_mask=attention_mask,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            labels=labels,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )


llama_model_input_embeds = CustomEmbeddingModel_input_embeds(llama_model, llama_tokenizer)
```
```python
class CustomEmbeddingModel_input_ids(nn.Module):
    def __init__(self, original_model, tokenizer):
        super(CustomEmbeddingModel_input_ids, self).__init__()
        self.original_model = original_model

    def forward(
        self,
        input_ids: Optional[torch.LongTensor] = None,
        attention_mask: Optional[torch.FloatTensor] = None,
        past_key_values: Optional[Tuple[Tuple[torch.Tensor]]] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        labels: Optional[torch.LongTensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ):
        if inputs_embeds is None:
            # computed for symmetry with the wrapper above, but not used in this variant
            inputs_embeds = self.original_model.model.embed_tokens(input_ids)
        return self.original_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            past_key_values=past_key_values,
            inputs_embeds=None,
            labels=labels,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )


llama_model_input_ids = CustomEmbeddingModel_input_ids(llama_model, llama_tokenizer)
```
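For reference, here is a minimal way the difference between the two wrappers could be surfaced. This is illustrative only: the texts are made up, and `config.pad_token_id` is set from the tokenizer as discussed later in this thread (it is required for batch sizes > 1).

```python
llama_model.config.pad_token_id = llama_tokenizer.pad_token_id  # see the discussion below

texts = ["short example", "a noticeably longer example sentence that forces right padding"]
batch = llama_tokenizer(texts, padding=True, return_tensors="pt")

with torch.no_grad():
    out_ids = llama_model_input_ids(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])
    out_embeds = llama_model_input_embeds(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])

# The input_ids path pools the last non-pad position, while the inputs_embeds path
# falls back to the last position (a pad token for the shorter sequence), so the
# logits for the shorter sequence can differ between the two calls.
print(out_ids.logits)
print(out_embeds.logits)
```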
Expected behavior
https://github.com/huggingface/transformers/blob/3f06f95ebe617b192251ef756518690f5bc7ff76/src/transformers/models/llama/modeling_llama.py#L1314

`sequence_lengths` is computed only from `input_ids`; when `inputs_embeds` is used instead, it falls back to the default of -1. However, the forward method of `LlamaModel` does not accept both `input_ids` and `inputs_embeds` at the same time.
When I tried to use `inputs_embeds`, I passed both `input_ids` and `inputs_embeds` and set
```python
transformer_outputs = self.model(
    None,  # input_ids
    attention_mask=attention_mask,
    position_ids=position_ids,
    past_key_values=past_key_values,
    inputs_embeds=inputs_embeds,
    use_cache=use_cache,
    output_attentions=output_attentions,
    output_hidden_states=output_hidden_states,
    return_dict=return_dict,
)
```
in the forward method of `LlamaForSequenceClassification`.
input_embeds not checking pad token
```python
if self.config.pad_token_id is None:
    sequence_lengths = -1
else:
    if input_ids is not None:
        # if no pad token found, use modulo instead of reverse indexing for ONNX compatibility
        sequence_lengths = torch.eq(input_ids, self.config.pad_token_id).int().argmax(-1) - 1
        sequence_lengths = sequence_lengths % input_ids.shape[-1]
        sequence_lengths = sequence_lengths.to(logits.device)
    else:
        sequence_lengths = -1
```
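As a small worked example of what this snippet computes (the tensor values are made up, assuming `pad_token_id = 0` and right padding):

```python
import torch

pad_token_id = 0
input_ids = torch.tensor([
    [5, 6, 7, 0, 0],  # 3 real tokens -> first pad at index 3 -> last real index 2
    [5, 6, 7, 8, 9],  # no padding    -> argmax returns 0 -> -1, which the modulo maps to 4
])

sequence_lengths = torch.eq(input_ids, pad_token_id).int().argmax(-1) - 1
sequence_lengths = sequence_lengths % input_ids.shape[-1]
print(sequence_lengths)  # tensor([2, 4])
```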
According to the docstring in Llama's modeling file, checking whether `inputs_embeds` contains pad token embeddings is not implemented, because the pad token's embedding is unknown at that point:

> Since it cannot guess the padding tokens when `inputs_embeds` are passed instead of `input_ids`, it does the same (take the last value in each row of the batch).

However, I was wondering if it's possible to compare the provided `inputs_embeds` against the embedding of the pad token (retrieved via `pad_token_id`), rather than relying on simply taking the last value in each row. This would allow the model to explicitly identify pad token embeddings even when `inputs_embeds` are used.
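A rough sketch of what such a comparison could look like. The helper below is hypothetical (it is not part of transformers) and assumes the embeddings were produced unmodified by `model.model.embed_tokens`, so an exact or near-exact match against the pad embedding identifies padding positions:

```python
import torch

def sequence_lengths_from_embeds(model, inputs_embeds, pad_token_id):
    # embedding vector of the pad token, shape (hidden_size,)
    pad_embed = model.model.embed_tokens.weight[pad_token_id]
    # (batch, seq_len) boolean mask: True where a position equals the pad embedding
    is_pad = torch.isclose(inputs_embeds, pad_embed, atol=1e-6).all(dim=-1)
    # mirror the input_ids logic: first pad position minus one, modulo for rows without padding
    lengths = is_pad.int().argmax(dim=-1) - 1
    lengths = lengths % inputs_embeds.shape[1]
    return lengths
```

One caveat: this breaks down as soon as the embeddings are modified (e.g. soft prompts or added noise), which is presumably part of why the current implementation does not attempt it.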
cc @ArthurZucker maybe
Hey! I think you are missing:

```diff
 llama_model = LlamaForSequenceClassification.from_pretrained(
     "../Meta-Llama-3.2-1B-Instruct",
     num_labels=1,
     torch_dtype=torch.bfloat16,
+    pad_token_id=12000,
 )
```

with 12000 being the id (I invented it), no?
Thanks, but I've used

```python
llama_tokenizer = AutoTokenizer.from_pretrained("../Meta-Llama-3.2-1B-Instruct", padding_side="right")
llama_tokenizer.pad_token = "<|finetune_right_pad_id|>"
llama_model = LlamaForSequenceClassification.from_pretrained(
    "../Meta-Llama-3.2-1B-Instruct",
    num_labels=1,
    torch_dtype=torch.bfloat16,
)
llama_model.config.pad_token_id = llama_tokenizer.pad_token_id
```

which I thought was the same as setting `pad_token_id` directly in the `from_pretrained()` call.
But the embedding matrix is initialized before you set `config.pad_token_id`.
Thank you so much, this really solved the problem with my code!
Cool! Closing as fixed then! 🤗
Hi Arthur @ArthurZucker, do you think it is necessary to implement a check for the pad token's embedding when `inputs_embeds` is used as input and `pad_token_id` is set?
I'll take a look into it if it is worth doing.
Not sure it's worth it! 🤗