LlamaForSequenceClassification forward method shows different results with input_ids/inputs_embeds
System Info
transformers 4.44.0
Who can help?
@ArthurZucker
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
```python
import torch
from transformers import AutoTokenizer, LlamaForSequenceClassification

llama_tokenizer = AutoTokenizer.from_pretrained("../Meta-Llama-3.2-1B-Instruct", padding_side="right")
llama_tokenizer.pad_token = "<|finetune_right_pad_id|>"
llama_model = LlamaForSequenceClassification.from_pretrained(
    "../Meta-Llama-3.2-1B-Instruct",
    num_labels=1,
    torch_dtype=torch.bfloat16,
)
```
```python
from typing import Optional, Tuple

import torch.nn as nn


class CustomEmbeddingModel_input_embeds(nn.Module):
    def __init__(self, original_model, tokenizer):
        super(CustomEmbeddingModel_input_embeds, self).__init__()
        self.original_model = original_model

    def forward(
        self,
        input_ids: Optional[torch.LongTensor] = None,
        attention_mask: Optional[torch.FloatTensor] = None,
        past_key_values: Optional[Tuple[Tuple[torch.Tensor]]] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        labels: Optional[torch.LongTensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ):
        if inputs_embeds is None:
            # the embedding table lives on the wrapped model (self.original_model.model)
            inputs_embeds = self.original_model.model.embed_tokens(input_ids)
        return self.original_model(
            input_ids=None,
            attention_mask=attention_mask,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            labels=labels,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )


llama_model_input_embeds = CustomEmbeddingModel_input_embeds(llama_model, llama_tokenizer)
```
```python
class CustomEmbeddingModel_input_ids(nn.Module):
    def __init__(self, original_model, tokenizer):
        super(CustomEmbeddingModel_input_ids, self).__init__()
        self.original_model = original_model

    def forward(
        self,
        input_ids: Optional[torch.LongTensor] = None,
        attention_mask: Optional[torch.FloatTensor] = None,
        past_key_values: Optional[Tuple[Tuple[torch.Tensor]]] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        labels: Optional[torch.LongTensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ):
        if inputs_embeds is None:
            # computed for symmetry with the wrapper above, but not used in this variant
            inputs_embeds = self.original_model.model.embed_tokens(input_ids)
        return self.original_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            past_key_values=past_key_values,
            inputs_embeds=None,
            labels=labels,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )


llama_model_input_ids = CustomEmbeddingModel_input_ids(llama_model, llama_tokenizer)
```
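For reference, here is a minimal way the difference between the two wrappers could be surfaced. This is illustrative only: the texts are made up, and `config.pad_token_id` is set from the tokenizer as discussed later in this thread (it is required for batch sizes > 1).

```python
llama_model.config.pad_token_id = llama_tokenizer.pad_token_id  # see the discussion below

texts = ["short example", "a noticeably longer example sentence that forces right padding"]
batch = llama_tokenizer(texts, padding=True, return_tensors="pt")

with torch.no_grad():
    out_ids = llama_model_input_ids(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])
    out_embeds = llama_model_input_embeds(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])

# The input_ids path pools the last non-pad position, while the inputs_embeds path
# falls back to the last position (a pad token for the shorter sequence), so the
# logits for the shorter sequence can differ between the two calls.
print(out_ids.logits)
print(out_embeds.logits)
```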
Expected behavior
https://github.com/huggingface/transformers/blob/3f06f95ebe617b192251ef756518690f5bc7ff76/src/transformers/models/llama/modeling_llama.py#L1314

`sequence_lengths` is computed only from `input_ids`; when `inputs_embeds` is used instead, it falls back to the default of -1. However, the forward method of `LlamaModel` does not accept both `input_ids` and `inputs_embeds` at the same time.
When I tried to use `inputs_embeds`, I passed both `input_ids` and `inputs_embeds` and set
```python
transformer_outputs = self.model(
    None,  # input_ids
    attention_mask=attention_mask,
    position_ids=position_ids,
    past_key_values=past_key_values,
    inputs_embeds=inputs_embeds,
    use_cache=use_cache,
    output_attentions=output_attentions,
    output_hidden_states=output_hidden_states,
    return_dict=return_dict,
)
```
in the forward method of `LlamaForSequenceClassification`.
input_embeds not checking pad token
```python
if self.config.pad_token_id is None:
    sequence_lengths = -1
else:
    if input_ids is not None:
        # if no pad token found, use modulo instead of reverse indexing for ONNX compatibility
        sequence_lengths = torch.eq(input_ids, self.config.pad_token_id).int().argmax(-1) - 1
        sequence_lengths = sequence_lengths % input_ids.shape[-1]
        sequence_lengths = sequence_lengths.to(logits.device)
    else:
        sequence_lengths = -1
```
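As a small worked example of what this snippet computes (the tensor values are made up, assuming `pad_token_id = 0` and right padding):

```python
import torch

pad_token_id = 0
input_ids = torch.tensor([
    [5, 6, 7, 0, 0],  # 3 real tokens -> first pad at index 3 -> last real index 2
    [5, 6, 7, 8, 9],  # no padding    -> argmax returns 0 -> -1, which the modulo maps to 4
])

sequence_lengths = torch.eq(input_ids, pad_token_id).int().argmax(-1) - 1
sequence_lengths = sequence_lengths % input_ids.shape[-1]
print(sequence_lengths)  # tensor([2, 4])
```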
According to the docstring in Llama's modeling file, checking whether `inputs_embeds` contains pad token embeddings is not implemented, because the pad token's embedding is unknown at that point:

> Since it cannot guess the padding tokens when `inputs_embeds` are passed instead of `input_ids`, it does the same (take the last value in each row of the batch).

However, I was wondering if it's possible to compare the provided `inputs_embeds` against the embedding of the pad token (retrieved via `pad_token_id`), rather than relying on simply taking the last value in each row. This would allow the model to explicitly identify pad token embeddings even when `inputs_embeds` are used.
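A rough sketch of what such a comparison could look like. The helper below is hypothetical (it is not part of transformers) and assumes the embeddings were produced unmodified by `model.model.embed_tokens`, so an exact or near-exact match against the pad embedding identifies padding positions:

```python
import torch

def sequence_lengths_from_embeds(model, inputs_embeds, pad_token_id):
    # embedding vector of the pad token, shape (hidden_size,)
    pad_embed = model.model.embed_tokens.weight[pad_token_id]
    # (batch, seq_len) boolean mask: True where a position equals the pad embedding
    is_pad = torch.isclose(inputs_embeds, pad_embed, atol=1e-6).all(dim=-1)
    # mirror the input_ids logic: first pad position minus one, modulo for rows without padding
    lengths = is_pad.int().argmax(dim=-1) - 1
    lengths = lengths % inputs_embeds.shape[1]
    return lengths
```

One caveat: this breaks down as soon as the embeddings are modified (e.g. soft prompts or added noise), which is presumably part of why the current implementation does not attempt it.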
cc @ArthurZucker maybe
Hey! I think you are missing:

```diff
 llama_model = LlamaForSequenceClassification.from_pretrained(
     "../Meta-Llama-3.2-1B-Instruct",
     num_labels=1,
     torch_dtype=torch.bfloat16,
+    pad_token_id=12000,
 )
```

with 12000 being the id (I invented it), no?
Thanks, but I've used

```python
llama_tokenizer = AutoTokenizer.from_pretrained("../Meta-Llama-3.2-1B-Instruct", padding_side="right")
llama_tokenizer.pad_token = "<|finetune_right_pad_id|>"
llama_model = LlamaForSequenceClassification.from_pretrained(
    "../Meta-Llama-3.2-1B-Instruct",
    num_labels=1,
    torch_dtype=torch.bfloat16,
)
llama_model.config.pad_token_id = llama_tokenizer.pad_token_id
```

which I thought was the same as setting `pad_token_id` directly in the `from_pretrained()` call.
But the embedding matrix is initialized before you set `config.pad_token_id`.
Thank you so much, this really solved the problem with my code!
Cool! Closing as fixed then! 🤗
Hi Arthur @ArthurZucker, do you think it is necessary to implement a check for the pad token's embedding when `inputs_embeds` is used as input and `pad_token_id` is set?
I'll take a look into it if it is worth doing.
Not sure it's worth it! 🤗