Batch Decoding in GPT2 with variable length sequences
System Info
- `transformers` version: 4.25.1
- Platform: Linux-4.18.0-372.26.1.el8_6.x86_64-x86_64-with-glibc2.17
- Python version: 3.8.11
- Huggingface_hub version: 0.11.1
- PyTorch version (GPU?): 1.12.1 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Who can help?
@younesbelkada
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
Hi, I am trying to do batch decoding with GPT2. Each batch may contain sequences of different lengths. I did try specifying left padding and explicitly setting the pad_token in GPT2.
Steps to reproduce the error
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

tokenizer = AutoTokenizer.from_pretrained('gpt2')
# run this only for gpt-2 as we do not have a pad token in gpt2
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = 'left'

model = AutoModelForCausalLM.from_pretrained('gpt2', pad_token_id=tokenizer.eos_token_id)
model.to(device)

sentence = "I went to the"
results = tokenizer(
    [sentence],
    add_special_tokens=True,
    truncation=True,
    padding=True,
    return_tensors='pt',
)

print("========= With No Padding ==========")
print("Tokenizing the input sentence \"{0}\" leads to ".format(sentence))
print(tokenizer.convert_ids_to_tokens(results['input_ids'][0]))
with torch.no_grad():
    logits = model(
        results['input_ids'].to(device),
        attention_mask=results['attention_mask'].to(device),
    ).logits[:, -1, :]
index = torch.argmax(logits).item()
print(sentence + " " + tokenizer.convert_ids_to_tokens(index))
print("\n" * 2)

max_length = 30
print("========= Using Padding of size {0} ==========".format(max_length))
results = tokenizer(
    [sentence],
    add_special_tokens=True,
    max_length=max_length,
    truncation=False,
    padding='max_length',
    return_tensors='pt',
)
print("Tokenizing the padded input sentence \"{0}\" leads to ".format(sentence))
print(tokenizer.convert_ids_to_tokens(results['input_ids'][0]))
with torch.no_grad():
    logits = model(
        results['input_ids'].to(device),
        attention_mask=results['attention_mask'].to(device),
    ).logits[:, -1, :]
index = torch.argmax(logits).item()
print(sentence + " " + tokenizer.convert_ids_to_tokens(index))
print("\n" * 2)
```
Output
```
========= With No Padding ==========
Tokenizing the input sentence "I went to the" leads to
['I', 'Ġwent', 'Ġto', 'Ġthe']
I went to the Ġhospital

========= Using Padding of size 30 ==========
Tokenizing the padded input sentence "I went to the" leads to
['<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', 'I', 'Ġwent', 'Ġto', 'Ġthe']
I went to the Ġthe
```
Explicitly specifying the position ids takes care of the above problem:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

tokenizer = AutoTokenizer.from_pretrained('gpt2')
# run this only for gpt-2 as we do not have a pad token in gpt2
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = 'left'

model = AutoModelForCausalLM.from_pretrained('gpt2', pad_token_id=tokenizer.eos_token_id)
model.to(device)

sentence = "I went to the"
results = tokenizer(
    [sentence],
    add_special_tokens=True,
    truncation=True,
    padding=True,
    return_tensors='pt',
)

position_ids = torch.zeros(results['attention_mask'].size(), dtype=torch.int32)
starting_index = 0
for index in range(results['attention_mask'][0].size(0)):
    if results['attention_mask'][0][index] == 1:
        position_ids[0][index] = starting_index
        starting_index += 1

print("========= With No Padding ==========")
print("Tokenizing the input sentence \"{0}\" leads to ".format(sentence))
print(tokenizer.convert_ids_to_tokens(results['input_ids'][0]))
with torch.no_grad():
    logits = model(
        results['input_ids'].to(device),
        attention_mask=results['attention_mask'].to(device),
        position_ids=position_ids.to(device),
    ).logits[:, -1, :]
index = torch.argmax(logits).item()
print(sentence + " " + tokenizer.convert_ids_to_tokens(index))
print("\n" * 2)

max_length = 30
print("========= Using Padding of size {0} ==========".format(max_length))
results = tokenizer(
    [sentence],
    add_special_tokens=True,
    max_length=max_length,
    truncation=False,
    padding='max_length',
    return_tensors='pt',
)
print("Tokenizing the padded input sentence \"{0}\" leads to ".format(sentence))
print(tokenizer.convert_ids_to_tokens(results['input_ids'][0]))

position_ids = torch.zeros(results['attention_mask'].size(), dtype=torch.int32)
starting_index = 0
for index in range(results['attention_mask'][0].size(0)):
    if results['attention_mask'][0][index] == 1:
        position_ids[0][index] = starting_index
        starting_index += 1

with torch.no_grad():
    logits = model(
        results['input_ids'].to(device),
        attention_mask=results['attention_mask'].to(device),
        position_ids=position_ids.to(device),
    ).logits[:, -1, :]
index = torch.argmax(logits).item()
print(sentence + " " + tokenizer.convert_ids_to_tokens(index))
print("\n" * 2)
```
The output when the position ids are explicitly specified:
```
========= With No Padding ==========
Tokenizing the input sentence "I went to the" leads to
['I', 'Ġwent', 'Ġto', 'Ġthe']
I went to the Ġhospital

========= Using Padding of size 30 ==========
Tokenizing the padded input sentence "I went to the" leads to
['<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', 'I', 'Ġwent', 'Ġto', 'Ġthe']
I went to the Ġhospital
```
Is it possible to have documentation mentioning this?
Expected behavior
In both scenarios, with and without left padding, the model should generate Ġhospital as the token with the highest probability. However, without modifying the position ids, we get Ġthe as the next token with the highest probability when the input is padded.
cc @ArthurZucker
@younesbelkada related issue that we had closed before: https://github.com/huggingface/transformers/issues/18809
Before diving a bit deeper, I don't really understand why you are using `convert_ids_to_tokens` instead of just using the `tokenizer.batch_decode` method? Did you try with it?
> Before diving a bit deeper, I don't really understand why you are using `convert_ids_to_tokens` instead of just using the `tokenizer.batch_decode` method? Did you try with it?
Hi @ArthurZucker, the issue is not with `convert_ids_to_tokens`. If we replace `convert_ids_to_tokens` with `tokenizer.batch_decode` we still get the same issue.
The issue is that the GPT2 model adds position embeddings to every token in the input sequence, including pad tokens.
Consider the input `I went to the`. With a batch size of 1 and no padding, the position id of the word `I` is 0. However, if I specify `max_length=5` in the tokenizer, the tokenizer prepends the input with one pad token, so the position id of the word `I` becomes 1. This changes the model's prediction.
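To make the shift concrete, here is a minimal sketch (assuming the same `gpt2` checkpoint and left padding as above) comparing the default positions with positions derived from the attention mask:

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'left'

batch = tokenizer(["I went to the"], padding='max_length', max_length=5, return_tensors='pt')
mask = batch['attention_mask']                       # tensor([[0, 1, 1, 1, 1]])

# Default behaviour of the forward pass: positions 0..4, so 'I' lands on position 1.
default_position_ids = torch.arange(mask.size(-1)).unsqueeze(0)

# Mask-aware positions: 'I' stays on position 0, exactly as in the unpadded case.
position_ids = mask.long().cumsum(-1) - 1
position_ids.masked_fill_(mask == 0, 0)

print(default_position_ids)  # tensor([[0, 1, 2, 3, 4]])
print(position_ids)          # tensor([[0, 0, 1, 2, 3]])
```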
There does indeed seem to be a bug! When I use the `generate()` function, I get the correct output:
```python
>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained('gpt2')
>>> tokenizer.pad_token = tokenizer.eos_token
>>> tokenizer.pad_token_id = tokenizer.eos_token_id
>>> tokenizer.padding_side = 'left'
>>> model = AutoModelForCausalLM.from_pretrained('gpt2', pad_token_id=tokenizer.eos_token_id)
>>> prompt_text = ['I went to the', 'we are trying to', 'The purpose of this workshop is to check whether we can']
>>> encodings_dict = tokenizer.batch_encode_plus(prompt_text, max_length=12, pad_to_max_length=True, return_tensors="pt")
>>> input_ids = torch.tensor(encodings_dict['input_ids'])
>>> attn_mask = torch.tensor(encodings_dict['attention_mask'])
>>> tokenizer.batch_decode(model.generate(input_ids, attention_mask=attn_mask, max_length=12))
['<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>I went to the hospital',
 '<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>we are trying to get',
 '<|endoftext|>The purpose of this workshop is to check whether we can make']
```
The issue lies in the fact that we have to pass the position ids for GPT2 ourselves. In the `generate()` function, the position ids are created on the fly if not passed, which is why we get the correct output there:
```python
if attention_mask is not None and position_ids is None:
    # create position_ids on the fly for batch generation
    position_ids = attention_mask.long().cumsum(-1) - 1
    position_ids.masked_fill_(attention_mask == 0, 1)
    if past:
        position_ids = position_ids[:, -1].unsqueeze(-1)
```
cc @LysandreJik I am guessing that the original implementation does not use this? Or is there a specific reason that we are using
```python
if position_ids is None:
    position_ids = torch.arange(past_length, input_shape[-1] + past_length, dtype=torch.long, device=device)
    position_ids = position_ids.unsqueeze(0).view(-1, input_shape[-1])
```
in the model's forward?
Thanks for the great issue @murthyrudra!
Hmmm indeed, it might be a bug dating back to the original implementation of gpt2 within transformers (this code dates back to Feb 2019). It's going to be a bit hard to change this within the code, but we can update the documentation/show pointers regarding how to circumvent this issue.
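For instance, a documentation-style workaround could look like the sketch below (mirroring the position-id logic that `generate()` uses internally; the prompts and variable names are just illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'left'
model = AutoModelForCausalLM.from_pretrained('gpt2', pad_token_id=tokenizer.eos_token_id)

prompts = ['I went to the', 'The purpose of this workshop is to check whether we can']
batch = tokenizer(prompts, padding=True, return_tensors='pt')

# Re-create the position ids the same way generate() does: positions restart at 0
# on the first non-pad token of each row, and pad positions get a dummy value.
position_ids = batch['attention_mask'].long().cumsum(-1) - 1
position_ids.masked_fill_(batch['attention_mask'] == 0, 1)

with torch.no_grad():
    next_token_logits = model(batch['input_ids'],
                              attention_mask=batch['attention_mask'],
                              position_ids=position_ids).logits[:, -1, :]

for prompt, token_id in zip(prompts, next_token_logits.argmax(-1).tolist()):
    print(prompt, '->', tokenizer.decode(token_id))
```

When only `generate()` is used, this step is not needed, since `generate()` already builds the position ids from the attention mask itself.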
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.