Batch Decoding in GPT2 with variable length sequences
System Info
- `transformers` version: 4.25.1
- Platform: Linux-4.18.0-372.26.1.el8_6.x86_64-x86_64-with-glibc2.17
- Python version: 3.8.11
- Huggingface_hub version: 0.11.1
- PyTorch version (GPU?): 1.12.1 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Who can help?
@younesbelkada
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
Hi, I am trying to do batch decoding with GPT2. Each batch may contain sequences of different lengths. I did try specifying left padding and explicitly setting the pad_token in GPT2.
Steps to reproduce the error
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

tokenizer = AutoTokenizer.from_pretrained('gpt2')
# run this only for gpt-2 as we do not have a pad token in gpt2
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = 'left'

model = AutoModelForCausalLM.from_pretrained('gpt2', pad_token_id=tokenizer.eos_token_id)
model.to(device)

sentence = "I went to the"
results = tokenizer(
    [sentence],
    add_special_tokens=True,
    truncation=True,
    padding=True,
    return_tensors='pt',
)

print("========= With No Padding ==========")
print("Tokenizing the input sentence \"{0}\" leads to ".format(sentence))
print(tokenizer.convert_ids_to_tokens(results['input_ids'][0]))
with torch.no_grad():
    logits = model(
        results['input_ids'].to(device),
        attention_mask=results['attention_mask'].to(device),
    ).logits[:, -1, :]
index = torch.argmax(logits).item()
print(sentence + " " + tokenizer.convert_ids_to_tokens(index))
print("\n" * 2)

max_length = 30
print("========= Using Padding of size {0} ==========".format(max_length))
results = tokenizer(
    [sentence],
    add_special_tokens=True,
    max_length=max_length,
    truncation=False,
    padding='max_length',
    return_tensors='pt',
)
print("Tokenizing the padded input sentence \"{0}\" leads to ".format(sentence))
print(tokenizer.convert_ids_to_tokens(results['input_ids'][0]))
with torch.no_grad():
    logits = model(
        results['input_ids'].to(device),
        attention_mask=results['attention_mask'].to(device),
    ).logits[:, -1, :]
index = torch.argmax(logits).item()
print(sentence + " " + tokenizer.convert_ids_to_tokens(index))
print("\n" * 2)
```
Output
```
========= With No Padding ==========
Tokenizing the input sentence "I went to the" leads to
['I', 'Ġwent', 'Ġto', 'Ġthe']
I went to the Ġhospital

========= Using Padding of size 30 ==========
Tokenizing the padded input sentence "I went to the" leads to
['<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', 'I', 'Ġwent', 'Ġto', 'Ġthe']
I went to the Ġthe
```
Explicitly specifying the position ids takes care of the above problem:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

tokenizer = AutoTokenizer.from_pretrained('gpt2')
# run this only for gpt-2 as we do not have a pad token in gpt2
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = 'left'

model = AutoModelForCausalLM.from_pretrained('gpt2', pad_token_id=tokenizer.eos_token_id)
model.to(device)

sentence = "I went to the"
results = tokenizer(
    [sentence],
    add_special_tokens=True,
    truncation=True,
    padding=True,
    return_tensors='pt',
)

position_ids = torch.zeros(results['attention_mask'].size(), dtype=torch.int32)
starting_index = 0
for index in range(results['attention_mask'][0].size(0)):
    if results['attention_mask'][0][index] == 1:
        position_ids[0][index] = starting_index
        starting_index += 1

print("========= With No Padding ==========")
print("Tokenizing the input sentence \"{0}\" leads to ".format(sentence))
print(tokenizer.convert_ids_to_tokens(results['input_ids'][0]))
with torch.no_grad():
    logits = model(
        results['input_ids'].to(device),
        attention_mask=results['attention_mask'].to(device),
        position_ids=position_ids.to(device),
    ).logits[:, -1, :]
index = torch.argmax(logits).item()
print(sentence + " " + tokenizer.convert_ids_to_tokens(index))
print("\n" * 2)

max_length = 30
print("========= Using Padding of size {0} ==========".format(max_length))
results = tokenizer(
    [sentence],
    add_special_tokens=True,
    max_length=max_length,
    truncation=False,
    padding='max_length',
    return_tensors='pt',
)
print("Tokenizing the padded input sentence \"{0}\" leads to ".format(sentence))
print(tokenizer.convert_ids_to_tokens(results['input_ids'][0]))

position_ids = torch.zeros(results['attention_mask'].size(), dtype=torch.int32)
starting_index = 0
for index in range(results['attention_mask'][0].size(0)):
    if results['attention_mask'][0][index] == 1:
        position_ids[0][index] = starting_index
        starting_index += 1

with torch.no_grad():
    logits = model(
        results['input_ids'].to(device),
        attention_mask=results['attention_mask'].to(device),
        position_ids=position_ids.to(device),
    ).logits[:, -1, :]
index = torch.argmax(logits).item()
print(sentence + " " + tokenizer.convert_ids_to_tokens(index))
print("\n" * 2)
```
The output when the position ids are explicitly specified:
```
========= With No Padding ==========
Tokenizing the input sentence "I went to the" leads to
['I', 'Ġwent', 'Ġto', 'Ġthe']
I went to the Ġhospital

========= Using Padding of size 30 ==========
Tokenizing the padded input sentence "I went to the" leads to
['<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', 'I', 'Ġwent', 'Ġto', 'Ġthe']
I went to the Ġhospital
```
Is it possible to have documentation mentioning this?
Expected behavior
In both scenarios, with and without left padding, the model should generate Ġhospital as the token with the highest probability. However, without modifying the position ids, we get Ġthe as the next token with the highest probability when the input is padded.
cc @ArthurZucker
@younesbelkada related issue that we had closed before: https://github.com/huggingface/transformers/issues/18809
Before diving a bit deeper, I don't really understand why you are using `convert_ids_to_tokens` instead of just using the `tokenizer.batch_decode` method? Did you try with it?
> Before diving a bit deeper, I don't really understand why you are using `convert_ids_to_tokens` instead of just using the `tokenizer.batch_decode` method? Did you try with it?
Hi @ArthurZucker, the issue is not with `convert_ids_to_tokens`. If we replace `convert_ids_to_tokens` with `tokenizer.batch_decode` we still get the same issue.
The issue is that the GPT2 model adds position embeddings to every token in the input sequence, including pad tokens.
Consider the input `I went to the`. With a batch size of 1 and no padding, the position id of the word `I` is 0. However, if I specify `max_length=5` in the tokenizer, the tokenizer prepends the input with one pad token, so the position id of the word `I` becomes 1. This changes the model's prediction.
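To make the shift concrete, here is a minimal sketch (assuming the same `gpt2` checkpoint and left padding as above) comparing the default positions with positions derived from the attention mask:

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'left'

batch = tokenizer(["I went to the"], padding='max_length', max_length=5, return_tensors='pt')
mask = batch['attention_mask']                       # tensor([[0, 1, 1, 1, 1]])

# Default behaviour of the forward pass: positions 0..4, so 'I' lands on position 1.
default_position_ids = torch.arange(mask.size(-1)).unsqueeze(0)

# Mask-aware positions: 'I' stays on position 0, exactly as in the unpadded case.
position_ids = mask.long().cumsum(-1) - 1
position_ids.masked_fill_(mask == 0, 0)

print(default_position_ids)  # tensor([[0, 1, 2, 3, 4]])
print(position_ids)          # tensor([[0, 0, 1, 2, 3]])
```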
There does indeed seem to be a bug! When I use the `generate()` function, I get the correct output:
```python
>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained('gpt2')
>>> tokenizer.pad_token = tokenizer.eos_token
>>> tokenizer.pad_token_id = tokenizer.eos_token_id
>>> tokenizer.padding_side = 'left'
>>> model = AutoModelForCausalLM.from_pretrained('gpt2', pad_token_id=tokenizer.eos_token_id)
>>> prompt_text = ['I went to the', 'we are trying to', 'The purpose of this workshop is to check whether we can']
>>> encodings_dict = tokenizer.batch_encode_plus(prompt_text, max_length=12, pad_to_max_length=True, return_tensors="pt")
>>> input_ids = torch.tensor(encodings_dict['input_ids'])
>>> attn_mask = torch.tensor(encodings_dict['attention_mask'])
>>> tokenizer.batch_decode(model.generate(input_ids, attention_mask=attn_mask, max_length=12))
['<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>I went to the hospital',
 '<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>we are trying to get',
 '<|endoftext|>The purpose of this workshop is to check whether we can make']
```
The issue lies in the fact that we have to pass the position ids for GPT2 ourselves. In the `generate()` function, the position ids are created on the fly if not passed, which is why we get the correct output there:
```python
if attention_mask is not None and position_ids is None:
    # create position_ids on the fly for batch generation
    position_ids = attention_mask.long().cumsum(-1) - 1
    position_ids.masked_fill_(attention_mask == 0, 1)
    if past:
        position_ids = position_ids[:, -1].unsqueeze(-1)
```
cc @LysandreJik I am guessing that the original implementation does not use this? Or is there a specific reason that we are using
```python
if position_ids is None:
    position_ids = torch.arange(past_length, input_shape[-1] + past_length, dtype=torch.long, device=device)
    position_ids = position_ids.unsqueeze(0).view(-1, input_shape[-1])
```
in the model's forward?
Thanks for the great issue @murthyrudra!
Hmmm indeed, it might be a bug dating back to the original implementation of gpt2 within transformers (this code dates back to Feb 2019). It's going to be a bit hard to change this within the code, but we can update the documentation/show pointers regarding how to circumvent this issue.
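For instance, a documentation-style workaround could look like the sketch below (mirroring the position-id logic that `generate()` uses internally; the prompts and variable names are just illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'left'
model = AutoModelForCausalLM.from_pretrained('gpt2', pad_token_id=tokenizer.eos_token_id)

prompts = ['I went to the', 'The purpose of this workshop is to check whether we can']
batch = tokenizer(prompts, padding=True, return_tensors='pt')

# Re-create the position ids the same way generate() does: positions restart at 0
# on the first non-pad token of each row, and pad positions get a dummy value.
position_ids = batch['attention_mask'].long().cumsum(-1) - 1
position_ids.masked_fill_(batch['attention_mask'] == 0, 1)

with torch.no_grad():
    next_token_logits = model(batch['input_ids'],
                              attention_mask=batch['attention_mask'],
                              position_ids=position_ids).logits[:, -1, :]

for prompt, token_id in zip(prompts, next_token_logits.argmax(-1).tolist()):
    print(prompt, '->', tokenizer.decode(token_id))
```

When only `generate()` is used, this step is not needed, since `generate()` already builds the position ids from the attention mask itself.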
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.