
Llama Tokenizer uses incorrect indices for PAD

michaelroyzen opened this issue 1 year ago

System Info

latest transformers main

Who can help?

@gante

Information

  • [ ] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

configuration_llama.py sets

pad_token_id=0,
bos_token_id=1,
eos_token_id=2

but this is wrong. After checking the original tokenizer from FB,

from sentencepiece import SentencePieceProcessor

sp_model = SentencePieceProcessor(model_file="/home/ubuntu/llama/tokenizer.model")

print("bos_id: ", sp_model.bos_id())
print("eos_id: ", sp_model.eos_id())
print("pad_id: ", sp_model.pad_id())

we see that

bos_id:  1
eos_id:  2
pad_id:  -1

Expected behavior

bos_id:  1
eos_id:  2
pad_id:  -1

instead of

pad_token_id=0,
bos_token_id=1,
eos_token_id=2

michaelroyzen avatar Apr 03 '23 00:04 michaelroyzen

cc @ArthurZucker -- is this fixed in #22402 ?

gante avatar Apr 03 '23 12:04 gante

This is probably not gonna be fixed with regard to configuration_llama. However, note that the sp_model returning -1 as the pad id means the pad token has no index at all: Llama does not use a padding token. The fix that we provide is that in tokenization_llama the pad_token is set to None.

The config should be fixed to ensure that pad_token_id=None rather than pad_token_id=0.
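A minimal sketch of the behavior after that fix (the checkpoint path below is a placeholder):

from transformers import LlamaTokenizer

# Placeholder path: substitute a local Llama checkpoint.
tokenizer = LlamaTokenizer.from_pretrained("/path/to/llama")

print(tokenizer.pad_token)     # None, since Llama defines no padding token
print(tokenizer.bos_token_id)  # 1
print(tokenizer.eos_token_id)  # 2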

ArthurZucker avatar Apr 03 '23 13:04 ArthurZucker

Thanks @gante @ArthurZucker

What about the mismatch between the eos and bos tokens? Or is HF's tokenizer zero-indexed while Meta's native tokenizer is one-indexed?

michaelroyzen avatar Apr 04 '23 22:04 michaelroyzen

As you said and showed using the sp_model,

bos_id:  1
eos_id:  2

If you instantiate the HF tokenizer, it has the same ids, so I am not sure I follow the problem. There is no mismatch between the two tokenizers.
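For example, a quick consistency check (both paths below are placeholders) would look like this:

from sentencepiece import SentencePieceProcessor
from transformers import LlamaTokenizer

# Placeholder paths: substitute local copies of the Llama tokenizer/weights.
sp_model = SentencePieceProcessor(model_file="/path/to/tokenizer.model")
hf_tokenizer = LlamaTokenizer.from_pretrained("/path/to/llama")

assert hf_tokenizer.bos_token_id == sp_model.bos_id()  # both 1
assert hf_tokenizer.eos_token_id == sp_model.eos_id()  # both 2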

ArthurZucker avatar Apr 05 '23 08:04 ArthurZucker

@ArthurZucker if I want to do batching, I have to manually add a pad_token. In that case, how do I ensure that the pad_token_id is actually correct? I.e., how do I get the tokenizer to use 0 as the pad token id instead of the 32000 I am getting now when calling add_special_tokens({'pad_token': '[PAD]'})?

vikigenius avatar Apr 28 '23 02:04 vikigenius

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar May 22 '23 15:05 github-actions[bot]

The problem is that id 0 is already the unk token, and calling add_special_tokens({'pad_token': '[PAD]'}) appends a new token after the 32000-entry vocabulary, which is why you see id 32000. The easiest way is to set the pad token to the unk token.
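A minimal sketch of that approach (the checkpoint path below is a placeholder):

from transformers import LlamaTokenizer

# Placeholder path: substitute a local Llama checkpoint.
tokenizer = LlamaTokenizer.from_pretrained("/path/to/llama")

# Reuse the existing unk token (id 0) as the pad token; this enables batching
# without growing the vocabulary or resizing the model's embedding matrix.
tokenizer.pad_token = tokenizer.unk_token
print(tokenizer.pad_token_id)  # 0

batch = tokenizer(["short", "a somewhat longer sentence"], padding=True)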

ArthurZucker avatar May 24 '23 11:05 ArthurZucker