Llama Tokenizer uses incorrect indices for PAD
System Info
latest `transformers` main
Who can help?
@gante
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
`configuration_llama.py` sets

```python
pad_token_id=0,
bos_token_id=1,
eos_token_id=2,
```
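A quick way to see these defaults is to instantiate a bare config (the printed values reflect the defaults as described in this issue):

```python
from transformers import LlamaConfig

config = LlamaConfig()
# At the time of this issue these defaults were 0, 1, 2.
print(config.pad_token_id, config.bos_token_id, config.eos_token_id)
```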
but this is wrong. Checking the original tokenizer from FB:

```python
from sentencepiece import SentencePieceProcessor

sp_model = SentencePieceProcessor(model_file="/home/ubuntu/llama/tokenizer.model")
print("bos_id: ", sp_model.bos_id())
print("eos_id: ", sp_model.eos_id())
print("pad_id: ", sp_model.pad_id())
```
we see that

```
bos_id: 1
eos_id: 2
pad_id: -1
```
Expected behavior
```
bos_id: 1
eos_id: 2
pad_id: -1
```

instead of

```python
pad_token_id=0,
bos_token_id=1,
eos_token_id=2,
```
cc @ArthurZucker -- is this fixed in #22402?
This is probably not going to be fixed in `configuration_llama`. However, note that the `sp_model` returning `-1` as the pad id means the padding token has no index at all: Llama does not use a padding token. The fix we provide is that in `tokenization_llama` the `pad_token` is set to `None`. The config should be fixed to ensure `pad_token=None` rather than `pad_token=0`.
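A quick way to confirm the fix on the tokenizer side, assuming a locally converted checkpoint (the path below is a placeholder):

```python
from transformers import LlamaTokenizer

# Placeholder path to a locally converted Llama checkpoint.
tokenizer = LlamaTokenizer.from_pretrained("path/to/converted-llama")
print(tokenizer.pad_token)     # None - Llama ships without a padding token
print(tokenizer.pad_token_id)  # None
```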
Thanks @gante @ArthurZucker
What about the mismatch between the eos and bos tokens? Or is HF's tokenizer zero-indexed while Meta's native tokenizer is one-indexed?
As you said and showed using the `sp_model`:

```
bos_id: 1
eos_id: 2
```

If you instantiate the tokenizer, it has the same ids, so I am not sure I follow the problem?
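For instance, a sanity check along these lines (both paths are placeholders) shows the ids line up between the sentencepiece model and the HF tokenizer:

```python
from sentencepiece import SentencePieceProcessor
from transformers import LlamaTokenizer

# Placeholder paths to the original tokenizer.model and a converted checkpoint.
sp_model = SentencePieceProcessor(model_file="path/to/tokenizer.model")
tokenizer = LlamaTokenizer.from_pretrained("path/to/converted-llama")

assert tokenizer.bos_token_id == sp_model.bos_id() == 1
assert tokenizer.eos_token_id == sp_model.eos_id() == 2
```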
@ArthurZucker if I want to do batching, then I have to manually add a pad token. In that case, how do I ensure that the `pad_token_id` is actually correct? I.e., how do I get the tokenizer to set the pad token to 0 instead of the 32000 I am getting now when using `add_special_tokens`, like `add_special_tokens({'pad_token': '[PAD]'})`?
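For reference, what the question describes looks roughly like this (paths are placeholders). A freshly added `[PAD]` token is appended to the end of the 32000-token vocabulary, which is why its id comes back as 32000, and the model's embedding matrix then has to be resized to match:

```python
from transformers import LlamaForCausalLM, LlamaTokenizer

# Placeholder path to a locally converted checkpoint.
tokenizer = LlamaTokenizer.from_pretrained("path/to/converted-llama")
model = LlamaForCausalLM.from_pretrained("path/to/converted-llama")

# '[PAD]' is not in the original vocab, so it is appended after index 31999.
tokenizer.add_special_tokens({"pad_token": "[PAD]"})
print(tokenizer.pad_token_id)  # 32000

# The embedding table must grow to cover the new id.
model.resize_token_embeddings(len(tokenizer))
```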
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
The problem is that `0` is already the `unk` token. The easiest way is to set the pad token to the unk token.
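A minimal sketch of that suggestion, reusing the existing `<unk>` id so no new embeddings are needed (the path is a placeholder):

```python
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("path/to/converted-llama")  # placeholder path
tokenizer.pad_token = tokenizer.unk_token  # reuse '<unk>' (id 0) as the pad token
print(tokenizer.pad_token_id)  # 0
```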