`end_of_word_suffix` is ignored
The `end_of_word_suffix` parameter seems to be ignored by `BPE`.
Because of this, decoding does not put whitespace between words.
Reproducer:
```python
from tokenizers import Tokenizer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.decoders import BPEDecoder

tokenizer = Tokenizer(BPE(end_of_word_suffix="</w>"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=100)
tokenizer.train_from_iterator(['Wel come to the 🤗 Tok en izers libr ary.'], trainer)

output = tokenizer.encode("Welcome to the 🤗 Tokenizers library.")
print(output.tokens)
# Result:   ['Wel', 'come', 'to', 'the', '🤗', 'Tok', 'en', 'izers', 'libr', 'ary', '.']
# Expected: ['Wel', 'come</w>', 'to</w>', 'the</w>', '🤗</w>', 'Tok', 'en', 'izers</w>', 'libr', 'ary</w>', '.</w>']

print(tokenizer.decode(output.ids))
# Wel come to the 🤗 Tok en izers libr ary .

tokenizer.decoder = BPEDecoder(suffix="</w>")
print(tokenizer.decode(output.ids))
# Result:   Welcometothe🤗Tokenizerslibrary.
# Expected: Welcome to the 🤗 Tokenizers library.

print(tokenizer.model.end_of_word_suffix)
# Result:   None
# Expected: </w>
```
You should specify `end_of_word_suffix` in the `BpeTrainer` rather than in `BPE`.
For example, this works as you expect if you use `trainer = BpeTrainer(vocab_size=100, end_of_word_suffix="</w>")`, and it even works if you have `tokenizer = Tokenizer(BPE())`.
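For reference, here is a minimal sketch of the fixed reproducer (same toy corpus and vocab size as above; the only change is that the suffix moves from the model to the trainer):

```python
from tokenizers import Tokenizer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.decoders import BPEDecoder

tokenizer = Tokenizer(BPE())  # no suffix on the model
tokenizer.pre_tokenizer = Whitespace()

# Pass end_of_word_suffix to the trainer instead.
trainer = BpeTrainer(vocab_size=100, end_of_word_suffix="</w>")
tokenizer.train_from_iterator(['Wel come to the 🤗 Tok en izers libr ary.'], trainer)
tokenizer.decoder = BPEDecoder(suffix="</w>")

output = tokenizer.encode("Welcome to the 🤗 Tokenizers library.")
print(tokenizer.decode(output.ids))
# Expected: Welcome to the 🤗 Tokenizers library.
```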
Note: as a user, I also found this confusing in the past. `BPE` inherits this setting from the `BpeTrainer`, and the trainer can even overwrite a value set directly on the model. For example, if you have:

```python
tokenizer = Tokenizer(BPE(end_of_word_suffix="!"))
trainer = BpeTrainer(vocab_size=100, end_of_word_suffix="</w>")
```

training will use `</w>`.
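You can see the overwrite directly by training with the two conflicting settings and inspecting the model afterwards; a minimal sketch, reusing the toy corpus from the reproducer above:

```python
from tokenizers import Tokenizer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# The model asks for "!", the trainer asks for "</w>".
tokenizer = Tokenizer(BPE(end_of_word_suffix="!"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=100, end_of_word_suffix="</w>")
tokenizer.train_from_iterator(['Wel come to the 🤗 Tok en izers libr ary.'], trainer)

print(tokenizer.model.end_of_word_suffix)
# </w>  (the trainer's value wins)
```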
I do not have a good use case for specifying `end_of_word_suffix` on `BPE` alone (without a trainer). The closest case I can think of is importing a tokenizer from a config file, but that file would include the `end_of_word_suffix` information anyway.
It seems like there should at least be a warning when a `BpeTrainer` overwrites this parameter or `continuing_subword_prefix`.
Happy to review a PR to make this more understandable! Thanks @mcognetta for the answer!
I'll prepare a PR, but I'm curious whether you know of any places where `end_of_word_suffix`/`continuing_subword_prefix` is used outside of the trainer interfaces? I don't recall any, but the last time I dug into it was ~2 years ago, haha.