Introducing special tokens via `tokenizers.normalizers.Replace`
@Narsil, this is based on my comment on https://github.com/huggingface/tokenizers/issues/985.
I'd like to introduce a special token `<digits>` into my vocabulary and replace any digits in the input (i.e. anything matching the regex `\d+`) with the token `<digits>`. The problem I ran into was that the model kept splitting the text into `<`, `digits`, `>`.
After a bit more fiddling, the model below appears to be working, but it fails as soon as you uncomment the ByteLevel pre-tokenization and decoding. Since I've already stripped out all punctuation and case-folded (and my input is mostly ASCII, with the odd erroneous character that should be removed anyway), this isn't a problem for me. But I'm wondering why ByteLevel is causing an issue, and whether what I've done will always work or only works by chance.
```python
import string

import tokenizers

# Special tokens; <digits> will stand in for any run of digits.
BOS = "<s>"
EOS = "</s>"
UNK = "<unk>"
PAD = "<pad>"
MASK = "<mask>"
DIGITS = "<digits>"
special_tokens = [BOS, PAD, EOS, UNK, MASK, DIGITS]

tokenizer = tokenizers.Tokenizer(tokenizers.models.BPE(unk_token=UNK))

# Normalize: strip punctuation, collapse digit runs to <digits>,
# split camelCase on the case boundary, then lowercase.
tokenizer.normalizer = tokenizers.normalizers.Sequence([
    tokenizers.normalizers.Nmt(),
    tokenizers.normalizers.Replace(tokenizers.Regex(r"\p{Punct}"), " "),
    tokenizers.normalizers.Replace(tokenizers.Regex(r"\d+"), DIGITS),
    tokenizers.normalizers.Replace(
        tokenizers.Regex(r"(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])"), " "
    ),
    tokenizers.normalizers.Lowercase(),
])

tokenizer.pre_tokenizer = tokenizers.pre_tokenizers.Sequence([
    tokenizers.pre_tokenizers.WhitespaceSplit(),
    # tokenizers.pre_tokenizers.ByteLevel(add_prefix_space=False),
])
# tokenizer.decoder = tokenizers.decoders.ByteLevel()

trainer = tokenizers.trainers.BpeTrainer(
    vocab_size=1000,
    special_tokens=[UNK, DIGITS],
    initial_alphabet=list(string.ascii_lowercase),
)

tokenizer.train_from_iterator(["EX-01"] * 100, trainer=trainer)
```
If ByteLevel is used, the result of `tokenizer.encode("EX-01").tokens` is `['ex', 'Ġ<', 'digits', '>']` (`[32, 35, 37, 3]`); without it I get the desired result of `['ex', '<digits>']`.
I think the issue is that the ByteLevel docstring states "This pre-tokenizer takes care of replacing all bytes of the given string with a corresponding representation, as well as splitting into words". I think it would be better if all it did was replace bytes and left the splitting to another pre_tokenizer step.
Also, I've noticed inconsistencies in the results: `tokenizer.encode("EX-01")` is correctly tokenized but `tokenizer.encode("EX01")` is not. I can probably deal with this, but it's not ideal.
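For reference, the comparison I'm looking at is just this (using the tokenizer trained above; I haven't pinned down why the outputs differ):

```python
# "EX-01" normalizes to "ex <digits>", so WhitespaceSplit isolates the
# special token; "EX01" normalizes to "ex<digits>" as a single pre-token.
print(tokenizer.encode("EX-01").tokens)  # the expected ['ex', '<digits>']
print(tokenizer.encode("EX01").tokens)   # not the expected ['ex', '<digits>']
```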
Finally, it looks like if you use Replace and the replacement isn't the same length as the match, the offsets aren't aligned with the original text (i.e. https://github.com/huggingface/tokenizers/issues/892), so my 3rd and 4th normalisation steps are problematic. This is a problem in NER, for example, where you need to return the covered text. I'll comment on 892 once I've read it further.
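For illustration, this is roughly how I'm checking the alignment (a small sketch; the point is that the offsets should index back into the raw string):

```python
# Print each token's offsets and the slice of the *original* text they
# claim to cover; after a length-changing Replace these can drift.
text = "EX-01"
enc = tokenizer.encode(text)
for token, (start, end) in zip(enc.tokens, enc.offsets):
    print(token, (start, end), repr(text[start:end]))
```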
> "…as well as splitting into words". I think it would be better if all it did was replace bytes and left the splitting to another pre_tokenizer step.
We're in luck, we've added that option: https://huggingface.co/docs/tokenizers/main/en/api/pre-tokenizers#tokenizers.pre_tokenizers.ByteLevel (`use_regex=False`). For context, ByteLevel comes from GPT-2, where the regex was also used, so it lingered here for a long time. But the Bloom project needed ByteLevel without the regex, which led to this improvement. Unfortunately we cannot change the default yet because of backward compatibility.
Using `use_regex=False` together with any other pre_tokenizer should work as you expect. Can you confirm?
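Something like this (an untested sketch of your pipeline with the new flag; everything else stays as you had it):

```python
# With use_regex=False, ByteLevel should only remap bytes; all splitting
# is left to WhitespaceSplit, so "<digits>" should stay intact.
tokenizer.pre_tokenizer = tokenizers.pre_tokenizers.Sequence([
    tokenizers.pre_tokenizers.WhitespaceSplit(),
    tokenizers.pre_tokenizers.ByteLevel(add_prefix_space=False, use_regex=False),
])
tokenizer.decoder = tokenizers.decoders.ByteLevel()
```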