tokenizers
tokenizers copied to clipboard
Support for regex in Replace normalization function
Hi,
I was wondering whether there exists regex support for the Replace Normalizer. It seems that Replace just replaces the strings verbatim right now.
Thanks! Stéphan
Hi, yes it does:
from tokenizer import Tokenizer
from tokenizers.normalizers import Replace
from tokenizers import Regex
from tokenizers.models import BPE
my_tokenizer = Tokenizer(BPE())
my_tokenizer.normalizer = Replace(Regex('[0-9]+'), '[NUM]')
Ok, that's great!
Thanks for the quick reply. Is this mentioned in the docs somewhere?
It is here : https://huggingface.co/docs/tokenizers/components#components
However there doesn't seem to be nice code examples for this in the docs. PR are welcome :)