tokenizers icon indicating copy to clipboard operation
tokenizers copied to clipboard

Support for regex in Replace normalization function

Open stephantul opened this issue 3 years ago • 3 comments

Hi,

I was wondering whether there exists regex support for the Replace Normalizer. It seems that Replace just replaces the strings verbatim right now.

Thanks! Stéphan

stephantul avatar Jul 27 '22 10:07 stephantul

Hi, yes it does:

from tokenizer import Tokenizer
from tokenizers.normalizers import Replace
from tokenizers import Regex
from tokenizers.models import BPE

my_tokenizer = Tokenizer(BPE())
my_tokenizer.normalizer = Replace(Regex('[0-9]+'), '[NUM]')

sadra-barikbin avatar Jul 27 '22 11:07 sadra-barikbin

Ok, that's great!

Thanks for the quick reply. Is this mentioned in the docs somewhere?

stephantul avatar Jul 27 '22 12:07 stephantul

It is here : https://huggingface.co/docs/tokenizers/components#components

However there doesn't seem to be nice code examples for this in the docs. PR are welcome :)

Narsil avatar Jul 27 '22 15:07 Narsil