feat: support custom regexes for GPT pre-tokenizer

Open gcampax opened this issue 1 year ago • 5 comments

This is needed to properly support the GPT-3.5/GPT-4 models, which changed the pre-tokenization regex compared to GPT-2.

Existing tokenizer files are not affected. New tokenizer files can be created that copy the new regex from the tiktoken sources.
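For context, the cl100k_base split pattern that GPT-3.5/GPT-4 use in tiktoken looks like the sketch below. It is reproduced here from memory for illustration only; verify it against the tiktoken sources before copying it into a tokenizer file.

```python
# cl100k_base split pattern as published in the tiktoken sources (reproduced from
# memory for illustration -- double-check against upstream before embedding it in
# a tokenizer.json). Compare with the GPT-2 pattern hardcoded in ByteLevel.
GPT4_SPLIT_PATTERN = (
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)"    # contractions, matched case-insensitively
    r"|[^\r\n\p{L}\p{N}]?\p{L}+"       # words, optionally preceded by one non-letter/digit
    r"|\p{N}{1,3}"                     # digit runs split into groups of at most 3
    r"| ?[^\s\p{L}\p{N}]+[\r\n]*"      # punctuation runs, optional leading space
    r"|\s*[\r\n]+"                     # newline runs
    r"|\s+(?!\S)"                      # trailing whitespace
    r"|\s+"                            # any other whitespace
)
```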

I'm new to Rust, so the code probably doesn't look very idiomatic. Happy to adjust it.

gcampax avatar Feb 28 '24 17:02 gcampax

There's no need for that; just set `use_regex: false` and apply the regex externally (as another pre-tokenizer, for instance) if you want.

Narsil avatar Mar 11 '24 10:03 Narsil

I'm sorry, but this is not correct. `use_regex: false` does not apply in this context: the regular expression still needs to be applied. Applying the regex externally doesn't work either: token splitting by regex happens inside the pre-tokenizer, and I don't see how you would apply it outside of it. It's also quite inconvenient to need special-case external logic for the OpenAI tokenizers instead of being able to specify a JSON file that just works.

The PR is not that big, I would kindly ask you to reconsider.

gcampax avatar Mar 11 '24 10:03 gcampax

Hey! I am down to reconsider. I think what Narsil meant is that you can have a `Sequence` of pre-tokenizers, to first do the regex split and then apply `ByteLevel` (this would be the "external" part). But to be honest, since there is already a regex inside, why waste it!

ArthurZucker avatar Jun 11 '24 12:06 ArthurZucker
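For reference, here is a minimal sketch of the `Sequence`-based workaround described above, using the tokenizers Python bindings. The split pattern is the cl100k_base one quoted earlier, and the `behavior="isolated"` and `add_prefix_space=False` choices are assumptions about what a tiktoken-style setup wants, not something stated in this thread.

```python
# Sketch of the workaround: run the custom split regex first, then ByteLevel with
# its built-in GPT-2 regex disabled. Not the PR's implementation, just one way to
# express the same splitting with existing building blocks.
from tokenizers import Regex, Tokenizer, pre_tokenizers
from tokenizers.models import BPE

# Same cl100k_base pattern as quoted above (verify against the tiktoken sources).
GPT4_SPLIT_PATTERN = (
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}"
    r"| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
)

pre_tok = pre_tokenizers.Sequence([
    # Keep each regex match as its own piece ("isolated"), mirroring how tiktoken
    # feeds matches to BPE.
    pre_tokenizers.Split(Regex(GPT4_SPLIT_PATTERN), behavior="isolated"),
    # Byte-level mapping only; the internal GPT-2 regex is turned off.
    pre_tokenizers.ByteLevel(add_prefix_space=False, use_regex=False),
])

# Quick check of the pre-tokenization on its own:
print(pre_tok.pre_tokenize_str("Hello world, it's 12345!"))

# Attaching it to a tokenizer (vocab/merges omitted here):
tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = pre_tok
```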

@gcampax do you want to rebase and I'll review?

ArthurZucker avatar Jun 11 '24 12:06 ArthurZucker

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.