feat: support custom regexes for GPT pre-tokenizer

Open gcampax opened this issue 1 year ago • 5 comments

This is needed to properly support the GPT-3.5/GPT-4 models, which changed the pre-tokenization regex compared to GPT-2.

Existing tokenizer files are not affected. New tokenizer files can be created that copy the new regex from the tiktoken sources.
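For context, the cl100k_base split pattern that GPT-3.5/GPT-4 use in tiktoken looks like the sketch below. It is reproduced here from memory for illustration only; verify it against the tiktoken sources before copying it into a tokenizer file.

```python
# cl100k_base split pattern as published in the tiktoken sources (reproduced from
# memory for illustration -- double-check against upstream before embedding it in
# a tokenizer.json). Compare with the GPT-2 pattern hardcoded in ByteLevel.
GPT4_SPLIT_PATTERN = (
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)"    # contractions, matched case-insensitively
    r"|[^\r\n\p{L}\p{N}]?\p{L}+"       # words, optionally preceded by one non-letter/digit
    r"|\p{N}{1,3}"                     # digit runs split into groups of at most 3
    r"| ?[^\s\p{L}\p{N}]+[\r\n]*"      # punctuation runs, optional leading space
    r"|\s*[\r\n]+"                     # newline runs
    r"|\s+(?!\S)"                      # trailing whitespace
    r"|\s+"                            # any other whitespace
)
```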

I'm new to Rust, so the code probably doesn't look very idiomatic. Happy to adjust it.

gcampax avatar Feb 28 '24 17:02 gcampax

There's no need for that; just set `use_regex: false` and apply the regex externally (as another pre-tokenizer, for instance) if you want.

Narsil avatar Mar 11 '24 10:03 Narsil

I'm sorry, but this is not correct. `use_regex: false` does not apply in this context: the regular expression still needs to be applied. Applying the regex externally doesn't work either: token splitting by regex happens inside the pre-tokenizer, and I don't see how you would apply it outside of it. It's also quite inconvenient to need special-case external logic for the OpenAI tokenizers instead of being able to specify a JSON file that just works.

The PR is not that big, I would kindly ask you to reconsider.

gcampax avatar Mar 11 '24 10:03 gcampax

Hey! I am down to reconsider. I think what Narsil meant is that you can have a `Sequence` of pre-tokenizers, to first do the regex split and then apply `ByteLevel` (this would be the "external" part). But to be honest, since there is already a regex inside, why waste it!

ArthurZucker avatar Jun 11 '24 12:06 ArthurZucker
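For reference, here is a minimal sketch of the `Sequence`-based workaround described above, using the tokenizers Python bindings. The split pattern is the cl100k_base one quoted earlier, and the `behavior="isolated"` and `add_prefix_space=False` choices are assumptions about what a tiktoken-style setup wants, not something stated in this thread.

```python
# Sketch of the workaround: run the custom split regex first, then ByteLevel with
# its built-in GPT-2 regex disabled. Not the PR's implementation, just one way to
# express the same splitting with existing building blocks.
from tokenizers import Regex, Tokenizer, pre_tokenizers
from tokenizers.models import BPE

# Same cl100k_base pattern as quoted above (verify against the tiktoken sources).
GPT4_SPLIT_PATTERN = (
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}"
    r"| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
)

pre_tok = pre_tokenizers.Sequence([
    # Keep each regex match as its own piece ("isolated"), mirroring how tiktoken
    # feeds matches to BPE.
    pre_tokenizers.Split(Regex(GPT4_SPLIT_PATTERN), behavior="isolated"),
    # Byte-level mapping only; the internal GPT-2 regex is turned off.
    pre_tokenizers.ByteLevel(add_prefix_space=False, use_regex=False),
])

# Quick check of the pre-tokenization on its own:
print(pre_tok.pre_tokenize_str("Hello world, it's 12345!"))

# Attaching it to a tokenizer (vocab/merges omitted here):
tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = pre_tok
```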

@gcampax do you want to rebase and I'll review?

ArthurZucker avatar Jun 11 '24 12:06 ArthurZucker

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.