
Bpe clones

Open sftse opened this pull request 11 months ago • 4 comments

This PR tightens up the allocations a bit. It's unlikely there are any juicy gains here, but the changes also happen to lead to terser, clearer code, so this might be an uncontroversial win.
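For illustration, a minimal sketch of the kind of clone-avoidance change this describes (hypothetical code, not the actual diff; `rank_of_*` and the vocab layout are made up): a `HashMap<String, _>` can be queried with a `&str` key directly via `Borrow`, so building a temporary owned `String` for the lookup is pure allocation overhead.

```rust
use std::collections::HashMap;

// Before: allocates a temporary String just to perform the lookup.
fn rank_of_before(vocab: &HashMap<String, u32>, symbol: &str) -> Option<u32> {
    vocab.get(&symbol.to_string()).copied()
}

// After: query with &str directly; HashMap<String, _> accepts it
// because String: Borrow<str>, so no temporary allocation is needed.
fn rank_of_after(vocab: &HashMap<String, u32>, symbol: &str) -> Option<u32> {
    vocab.get(symbol).copied()
}

fn main() {
    let vocab: HashMap<String, u32> =
        [("he".to_string(), 0), ("llo".to_string(), 1)].into();
    assert_eq!(rank_of_before(&vocab, "he"), rank_of_after(&vocab, "he"));
}
```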

sftse avatar Dec 28 '24 17:12 sftse

It looks like this touches the codepath that led to the slowdown in #1564; I'll look into whether this is something easy to fix.

sftse avatar Dec 30 '24 05:12 sftse

Rebased on main to pick up the new clippy fixes; this should fix the CI run.

sftse avatar Jan 03 '25 11:01 sftse

> It looks like this touches the codepath that led to the slowdown in https://github.com/huggingface/tokenizers/issues/1564; I'll look into whether this is something easy to fix.

Hey! Did you have any lead regarding this? 🤗

ArthurZucker avatar May 27 '25 09:05 ArthurZucker

> It looks like this touches the codepath that led to the slowdown in #1564; I'll look into whether this is something easy to fix.

> Hey! Did you have any lead regarding this? 🤗

I made a comment: the reported slowdown might be explained by the increased number of allocations when going from String to Vec<String>. The supplied tokenizer is WordPiece, which also hits the fn cleanup function and causes lots of allocations because of the long s.replace(..).replace(..).replace(..)... chain. However, this is only a guess, and I stopped looking further after confirming that this PR improves the regression by allocating slightly less.
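To illustrate the allocation pattern being described (a sketch of the pattern, not the exact tokenizers code; the replacement pairs here are made up): each `replace` call scans the whole string and allocates a fresh `String`, so an n-step chain does n full passes and n allocations even when nothing matches. A single-pass rewrite allocates one output buffer instead.

```rust
// Chained form: one pass and one String allocation per replace call.
fn cleanup_chained(s: &str) -> String {
    s.replace(" .", ".")
        .replace(" ?", "?")
        .replace(" !", "!")
        .replace(" ,", ",")
}

// Single-pass form: one output buffer, one scan; drops a space that
// immediately precedes one of the punctuation characters.
fn cleanup_single_pass(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    let mut chars = s.chars().peekable();
    while let Some(c) = chars.next() {
        if c == ' ' && matches!(chars.peek().copied(), Some('.' | '?' | '!' | ',')) {
            continue; // skip the space; the punctuation is pushed next iteration
        }
        out.push(c);
    }
    out
}

fn main() {
    let s = "hello , world !";
    assert_eq!(cleanup_chained(s), cleanup_single_pass(s));
}
```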

sftse avatar May 27 '25 13:05 sftse