Bpe clones
This PR trims allocations a bit. It's unlikely there are any juicy gains here, but the changes also happen to lead to terser, clearer code, so this might be an uncontroversial win.
It looks like this touches the code path that led to the slowdown in #1564; I'll look into whether this is something easy to fix.
Rebased on main to pick up the new clippy fixes, which should fix the CI run.
Hey! Did you have any lead regarding this? 🤗
I made a comment: the reported slowdown might be explained by the increased number of allocations when going from `String` to `Vec<String>`. The supplied tokenizer is WordPiece, which also hits the `cleanup` function and causes lots of allocations because of the long `s.replace(..).replace(..).replace(..)` chain, where each call allocates a fresh `String`. However, this is only a guess, and I stopped looking further after confirming that this PR improves the regression by allocating slightly less.
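To illustrate why such a chain is allocation-heavy, here's a minimal sketch. This is not the actual `tokenizers` source, and the replacement patterns are made up for the example; it just contrasts a `str::replace` chain (one full-string allocation per call) with a single pass over one reusable buffer:

```rust
// Hypothetical illustration, not the real `cleanup` implementation:
// each `str::replace` scans the input and allocates a new `String`,
// so a chain of N calls performs N full-string allocations.
fn cleanup_chained(s: &str) -> String {
    s.replace(" .", ".")
        .replace(" ?", "?")
        .replace(" !", "!")
        .replace(" ,", ",")
}

// Single-pass alternative: allocate one output `String` up front and
// append to it, skipping the intermediate copies entirely.
fn cleanup_single_pass(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    let mut chars = s.chars().peekable();
    while let Some(c) = chars.next() {
        // Drop a space that directly precedes punctuation.
        if c == ' ' && matches!(chars.peek().copied(), Some('.' | '?' | '!' | ',')) {
            continue;
        }
        out.push(c);
    }
    out
}

fn main() {
    let input = "Hello , world ! How are you ?";
    assert_eq!(cleanup_chained(input), cleanup_single_pass(input));
    println!("{}", cleanup_single_pass(input));
}
```

The single-pass version does one allocation regardless of how many patterns are handled, which is the kind of saving that would matter on a hot decode path.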