
Bpe clones

Open sftse opened this pull request 11 months ago • 4 comments

This PR tightens up the allocations a bit. It's unlikely there are any juicy gains here, but the changes also happen to lead to terser, clearer code, so this might be an uncontroversial win.
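For illustration, a minimal sketch of the kind of clone-avoidance change this describes (hypothetical code, not the actual diff; `rank_of_*` and the vocab layout are made up): a `HashMap<String, _>` can be queried with a `&str` key directly via `Borrow`, so building a temporary owned `String` for the lookup is pure allocation overhead.

```rust
use std::collections::HashMap;

// Before: allocates a temporary String just to perform the lookup.
fn rank_of_before(vocab: &HashMap<String, u32>, symbol: &str) -> Option<u32> {
    vocab.get(&symbol.to_string()).copied()
}

// After: query with &str directly; HashMap<String, _> accepts it
// because String: Borrow<str>, so no temporary allocation is needed.
fn rank_of_after(vocab: &HashMap<String, u32>, symbol: &str) -> Option<u32> {
    vocab.get(symbol).copied()
}

fn main() {
    let vocab: HashMap<String, u32> =
        [("he".to_string(), 0), ("llo".to_string(), 1)].into();
    assert_eq!(rank_of_before(&vocab, "he"), rank_of_after(&vocab, "he"));
}
```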

sftse avatar Dec 28 '24 17:12 sftse

It looks like this touches the codepath that led to the slowdown in #1564; I'll look into whether this is something easy to fix.

sftse avatar Dec 30 '24 05:12 sftse

Rebased on main to pick up the new clippy fixes; this should fix the CI run.

sftse avatar Jan 03 '25 11:01 sftse

> It looks like this touches the codepath that led to the slowdown in https://github.com/huggingface/tokenizers/issues/1564; I'll look into whether this is something easy to fix.

Hey! Did you have any lead regarding this? 🤗

ArthurZucker avatar May 27 '25 09:05 ArthurZucker

> It looks like this touches the codepath that led to the slowdown in #1564; I'll look into whether this is something easy to fix.

> Hey! Did you have any lead regarding this? 🤗

I made a comment: the reported slowdown might be explained by the increased number of allocations when going from String to Vec<String>. The supplied tokenizer is WordPiece, which also hits the fn cleanup function and causes lots of allocations because of the long s.replace(..).replace(..).replace(..)... chain. However, this is only a guess, and I stopped looking further after confirming that this PR improves the regression by allocating slightly less.
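To illustrate the allocation pattern being described (a sketch of the pattern, not the exact tokenizers code; the replacement pairs here are made up): each `replace` call scans the whole string and allocates a fresh `String`, so an n-step chain does n full passes and n allocations even when nothing matches. A single-pass rewrite allocates one output buffer instead.

```rust
// Chained form: one pass and one String allocation per replace call.
fn cleanup_chained(s: &str) -> String {
    s.replace(" .", ".")
        .replace(" ?", "?")
        .replace(" !", "!")
        .replace(" ,", ",")
}

// Single-pass form: one output buffer, one scan; drops a space that
// immediately precedes one of the punctuation characters.
fn cleanup_single_pass(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    let mut chars = s.chars().peekable();
    while let Some(c) = chars.next() {
        if c == ' ' && matches!(chars.peek().copied(), Some('.' | '?' | '!' | ',')) {
            continue; // skip the space; the punctuation is pushed next iteration
        }
        out.push(c);
    }
    out
}

fn main() {
    let s = "hello , world !";
    assert_eq!(cleanup_chained(s), cleanup_single_pass(s));
}
```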

sftse avatar May 27 '25 13:05 sftse