tokenizers Allow users to select/write encoding strategies

Open pietrolesci opened this issue 4 months ago • 2 comments

Hi there,

Do you plan to add the possibility to control how tokenizers behave at inference time?

For example, the user could decide whether to use standard BPE (merge-based) inference or, e.g., the longest-prefix encoding strategy. See "Greed is All You Need: An Evaluation of Tokenizer Inference Methods" for why this can be useful.

Thanks in advance for your time!

Best, Pietro

Example. Consider a BPE tokenizer with merges M = {yu, yum, my} and initial alphabet A = {y, u, m}. Given the string s = yummy, the standard BPE merge-based strategy tokenizes s as yu | m | my while BPE with the longest prefix encoding strategy tokenizes s as yum | my.
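
To make the difference between the two strategies concrete, here is a minimal pure-Python sketch. It does not use the tokenizers API. The function names and the merge list are hypothetical: the example above gives the merges as a set without an order, so this sketch uses its own ordered merge list, chosen so that the two strategies disagree.

```python
def bpe_merge_encode(word, merges):
    """Merge-based BPE inference: repeatedly apply the highest-priority
    merge (lowest index in `merges`) that matches an adjacent pair."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # Collect (rank, position) for every adjacent pair that is a known merge.
        candidates = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in ranks
        ]
        if not candidates:
            break
        # Best rank first; ties are broken by the leftmost position.
        _, i = min(candidates)
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


def longest_prefix_encode(word, vocab):
    """Greedy longest-prefix inference: at each position, take the longest
    vocabulary entry that matches the remaining text."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary entry matches at position {i}")
    return tokens


# Hypothetical ordered merge list (an assumption, not taken from the issue).
merges = [("m", "m"), ("y", "u"), ("yu", "m")]
alphabet = {"y", "u", "m"}
# The vocabulary is the alphabet plus every string produced by a merge.
vocab = alphabet | {left + right for left, right in merges}

print(bpe_merge_encode("yummy", merges))     # ['yu', 'mm', 'y']
print(longest_prefix_encode("yummy", vocab))  # ['yum', 'm', 'y']
```

Both functions share the same vocabulary; only the inference procedure differs, which is the knob this issue asks to expose.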

Oct 16 '24 10:10 pietrolesci
