No word tokenizer under the hood?

Open slowwavesleep opened this issue 3 years ago • 0 comments

Hi,

In the original BPE paper, as well as in the BPE-dropout paper, the authors apply word-based tokenization (namely the Moses tokenizer, among others) before running the main algorithm. However, this project's README is somewhat vague on this detail. Do I understand correctly that the only word-based tokenization implemented here is splitting on spaces, and nothing more?

What confuses me is this quote: "ours does not consider tokens that cross word boundaries". For some languages, spaces alone are not enough to identify word boundaries, so tokens that cross them cannot be ruled out. So my question is as follows: is there a more sophisticated word-based tokenizer under the hood after all?
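
To illustrate the concern, here is a minimal sketch of what space-only splitting would look like. This is not taken from the library; `split_on_spaces` is a hypothetical helper, and the assumption is simply that "word boundaries" means whitespace.

```python
def split_on_spaces(text):
    """Hypothetical pre-tokenizer: treats whitespace as the only word boundary.

    This is an assumption about the behaviour in question, not the
    library's actual code.
    """
    return text.split()


english = "the cat sat on the mat"
japanese = "猫がマットの上に座った"  # written without spaces between words

english_words = split_on_spaces(english)
japanese_words = split_on_spaces(japanese)

# English is split into separate words, so merges stay inside each word.
print(english_words)

# The Japanese sentence comes back as a single "word", so any merge learned
# inside it may span several real words.
print(japanese_words)

# Punctuation stays attached to the neighbouring word, unlike with a
# Moses-style tokenizer, which would separate it.
print(split_on_spaces("Hello, world!"))
```

If this is indeed what happens, a whole sentence in a language written without spaces is treated as one word, and the learned tokens can cross real word boundaries.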

slowwavesleep, May 17 '21 13:05