YouTokenToMe
No word tokenizer under the hood?
Hi,
In the original BPE paper, as well as in the BPE-dropout paper, the authors apply word-based tokenization (namely the Moses tokenizer, among others) before running the main algorithm. However, this project's README is somewhat vague on that detail. Do I understand correctly that the only word-level tokenization implemented is splitting on whitespace, and nothing more?
What confuses me is this quote from the README: "ours does not consider tokens that cross word boundaries". For some languages it's impossible to avoid tokens that cross word boundaries when words are delimited by spaces alone. So my question is as follows: is there a more sophisticated word-based tokenizer under the hood after all?
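To make the contrast concrete, here is a toy sketch of the two pre-tokenization behaviors I'm asking about: plain whitespace splitting (which I assume is what this library does) versus a very rough Moses-style rule that also detaches punctuation. Both functions are my own illustrations, not code from this repository:

```python
import re

def whitespace_pretokenize(text):
    # The simplest possible "word" segmentation: split on runs of
    # whitespace. I assume this is all that happens before BPE here.
    return text.split()

def moses_like_pretokenize(text):
    # A crude approximation of a Moses-style tokenizer that also
    # separates punctuation from adjacent words (hypothetical, for
    # contrast only; the real Moses tokenizer has many more rules).
    return re.findall(r"\w+|[^\w\s]", text, re.UNICODE)

print(whitespace_pretokenize("Hello, world!"))   # ['Hello,', 'world!']
print(moses_like_pretokenize("Hello, world!"))   # ['Hello', ',', 'world', '!']

# For a language written without spaces, whitespace splitting treats the
# whole sentence as a single "word", so BPE merges could freely cross
# the real word boundaries inside it:
print(whitespace_pretokenize("これはテストです"))  # ['これはテストです']
```

Under whitespace splitting, `"Hello,"` is one unit, so subwords like `o,` that straddle the word/punctuation boundary are learnable, and an unsegmented sentence is one giant unit; that is exactly the case where "not crossing word boundaries" based on spaces alone seems impossible.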