minbpe
minbpe copied to clipboard
Alternative to bpe
Maybe I am completely wrong, but to me using something like bpe to build an encoding for text feels stupid. Sure, it is a fairly easy way and it will build an encoding that is efficient in terms of sequence length, but is that the only requirement for such an encoding? Would using an encoding that makes sense not make training and inference easier? Should we not engineer an encoding instead? Using a-priori knowledge of languages from dictionaries for instance?