LLMs are worse at non-English languages
Andrej noted in his YouTube video that LLMs are worse at non-English languages, partly due to tokenization. Basically, for less-represented languages, even frequent character pairs appear less often in the training corpus than most English pairs. Hence, fewer BPE merges happen for these languages, and their token representations end up lengthy. Wouldn't it be a good idea to build tokenizers separately for each language and for distinct domains, e.g., Python code?
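To see the effect concretely, here is a minimal sketch that compares token counts for the same sentence in English and Hindi. It assumes the tiktoken package and the cl100k_base encoding purely for illustration; any BPE vocabulary trained on an English-heavy corpus shows a similar gap, and the Hindi sentence is a rough translation.

```python
# Compare BPE token counts for the same sentence in two languages.
# Assumes `pip install tiktoken`; cl100k_base is one of OpenAI's encodings,
# chosen here only as an example of an English-heavy BPE vocabulary.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "Hello, how are you today?",
    "Hindi": "नमस्ते, आज आप कैसे हैं?",  # roughly the same sentence
}

for language, text in samples.items():
    tokens = enc.encode(text)
    # Fewer merges were learned for under-represented scripts, so we expect
    # more tokens per character, i.e. a lengthier representation.
    print(f"{language:8s} chars={len(text):3d} tokens={len(tokens):3d} "
          f"tokens/char={len(tokens) / len(text):.2f}")
```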
I genuinely think building separate tokenizers for each language or domain, like Python code, is a great idea. It just makes sense: different languages and fields have their own structure and quirks, so a one-size-fits-all tokenizer isn't always ideal. In programming, for instance, tokens that align with common syntax let the model represent code more compactly and work with it more effectively. And for less-represented languages, a dedicated tokenizer could really help preserve meaning and improve results. It does add some complexity, but the potential for better understanding and performance makes it well worth exploring.
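As a sketch of what a domain-specific tokenizer could look like, the snippet below trains a tiny byte-level BPE vocabulary on a handful of Python snippets using the Hugging Face tokenizers package. The package choice, the in-memory toy corpus, and the vocab size are all assumptions for illustration, not a production recipe.

```python
# Sketch: train a small domain-specific BPE tokenizer on Python code.
# Assumes `pip install tokenizers`; corpus and vocab_size are toy values.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.decoders import ByteLevel as ByteLevelDecoder

# Toy "domain corpus"; a real one would be millions of source files.
python_corpus = [
    "def add(a, b):\n    return a + b\n",
    "for i in range(10):\n    print(i)\n",
    "class Point:\n    def __init__(self, x, y):\n        self.x, self.y = x, y\n",
    "import numpy as np\narr = np.zeros((3, 3))\n",
]

tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = ByteLevel()   # operate on raw bytes, GPT-2 style
tokenizer.decoder = ByteLevelDecoder()

trainer = BpeTrainer(
    vocab_size=500,
    initial_alphabet=ByteLevel.alphabet(),  # cover every byte up front
    special_tokens=["<|endoftext|>"],
)
tokenizer.train_from_iterator(python_corpus, trainer=trainer)

# Frequent code constructs ("def ", "return ", indentation) tend to merge
# into single tokens, so code is represented more compactly than with a
# general-purpose tokenizer trained mostly on natural language.
encoded = tokenizer.encode("def square(x):\n    return x * x\n")
print(encoded.tokens)
```

The same recipe applies to an under-represented language: train on a corpus of that language so its frequent character pairs actually get merged, instead of competing with English pairs for merge slots.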