llama : tokenizer unicode codepoint categories
Add all unicode categories to unicode-data.cpp.
Currently we are limited to the high-level categories:
- C, L, M, N, P, S, Z.
This PR allows access to subcategories:
- Cn, Cc, Cf, Co, Cs, Ll, Lm, Lo, Lt, Lu, Mc, Me, Mn, Nd, Nl, No, Pc, Pd, Pe, Pf, Pi, Po, Ps, Sc, Sk, Sm, So, Zl, Zp, Zs.
Related PR: https://github.com/ggerganov/llama.cpp/pull/8579, whose regex uses Lu, Lt, Lm, Lo, etc.
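For reference, a minimal sketch of one possible way to encode the subcategory data (the enum values and helper names below are illustrative assumptions, not the actual unicode.h / unicode-data.cpp definitions): keeping the top-level category in the upper bits keeps the existing broad checks cheap while still exposing the exact subcategory.

```cpp
// Illustrative sketch only: values and names are assumptions, not the PR's actual layout.
#include <cstdint>

enum unicode_category : uint16_t {
    // top-level categories in the upper byte
    UNICODE_CAT_C = 0x0100,  // Other
    UNICODE_CAT_L = 0x0200,  // Letter
    UNICODE_CAT_M = 0x0300,  // Mark
    UNICODE_CAT_N = 0x0400,  // Number
    UNICODE_CAT_P = 0x0500,  // Punctuation
    UNICODE_CAT_S = 0x0600,  // Symbol
    UNICODE_CAT_Z = 0x0700,  // Separator
    // subcategories: parent in the upper byte, index in the lower byte
    UNICODE_CAT_Ll = UNICODE_CAT_L | 1,
    UNICODE_CAT_Lm = UNICODE_CAT_L | 2,
    UNICODE_CAT_Lo = UNICODE_CAT_L | 3,
    UNICODE_CAT_Lt = UNICODE_CAT_L | 4,
    UNICODE_CAT_Lu = UNICODE_CAT_L | 5,
    UNICODE_CAT_Po = UNICODE_CAT_P | 5,
    UNICODE_CAT_Zs = UNICODE_CAT_Z | 3,
    // ... remaining subcategories
};

// broad category check stays a cheap mask, subcategory check is an equality
static bool is_letter   (uint16_t cat) { return (cat & 0xFF00) == UNICODE_CAT_L; }
static bool is_lowercase(uint16_t cat) { return cat == UNICODE_CAT_Ll; }
```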
TODO: Implement unicode regex collapse trick for all subcategories.
- [x] I have read the contributing guidelines
- Self-reported review complexity:
- [ ] Low
- [x] Medium
- [ ] High
Nice! This should also help fix (at least part of) Falcon's tokenization, because the Punctuation pre-tokenizer type uses the Po category and not the broader P one.
(ref: https://github.com/huggingface/tokenizers/blob/4ea2f235b0430f5db09f867b65306d6c0a5ec7ed/tokenizers/src/pre_tokenizers/punctuation.rs#L8, which uses Rust's is_ascii_punctuation and is_punctuation)
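To make the Po vs P difference concrete, here is a small standalone example. The codepoints and their subcategories come from the Unicode tables; nothing here is Falcon's or the tokenizers crate's actual code.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

int main() {
    // subcategories of three punctuation codepoints, per the Unicode data
    const std::map<uint32_t, std::string> subcat = {
        { 0x0028, "Ps" },  // '('  open punctuation
        { 0x00AB, "Pi" },  // '«'  initial quote punctuation
        { 0x002C, "Po" },  // ','  other punctuation
    };
    for (const auto & [cpt, cat] : subcat) {
        (void) cpt;
        const bool matches_P  = (cat[0] == 'P'); // broad \p{P}: all three match
        const bool matches_Po = (cat == "Po");   // exact \p{Po}: only ','
        assert(matches_P);
        (void) matches_Po;
    }
    return 0;
}
```

With only the broad P data available, '(' and '«' would be treated as split points just like ','; with subcategory data, a pre-tokenizer keyed on Po can distinguish them.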
The src/llama.cpp conflict should be easy to resolve - just accept the new src/llama.cpp and apply the same changes to src/llama-vocab.cpp instead
TODO: Implement unicode regex collapse trick for all subcategories.
Do you expect any problems with this?
More problems than I thought:
- Need +29 collapse codepoints for subcategories.
- Ranges of collapse codepoints, i.e.: `\p{L}` --> `\p{Ll}` to `\p{Lu}` (Ll, Lm, Lo, Lt, Lu).
- Collapse codepoint for unicode whitespaces to fix the `\s` problem (std::regex ignores non-ASCII `\s`).
- Take care of `\S` and regex lookaheads, i.e.: `(?!\S)` (see the sketch after this list).
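For context, a minimal sketch of the collapse idea under the assumption of private-use placeholder codepoints and a toy classifier (none of these names or values are the actual unicode.cpp API): the text is mapped one-to-one onto placeholder codepoints, the `\p{...}` classes and `\s` in the regex are rewritten to match those placeholders, and the match offsets are applied back to the original text.

```cpp
// Sketch only: placeholder values, table contents and helper names are
// assumptions, not the actual unicode.cpp implementation.
#include <cstdint>
#include <vector>

// one placeholder per subcategory, taken from the private use area; the
// subcategories of one top-level category are contiguous, so \p{L} can be
// rewritten as the range [COLLAPSE_Ll-COLLAPSE_Lu]
static const uint32_t COLLAPSE_Ll = 0xE000;
static const uint32_t COLLAPSE_Lm = 0xE001;
static const uint32_t COLLAPSE_Lo = 0xE002;
static const uint32_t COLLAPSE_Lt = 0xE003;
static const uint32_t COLLAPSE_Lu = 0xE004;
static const uint32_t COLLAPSE_Nd = 0xE010;
// ... +29 placeholders in total, one per subcategory
static const uint32_t COLLAPSE_WS    = 0xE020;  // any unicode whitespace, for \s
static const uint32_t COLLAPSE_OTHER = 0xE0FF;

// toy classifier standing in for the real tables in unicode-data.cpp
static uint32_t collapse_placeholder(uint32_t cpt) {
    if (cpt == ' ' || cpt == '\t' || cpt == '\n' || cpt == 0x00A0) return COLLAPSE_WS;
    if (cpt >= 'A' && cpt <= 'Z') return COLLAPSE_Lu;
    if (cpt >= 'a' && cpt <= 'z') return COLLAPSE_Ll;
    if (cpt >= '0' && cpt <= '9') return COLLAPSE_Nd;
    return COLLAPSE_OTHER;
}

// map each codepoint onto its placeholder; the result has the same length as
// the input, so regex match offsets transfer back to the original text
static std::vector<uint32_t> collapse(const std::vector<uint32_t> & cpts) {
    std::vector<uint32_t> out;
    out.reserve(cpts.size());
    for (const uint32_t cpt : cpts) {
        out.push_back(collapse_placeholder(cpt));
    }
    return out;
}

// regex side (rewritten before compiling with std::wregex):
//   \p{Lu}  ->  COLLAPSE_Lu
//   \p{L}   ->  [COLLAPSE_Ll-COLLAPSE_Lu]
//   \s      ->  COLLAPSE_WS
//   \S and lookaheads like (?!\S)  ->  negated placeholder classes, e.g. [^COLLAPSE_WS]
```

The +29 placeholders and the contiguous per-category ranges are what the list above refers to; the whitespace placeholder is the workaround for std::regex ignoring non-ASCII `\s`.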
I tested all available BPE models (with a subset of the brute-force tests), including tekken. Same results as before this PR.
Also tested the original tekken regex, and it seems correct too.
The reimplementation is not easy to follow without context. I want to add more comments and try to explain all the steps/blocks of code.