llama : tokenizer unicode codepoint categories
Add all unicode categories to unicode-data.cpp.
Currently we are limited to the high-level categories:
- C, L, M, N, P, S, Z.
This PR allows access to subcategories:
- Cn, Cc, Cf, Co, Cs, Ll, Lm, Lo, Lt, Lu, Mc, Me, Mn, Nd, Nl, No, Pc, Pd, Pe, Pf, Pi, Po, Ps, Sc, Sk, Sm, So, Zl, Zp, Zs.
Related PR: https://github.com/ggerganov/llama.cpp/pull/8579, whose regex uses Lu, Lt, Lm, Lo, etc.
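For reference, a minimal sketch of one possible way to encode the subcategory data (the enum values and helper names below are illustrative assumptions, not the actual unicode.h / unicode-data.cpp definitions): keeping the top-level category in the upper bits keeps the existing broad checks cheap while still exposing the exact subcategory.

```cpp
// Illustrative sketch only: values and names are assumptions, not the PR's actual layout.
#include <cstdint>

enum unicode_category : uint16_t {
    // top-level categories in the upper byte
    UNICODE_CAT_C = 0x0100,  // Other
    UNICODE_CAT_L = 0x0200,  // Letter
    UNICODE_CAT_M = 0x0300,  // Mark
    UNICODE_CAT_N = 0x0400,  // Number
    UNICODE_CAT_P = 0x0500,  // Punctuation
    UNICODE_CAT_S = 0x0600,  // Symbol
    UNICODE_CAT_Z = 0x0700,  // Separator
    // subcategories: parent in the upper byte, index in the lower byte
    UNICODE_CAT_Ll = UNICODE_CAT_L | 1,
    UNICODE_CAT_Lm = UNICODE_CAT_L | 2,
    UNICODE_CAT_Lo = UNICODE_CAT_L | 3,
    UNICODE_CAT_Lt = UNICODE_CAT_L | 4,
    UNICODE_CAT_Lu = UNICODE_CAT_L | 5,
    UNICODE_CAT_Po = UNICODE_CAT_P | 5,
    UNICODE_CAT_Zs = UNICODE_CAT_Z | 3,
    // ... remaining subcategories
};

// broad category check stays a cheap mask, subcategory check is an equality
static bool is_letter   (uint16_t cat) { return (cat & 0xFF00) == UNICODE_CAT_L; }
static bool is_lowercase(uint16_t cat) { return cat == UNICODE_CAT_Ll; }
```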
TODO: Implement unicode regex collapse trick for all subcategories.
- [x] I have read the contributing guidelines
- Self-reported review complexity:
- [ ] Low
- [x] Medium
- [ ] High
Nice! This should also help fix (at least part of) Falcon's tokenization, because the Punctuation pre-tokenizer type uses the Po category and not the broader P one.
(ref: https://github.com/huggingface/tokenizers/blob/4ea2f235b0430f5db09f867b65306d6c0a5ec7ed/tokenizers/src/pre_tokenizers/punctuation.rs#L8, which uses Rust's is_ascii_punctuation and is_punctuation)
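To make the Po vs P difference concrete, here is a small standalone example. The codepoints and their subcategories come from the Unicode tables; nothing here is Falcon's or the tokenizers crate's actual code.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

int main() {
    // subcategories of three punctuation codepoints, per the Unicode data
    const std::map<uint32_t, std::string> subcat = {
        { 0x0028, "Ps" },  // '('  open punctuation
        { 0x00AB, "Pi" },  // '«'  initial quote punctuation
        { 0x002C, "Po" },  // ','  other punctuation
    };
    for (const auto & [cpt, cat] : subcat) {
        (void) cpt;
        const bool matches_P  = (cat[0] == 'P'); // broad \p{P}: all three match
        const bool matches_Po = (cat == "Po");   // exact \p{Po}: only ','
        assert(matches_P);
        (void) matches_Po;
    }
    return 0;
}
```

With only the broad P data available, '(' and '«' would be treated as split points just like ','; with subcategory data, a pre-tokenizer keyed on Po can distinguish them.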
The src/llama.cpp conflict should be easy to resolve - just accept the new src/llama.cpp and apply the same changes to src/llama-vocab.cpp instead
TODO: Implement unicode regex collapse trick for all subcategories.
Do you expect any problems with this?
More problems than I thought:
- Need +29 collapse codepoints for subcategories.
- Ranges of collapse codepoints, i.e.: `\p{L}` --> `\p{Ll}` to `\p{Lu}` (Ll, Lm, Lo, Lt, Lu).
- Collapse codepoint for unicode whitespaces to fix the `\s` problem (std::regex ignores non-ASCII `\s`).
- Take care of `\S` and regex lookaheads, i.e.: `(?!\S)` (see the sketch after this list).
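For context, a minimal sketch of the collapse idea under the assumption of private-use placeholder codepoints and a toy classifier (none of these names or values are the actual unicode.cpp API): the text is mapped one-to-one onto placeholder codepoints, the `\p{...}` classes and `\s` in the regex are rewritten to match those placeholders, and the match offsets are applied back to the original text.

```cpp
// Sketch only: placeholder values, table contents and helper names are
// assumptions, not the actual unicode.cpp implementation.
#include <cstdint>
#include <vector>

// one placeholder per subcategory, taken from the private use area; the
// subcategories of one top-level category are contiguous, so \p{L} can be
// rewritten as the range [COLLAPSE_Ll-COLLAPSE_Lu]
static const uint32_t COLLAPSE_Ll = 0xE000;
static const uint32_t COLLAPSE_Lm = 0xE001;
static const uint32_t COLLAPSE_Lo = 0xE002;
static const uint32_t COLLAPSE_Lt = 0xE003;
static const uint32_t COLLAPSE_Lu = 0xE004;
static const uint32_t COLLAPSE_Nd = 0xE010;
// ... +29 placeholders in total, one per subcategory
static const uint32_t COLLAPSE_WS    = 0xE020;  // any unicode whitespace, for \s
static const uint32_t COLLAPSE_OTHER = 0xE0FF;

// toy classifier standing in for the real tables in unicode-data.cpp
static uint32_t collapse_placeholder(uint32_t cpt) {
    if (cpt == ' ' || cpt == '\t' || cpt == '\n' || cpt == 0x00A0) return COLLAPSE_WS;
    if (cpt >= 'A' && cpt <= 'Z') return COLLAPSE_Lu;
    if (cpt >= 'a' && cpt <= 'z') return COLLAPSE_Ll;
    if (cpt >= '0' && cpt <= '9') return COLLAPSE_Nd;
    return COLLAPSE_OTHER;
}

// map each codepoint onto its placeholder; the result has the same length as
// the input, so regex match offsets transfer back to the original text
static std::vector<uint32_t> collapse(const std::vector<uint32_t> & cpts) {
    std::vector<uint32_t> out;
    out.reserve(cpts.size());
    for (const uint32_t cpt : cpts) {
        out.push_back(collapse_placeholder(cpt));
    }
    return out;
}

// regex side (rewritten before compiling with std::wregex):
//   \p{Lu}  ->  COLLAPSE_Lu
//   \p{L}   ->  [COLLAPSE_Ll-COLLAPSE_Lu]
//   \s      ->  COLLAPSE_WS
//   \S and lookaheads like (?!\S)  ->  negated placeholder classes, e.g. [^COLLAPSE_WS]
```

The +29 placeholders and the contiguous per-category ranges are what the list above refers to; the whitespace placeholder is the workaround for std::regex ignoring non-ASCII `\s`.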
I tested all available BPE models (with a subset of the brute-force tests), including tekken. Same results as before this PR.
Also tested the original tekken regex, and it seems correct too.
The reimplementation is not easy to follow without context. I want to add more comments and try to explain all the steps/blocks of code.