llama : tokenizer unicode codepoint categories

Open jaime-m-p opened this issue 1 year ago • 2 comments

Add all unicode categories to unicode-data.cpp.

Currently we are limited to the top-level categories:

  • C, L, M, N, P, S, Z.

This PR allows access to subcategories:

  • Cn, Cc, Cf, Co, Cs, Ll, Lm, Lo, Lt, Lu, Mc, Me, Mn, Nd, Nl, No, Pc, Pd, Pe, Pf, Pi, Po, Ps, Sc, Sk, Sm, So, Zl, Zp, Zs.
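
To illustrate how the two levels relate, here is a minimal sketch of matching a codepoint's two-letter category code against either a top-level category or a subcategory. The helper `category_matches` is hypothetical and is not the representation used in unicode-data.cpp; it only assumes that each codepoint carries a two-letter General_Category code such as "Lu" or "Po".

```cpp
#include <string>

// Hypothetical helper, not part of unicode-data.cpp: checks whether a
// codepoint's two-letter General_Category code (e.g. "Lu") satisfies a
// category name taken from a \p{..} expression.
inline bool category_matches(const std::string & code, const std::string & name) {
    if (code.size() != 2 || name.empty() || name.size() > 2) {
        return false;
    }
    // A one-letter name ("L") matches any code starting with that letter;
    // a two-letter name ("Lu") must match the code exactly.
    return code.compare(0, name.size(), name) == 0;
}
```

With this, `category_matches("Po", "P")` and `category_matches("Po", "Po")` both hold, while `category_matches("Po", "Pd")` does not, which is the distinction that the top-level categories alone cannot express.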

Related PR: https://github.com/ggerganov/llama.cpp/pull/8579, regex using Lu, Lt, Lm, Lo, etc.

TODO: Implement unicode regex collapse trick for all subcategories.


jaime-m-p avatar Jul 20 '24 21:07 jaime-m-p

Nice! This should also help fix (at least part of) Falcon's tokenization, because the Punctuation pre-tokenizer type uses the Po category and not the broader P one.

(ref: https://github.com/huggingface/tokenizers/blob/4ea2f235b0430f5db09f867b65306d6c0a5ec7ed/tokenizers/src/pre_tokenizers/punctuation.rs#L8, which uses Rust's is_ascii_punctuation and is_punctuation)

compilade avatar Jul 21 '24 01:07 compilade

The src/llama.cpp conflict should be easy to resolve - just accept the new src/llama.cpp and apply the same changes to src/llama-vocab.cpp instead

ggerganov avatar Jul 23 '24 10:07 ggerganov

TODO: Implement unicode regex collapse trick for all subcategories.

Do you expect any problems with this?

More problems than I thought:

  • Need +29 collapse codepoints for subcategories.
  • Collapse codepoints must form ranges, e.g. \p{L} --> \p{Ll} to \p{Lu} (Ll, Lm, Lo, Lt, Lu).
  • A collapse codepoint for Unicode whitespace is needed to fix the \s problem (\s in std::regex does not match non-ASCII whitespace).
    • Take care of \S and regex lookaheads, e.g. (?!\S).
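
For readers without the context, the first two points can be sketched as follows. This is a simplified illustration of the collapse idea, not the code in this PR: every codepoint is replaced by a one-character placeholder for its subcategory, each \p{..} in the regex is rewritten to the matching placeholder, and std::regex then runs on the collapsed string. The placeholder letters, the toy classifier `collapse_char`, and the helpers `collapse_regex` and `match_collapsed` are all assumptions made for this sketch. Because the L subcategories get contiguous placeholders ('a' to 'e'), \p{L} becomes the single range [a-e].

```cpp
#include <cstddef>
#include <cstdint>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Toy classifier (hypothetical): a real one would look the codepoint up in
// the generated tables. Returns one placeholder character per subcategory.
static char collapse_char(uint32_t cpt) {
    if ((cpt >= 'a' && cpt <= 'z') || cpt == 0x00E9) return 'a'; // Ll
    if (cpt == 0x02B0)                               return 'b'; // Lm
    if (cpt == 0x4E2D)                               return 'c'; // Lo
    if (cpt == 0x01C5)                               return 'd'; // Lt
    if ((cpt >= 'A' && cpt <= 'Z') || cpt == 0x00C9) return 'e'; // Lu
    if ((cpt >= '0' && cpt <= '9') || cpt == 0x0660) return 'n'; // Nd
    if (cpt == ' ' || cpt == 0x3000)                 return 'z'; // Zs
    return '?';                                                  // anything else
}

// Rewrites \p{..} names into placeholders. Only handles names that appear
// outside of bracket expressions, and only the categories listed here.
static std::string collapse_regex(const std::string & expr) {
    static const std::vector<std::pair<std::string, std::string>> table = {
        {"\\p{Ll}", "a"}, {"\\p{Lm}", "b"}, {"\\p{Lo}", "c"},
        {"\\p{Lt}", "d"}, {"\\p{Lu}", "e"}, {"\\p{L}", "[a-e]"},
        {"\\p{Nd}", "n"}, {"\\p{Zs}", "z"},
    };
    std::string out = expr;
    for (const auto & kv : table) {
        std::size_t pos = 0;
        while ((pos = out.find(kv.first, pos)) != std::string::npos) {
            out.replace(pos, kv.first.size(), kv.second);
            pos += kv.second.size();
        }
    }
    return out;
}

// Returns [begin, end) codepoint indices for every match. The collapsed
// string has exactly one character per codepoint, so offsets map back 1:1.
static std::vector<std::pair<std::size_t, std::size_t>> match_collapsed(
        const std::vector<uint32_t> & cpts, const std::string & expr) {
    std::string collapsed;
    for (uint32_t cpt : cpts) {
        collapsed += collapse_char(cpt);
    }
    const std::regex re(collapse_regex(expr));
    std::vector<std::pair<std::size_t, std::size_t>> out;
    for (std::sregex_iterator it(collapsed.cbegin(), collapsed.cend(), re), end; it != end; ++it) {
        const std::size_t begin = static_cast<std::size_t>(it->position(0));
        out.emplace_back(begin, begin + static_cast<std::size_t>(it->length(0)));
    }
    return out;
}
```

In this sketch the collapsed string contains only placeholders, so the regex cannot mix categories with literal characters; a real implementation would need placeholders that cannot collide with the text itself. Whitespace handling (\s, \S and lookaheads) is left out.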

jaime-m-p avatar Jul 25 '24 23:07 jaime-m-p

I tested all available BPE models, including tekken, with a subset of the brute-force tests. The results are the same as before this PR. I also tested the original tekken regex, and it seems correct too.

The reimplementation is hard to follow without context. I want to add more comments and try to explain every step and block of code.

jaime-m-p avatar Jul 25 '24 23:07 jaime-m-p