ggml
starcoder : add support for starchat special tokens
cc @TheBloke
Nice!
I noticed that the gpt_tokenize() function doesn't handle special tokens properly due to two issues:
- It creates the regex without escaping special characters in tokens, so it doesn't work for tokens like '<|end|>', which contains the special regex character '|'.
- It combines the special token pattern with the normal word pattern, which greedily splits ' <|end|>' into ' <|', 'end', '|>'.
So I updated gpt_tokenize() to handle this.
For 1, I simply escaped the special characters.
For 2, I first split the text by special tokens and then split the substrings in-between special tokens into words using the normal word pattern.
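
For illustration, the two fixes could be sketched roughly as below. This is a minimal standalone sketch, not the code from this PR: the helper names `escape_regex`, `split_by_special_tokens`, `split_words`, and `pre_tokenize` are hypothetical, and the word pattern is assumed to be a GPT-2 style pattern. Mapping the resulting pieces to vocabulary ids is left out.

```cpp
#include <regex>
#include <string>
#include <vector>

// Fix 1: prefix every regex metacharacter with a backslash so that a token
// such as "<|end|>" is matched literally instead of as an alternation.
static std::string escape_regex(const std::string & token) {
    static const std::string meta = "\\^$.|?*+()[]{}";
    std::string out;
    for (char c : token) {
        if (meta.find(c) != std::string::npos) {
            out += '\\';
        }
        out += c;
    }
    return out;
}

// A piece of the input: either a special token or the plain text between two.
struct Piece {
    std::string text;
    bool        is_special;
};

// Fix 2, step 1: split the text by special tokens only.
static std::vector<Piece> split_by_special_tokens(
        const std::string & text,
        const std::vector<std::string> & special_tokens) {
    std::string pattern;
    for (const auto & tok : special_tokens) {
        if (tok.empty()) {
            continue; // an empty alternative would match everywhere
        }
        if (!pattern.empty()) {
            pattern += "|";
        }
        pattern += escape_regex(tok);
    }

    std::vector<Piece> pieces;
    if (pattern.empty()) {
        if (!text.empty()) {
            pieces.push_back({text, false});
        }
        return pieces;
    }

    const std::regex re(pattern);
    std::smatch m;
    auto begin = text.cbegin();
    while (std::regex_search(begin, text.cend(), m, re)) {
        if (m.prefix().length() > 0) {
            pieces.push_back({m.prefix().str(), false});
        }
        pieces.push_back({m.str(0), true});
        begin = m[0].second;
    }
    if (begin != text.cend()) {
        pieces.push_back({std::string(begin, text.cend()), false});
    }
    return pieces;
}

// Fix 2, step 2: split plain text into words.
// Assumption: a GPT-2 style word pattern stands in for the "normal word pattern".
static std::vector<std::string> split_words(const std::string & text) {
    static const std::regex word_re(
        R"('s|'t|'re|'ve|'m|'ll|'d| ?[[:alpha:]]+| ?[[:digit:]]+| ?[^\s[:alpha:][:digit:]]+|\s+(?!\S)|\s+)");
    std::vector<std::string> words;
    for (std::sregex_iterator it(text.begin(), text.end(), word_re), end; it != end; ++it) {
        words.push_back(it->str());
    }
    return words;
}

// Special tokens are kept whole; only the text in between is split into words.
static std::vector<std::string> pre_tokenize(
        const std::string & text,
        const std::vector<std::string> & special_tokens) {
    std::vector<std::string> out;
    for (const auto & piece : split_by_special_tokens(text, special_tokens)) {
        if (piece.is_special) {
            out.push_back(piece.text);
        } else {
            for (const auto & w : split_words(piece.text)) {
                out.push_back(w);
            }
        }
    }
    return out;
}
```

With this sketch, `pre_tokenize("Hello <|end|> world", {"<|end|>"})` keeps `<|end|>` as a single piece instead of breaking it into ' <|', 'end', '|>'.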