starcoder : add support for starchat special tokens

Open marella opened this issue 2 years ago • 2 comments

cc @TheBloke

marella avatar Jun 11 '23 15:06 marella

Nice!

TheBloke avatar Jun 11 '23 16:06 TheBloke

I noticed that the gpt_tokenize() function doesn't handle special tokens properly due to two issues:

  1. It builds the regex without escaping special characters in the tokens, so it fails for tokens like <|end|>, which contain the regex metacharacter |
  2. It combines the special token pattern with the normal word pattern, and the combined pattern greedily splits ' <|end|>' into ' <|', 'end', '|>'

So I updated gpt_tokenize() to handle both. For 1, I simply escaped the special characters. For 2, I first split the text on special tokens, then split the substrings between special tokens into words using the normal word pattern.
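
For anyone following along, here is a rough Python sketch of the two-step split described above. It is not the actual C++ change: `split_words`, `WORD_PATTERN`, and the ASCII-only word pattern are illustrative assumptions, and it only covers the word-splitting stage, not the lookup of token ids in the vocabulary.

```python
import re

# Approximation of a GPT-2 style word pattern (ASCII-only, for illustration).
WORD_PATTERN = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d"
    r"| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+"
    r"|\s+(?!\S)|\s+"
)


def split_words(text, special_tokens):
    """Split text into words while keeping special tokens intact.

    Hypothetical helper: first split on special tokens, then apply the
    normal word pattern only to the text between them.
    """
    if not special_tokens:
        return WORD_PATTERN.findall(text)

    # Fix 1: escape each token so characters like '|' are matched literally.
    # Longer tokens go first so a token never shadows one it is a prefix of.
    escaped = [
        re.escape(token)
        for token in sorted(special_tokens, key=len, reverse=True)
    ]
    # The capturing group makes re.split() return the special tokens too.
    special_pattern = re.compile("(" + "|".join(escaped) + ")")

    words = []
    for piece in special_pattern.split(text):
        if not piece:
            continue  # re.split() yields empty strings at the boundaries
        if piece in special_tokens:
            words.append(piece)  # Fix 2: special tokens are never re-split
        else:
            words.extend(WORD_PATTERN.findall(piece))
    return words


if __name__ == "__main__":
    # The word pattern alone breaks the special token apart.
    print(WORD_PATTERN.findall(" <|end|>"))
    # Splitting on special tokens first keeps it whole.
    print(split_words("Hello <|end|> world", ["<|end|>"]))
```

Escaping matters because an unescaped `<|end|>` is read by the regex engine as an alternation of `<`, `end`, and `>`, so it never matches the token as a whole.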

marella avatar Jun 12 '23 21:06 marella