GPT-J Will Not Accept Certain Tokens in Prompt

Open danforbes opened this issue 2 years ago • 5 comments

GPT-J fails to tokenize certain characters when they appear in a prompt. So far I have only been able to induce this behavior with a `!` character, but I haven't performed an exhaustive search.

llm: ./target/release/llm gptj infer -m ~/.ggml-models/gpt4all-j-v1.3-groovy.bin -p "!"
✓ Loaded 285 tensors (3.8 GB) after 1980ms

[2023-05-11T14:36:15Z ERROR llm] Failed to tokenize initial prompt.

danforbes avatar May 11 '23 14:05 danforbes

Our current tokenizer is built around scores. Perhaps we should use a simpler tokenizer for models whose tokens are known to have no scores?

philpax avatar May 11 '23 14:05 philpax
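
One way the "simpler tokenizer" idea might look is a greedy longest-match over the vocabulary that ignores scores entirely. This is only a sketch under stated assumptions: `GreedyTokenizer`, its fields, and the toy vocabulary used with it are hypothetical and are not taken from the llm codebase, and greedy matching will not necessarily reproduce the token boundaries the model was trained with.

```rust
use std::collections::HashMap;

/// Hypothetical score-free tokenizer: greedy longest-prefix match over bytes.
/// The token id is simply the token's index in the vocabulary list.
struct GreedyTokenizer {
    vocab: HashMap<Vec<u8>, u32>,
    max_len: usize, // length in bytes of the longest vocabulary entry
}

impl GreedyTokenizer {
    fn new(tokens: &[&str]) -> Self {
        let mut vocab = HashMap::new();
        let mut max_len = 0;
        for (id, tok) in tokens.iter().enumerate() {
            max_len = max_len.max(tok.len());
            vocab.insert(tok.as_bytes().to_vec(), id as u32);
        }
        Self { vocab, max_len }
    }

    /// Returns the token ids, or the byte offset at which no vocabulary
    /// entry matched, so the caller can report exactly where tokenization failed.
    fn tokenize(&self, text: &str) -> Result<Vec<u32>, usize> {
        let bytes = text.as_bytes();
        let mut ids = Vec::new();
        let mut pos = 0;
        while pos < bytes.len() {
            let longest = self.max_len.min(bytes.len() - pos);
            // Try the longest candidate first, then shrink one byte at a time.
            let hit = (1..=longest).rev().find_map(|len| {
                self.vocab
                    .get(&bytes[pos..pos + len])
                    .map(|&id| (id, len))
            });
            match hit {
                Some((id, len)) => {
                    ids.push(id);
                    pos += len;
                }
                None => return Err(pos),
            }
        }
        Ok(ids)
    }
}
```

With a vocabulary that contains `!` as its own entry, `tokenize("!")` succeeds regardless of whether any score is attached to that token, and a prompt that cannot be covered by the vocabulary yields the offending byte offset instead of a generic failure.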

Couldn't we use Hugging Face's tokenizer? Then we would have parity with nearly every implementation out there 🤔

LLukas22 avatar May 14 '23 13:05 LLukas22

Yeah maybe, see #35

philpax avatar May 15 '23 23:05 philpax

@RedBoxing - can you see if this is fixed on your RWKV branch?

danforbes avatar May 19 '23 14:05 danforbes

No issues at all!

RedBoxing avatar May 19 '23 15:05 RedBoxing