
Bug: Or Feature? BPE Tokenization mutates whitespaces into double-whitespace tokens when add_prefix_space is true (default)

Open cmp-nct opened this issue 1 year ago • 0 comments

What happened?

This has already been discussed a bit here: https://github.com/ggerganov/llama.cpp/issues/7938

Example: <|assistant|> followed by a single space:

32001 -> '<|assistant|>'
   259 -> '  '

Also <|assistant|>\n:

32001 -> '<|assistant|>'
29871 -> ' '
    13 -> '
'

The single whitespace that follows a special token is turned into a double-whitespace token (259), because add_prefix_space is triggered in llama.cpp when a special token is encountered: an extra space is prepended to the text that follows the special token.

In the second example the template actually wants a \n directly after <|assistant|>, but the same behavior sneaks a space (29871) in between.
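
To make the mechanism concrete, here is a minimal toy sketch in Python. It is not llama.cpp's tokenizer: the vocabulary contains only the four token ids quoted above, the function names are hypothetical, and the prefix rule is modeled purely from the description in this issue (a space is prepended to the text fragment that follows a special token). Under those assumptions it reproduces both token sequences.

```python
# Toy model of the behavior described in this issue. NOT llama.cpp code.
# The vocabulary holds only the token ids quoted in the issue.
TOY_VOCAB = {"<|assistant|>": 32001, "  ": 259, " ": 29871, "\n": 13}
SPECIAL = "<|assistant|>"


def greedy_ids(fragment):
    """Tokenize plain text by longest match against the toy vocabulary."""
    pieces = sorted((p for p in TOY_VOCAB if p != SPECIAL), key=len, reverse=True)
    ids, i = [], 0
    while i < len(fragment):
        for piece in pieces:
            if fragment.startswith(piece, i):
                ids.append(TOY_VOCAB[piece])
                i += len(piece)
                break
        else:
            raise ValueError(f"no toy token for {fragment[i]!r}")
    return ids


def toy_tokenize(text, add_prefix_space=True):
    """Split on the special token; optionally prefix a space to what follows it."""
    ids, pos, after_special = [], 0, False
    while pos < len(text):
        if text.startswith(SPECIAL, pos):
            ids.append(TOY_VOCAB[SPECIAL])
            pos += len(SPECIAL)
            after_special = True
            continue
        end = text.find(SPECIAL, pos)
        if end == -1:
            end = len(text)
        fragment = text[pos:end]
        # Assumption taken from the issue text: the prefix space is added
        # to the fragment that follows a special token.
        if add_prefix_space and after_special:
            fragment = " " + fragment
        ids.extend(greedy_ids(fragment))
        pos, after_special = end, False
    return ids


print(toy_tokenize("<|assistant|> "))    # single space after the special token
print(toy_tokenize("<|assistant|>\n"))   # newline after the special token
print(toy_tokenize("<|assistant|>\n", add_prefix_space=False))
```

With the prefix enabled, the single space becomes "  " (259) and the newline is preceded by an extra " " (29871). With it disabled, the newline directly follows the special token.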

Is this behavior intended and correct?

When running Phi-3 and asking for a generation after <|assistant|>, the model insists on responding with a whitespace or with a combination token that starts with a whitespace. When add_prefix_space is disabled and a \n is added after <|assistant|>, the issue is resolved and Phi-3 responds right away with normal text.

Name and Version

ba58993

What operating system are you seeing the problem on?

Windows

Relevant log output

No response

cmp-nct, Jun 20 '24 01:06