llama.cpp
Strip trailing whitespace from prompt file
Many/most text editors save with trailing whitespace.
Bystander question: does trailing whitespace introduce additional tokens?
Maybe just remove a trailing \n and keep any spaces as-is?
@bengarney New lines do add tokens
~@j-f1 Spaces do not add tokens~
The one limitation I see here is that you cannot intentionally add trailing newlines at the end of a file (to introduce a new paragraph). I could solve that by only stripping a single trailing '\n' or '\r\n' (the one usually added by the editor) and leaving any additional whitespace untouched.
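A minimal sketch of the "strip only one trailing line ending" idea described above. The helper name `strip_one_trailing_newline` is hypothetical and this is not the code from the PR; it only illustrates the proposed behavior, assuming the prompt has already been read into a `std::string`.

```cpp
#include <string>

// Hypothetical helper (not taken from the llama.cpp source).
// Removes at most one trailing line ending, either "\n" or "\r\n",
// and leaves any other trailing whitespace untouched.
std::string strip_one_trailing_newline(std::string prompt) {
    if (!prompt.empty() && prompt.back() == '\n') {
        prompt.pop_back();
        // Only treat '\r' as part of the line ending if it preceded the '\n'.
        if (!prompt.empty() && prompt.back() == '\r') {
            prompt.pop_back();
        }
    }
    return prompt;
}
```

With this approach a file ending in two newlines keeps one of them, so an intentional paragraph break at the end of the prompt survives, while the single newline an editor typically appends is dropped.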
I don't think this is a good change. Machine Learning models should not teach people how to use text editors.
What if I want to keep whitespace in a prompt? This change makes it impossible.
And it's trivial to create a text file with no line endings at the end: echo foo > file.txt
Usually a single space does not add a token, because many tokens already contain the space (for example ' website'). A few extra spaces will still take only one token, because there are separate tokens for runs of 3, 4, or 5 spaces. But if you add a lot of spaces, they will add tokens.
For example, if I add a lot of spaces after "Building" and run make -j && ./main -m ./models/13B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -t 8 -n 512
I get:
1 -> ''
8893 -> 'Build'
292 -> 'ing'
462 -> ' '
462 -> ' '
462 -> ' '
462 -> ' '
462 -> ' '
462 -> ' '
462 -> ' '
9651 -> ' '
29874 -> 'a'
4700 -> ' website'
...
> I don't think this is a good change. Machine Learning models should not teach people how to use text editors.
> What if I want to keep whitespace in a prompt? This change makes it impossible.
> And it's trivial to create a text file with no line endings at the end: echo foo > file.txt
OK, I almost feel like you're being smart here. Once a prompt is long enough (several sentences or multiple lines), I'm not going to want to edit it straight on the CLI just to avoid the unintended line break. This isn't a polished consumer product, but a tiny quality-of-life feature seems justifiable.
@leszekhanusz Good point, hadn't thought of that (some tokenizers truly do ignore whitespace). Though trailing whitespace at the end of a text file is still usually not intended.
Maybe a --preserve-whitespace flag is the right call here? Or better yet, just allow taking input from stdin, so that if you don't want the whitespace stripped you can run cat blah.txt | main -f - instead. Having an unwanted newline added every single time you use -f makes the flag a lot less useful in my eyes.
I decided to drop the trailing newline from file prompts: 70f01cb8632f73b5cf70428608b89cd3c0775d23