llama.cpp icon indicating copy to clipboard operation
llama.cpp copied to clipboard

Strip trailing whitespace from prompt file

Open matthew-mcallister opened this issue 1 year ago • 6 comments

Many/most text editors save with trailing whitespace.

matthew-mcallister avatar Mar 13 '23 05:03 matthew-mcallister

Bystander question - does the trailing whitespace additional introduce tokens?

bengarney avatar Mar 13 '23 17:03 bengarney

Maybe just remove a trailing \n and keep any spaces as-is?

j-f1 avatar Mar 13 '23 17:03 j-f1

@bengarney New lines do add tokens

~@j-f1 Spaces do not add tokens~

The one limitation I see here is that you cannot intentionally add trailing newlines at the end of a file (to introduce a new paragraph). I could solve that by only stripping a single trailing '\n' or '\r\n' (the one usually added by the editor) and leaving any additional whitespace untouched.

matthew-mcallister avatar Mar 13 '23 17:03 matthew-mcallister

I don't think this is a good change. Machine Learning models should not teach people how to use text editors.

What if I want to keep whitespace in a prompt? This change makes it impossible.

And it's trivial to create a text file with no line endings at the end: echo foo > file.txt

prusnak avatar Mar 13 '23 17:03 prusnak

Usually single spaces do not add tokens because the space is inside a lot of tokens already. If you have a few more spaces, then it will take only one token because you have different tokens for 3,4, 5 spaces. But if you add a lot of spaces, it will add tokens.

If I add a lot of spaces after "Building", with make -j && ./main -m ./models/13B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -t 8 -n 512

I have:


  1 -> ''
  8893 -> 'Build'
   292 -> 'ing'
   462 -> '                '
   462 -> '                '
   462 -> '                '
   462 -> '                '
   462 -> '                '
   462 -> '                '
   462 -> '                '
  9651 -> '           '
  29874 -> 'a'
  4700 -> ' website'
...

leszekhanusz avatar Mar 13 '23 18:03 leszekhanusz

I don't think this is a good change. Machine Learning models should not teach people how to use text editors.

What if I want to keep whitespace in a prompt? This change makes it impossible.

And it's trivial to create a text file with no line endings at the end: echo foo > file.txt

OK, I almost feel like you're being smart here. Once a prompt is long enough (several sentences or multiple lines), I'm not going to want to edit it straight on the CLI just to avoid the unintended line break. This isn't a polished consumer product, but a tiny quality of life feature seems justifiable.

@leszekhanusz Good point, hadn't thought of that (some tokenizers truly do ignore whitespace). Though trailing whitespace at the end of a text file is still usually not intended.

Maybe a --preserve-whitespace flag is the right call here? Or better yet, just allow taking input from stdin so that if you don't want the whitespace stripped you can just do cat blah.txt | main -f -. Having an unwanted newline added every single time you use -f makes the flag a lot less useful in my eyes.

matthew-mcallister avatar Mar 13 '23 18:03 matthew-mcallister

@bengarney New lines do add tokens

~@j-f1 Spaces do not add tokens~

The one limitation I see here is that you cannot intentionally add trailing newlines at the end of a file (to introduce a new paragraph). I could solve that by only stripping a single trailing '\n' or '\r\n' (the one usually added by the editor) and leaving any additional whitespace untouched.

I decided to drop the trailing new line from file prompts: 70f01cb8632f73b5cf70428608b89cd3c0775d23

ggerganov avatar Mar 19 '23 17:03 ggerganov