
Unexpected shutdown when the number of tokens is large

HeMuling opened this issue • 2 comments

I found that the LLaMA-7B model shuts down unexpectedly when the number of tokens in the prompt reaches a certain value, approximately 500. This cannot be solved by setting the number of tokens to predict to a high value (e.g. 204800); see the note after the command below.

my initialization is:

./main -m ./models/7B/ggml-model-q4_0.bin \
-n 204800 \
-t 8 \
--repeat_penalty 1.0 \
--color -i \
-r "HeMuling:" \
--temp 1.0 \
-f ./models/p.txt
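
A note on the flags: -n maps to n_predict, the number of tokens to generate; it does not enlarge the model's context window, which at this commit is fixed at load time. A minimal sketch of the relevant argument handling, in the style of gpt_params_parse() in utils.cpp (paraphrased from memory, so treat the exact code as an assumption):

if (arg == "-n" || arg == "--n_predict") {
    // number of tokens to generate; does not change the context window
    params.n_predict = std::stoi(argv[++i]);
}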

where p.txt is a file containing some prompts; the reported prompt size is main: number of tokens in prompt = 486. The program shut down unexpectedly after a few interactions; the last output was:

Allice:like how big
HeMuling

main: mem per token = 14434244 bytes
main:     load time =  1400.10 ms
main:   sample time =    21.30 ms
main:  predict time = 79072.03 ms / 154.74 ms per token
main:    total time = 88429.08 ms

I am using an M1 Mac with 16GB of RAM.

I am wondering whether there is a limitation in the program or whether I did something wrong.

HeMuling commented Mar 14 '23

I have the same problem as you; here are my tests.

I changed the -n parameter to -n 1048: the output got almost ~50% longer, but it still could not generate long text. Then I changed it to -n 4096 and got almost the same length as with -n 1048.

Info:
 - CPU: 8 cores
 - RAM: 16 GB
 - Model: 7B
 - RAM used during generation: 4.6 GB

256 limit (almost 8 lines)

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - If you want to submit another line, end your input in '\'.
Here is a list of 100 sentences in the context of IT 1. IT is important is terms of technology 2. IT is important in terms of technology. 3. Information Technology is important in terms of technology. 4. Information Technology is important in terms of technology. 5. Information Technology is important in terms of technology. 6. Information Technology is important in terms of technology. 7. Information Technology is important in terms of technology. 8. Information Technology is important in terms of technology. 9. Information Technology is important in terms of technology. 10. Information Technology is important in terms of technology. 11. Information Technology is important in terms of technology. 12. Information Technology is important in terms of technology. 13. Information Technology is important in terms of technology. 14. Information Technology is important in terms of technology. 15. Information Technology is important in terms of technology. 16. Information Technology is important in terms of technology. 17. Information Technology is important in terms of technology. 18. Information Technology is important in terms of technology. 19. Information Technology is important in terms of technology. 20. Information Technology is important in terms of technology. 21. Information Technology is important in terms of technology. 22. Information Technology is important

main: mem per token = 14434244 bytes
main:     load time =  2390.74 ms
main:   sample time =   156.97 ms
main:  predict time = 58222.04 ms / 205.01 ms per token
main:    total time = 61601.12 ms

1048 limit (almost 12 lines)

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - If you want to submit another line, end your input in '\'.
Here is a list of 100 sentences in the context of IT 1. IT is important is terms of technology 2. IT is very important in terms of technology 3. IT is important in terms of technology. 4. IT is very important in terms of technology 5. IT is important in terms of technology 6. IT is very important in terms of technology 7. IT is important in terms of technology 8. IT is very important in terms of technology 9. IT is important in terms of technology 10. IT is very important in terms of technology 11. IT is important in terms of technology 12. IT is very important in terms of technology 13. IT is important in terms of technology 14. IT is very important in terms of technology 15. IT is important in terms of technology 16. IT is very important in terms of technology 17. IT is important in terms of technology 18. IT is very important in terms of technology 19. IT is important in terms of technology 20. IT is very important in terms of technology 21. IT is important in terms of technology 22. IT is very important in terms of technology 23. IT is important in terms of technology 24. IT is very important in terms of technology 25. IT is important in terms of technology 26. IT is very important in terms of technology 27. IT is important in terms of technology 28. IT is very important in terms of technology 29. IT is important in terms of technology 30. IT is very important in terms of technology 31. IT is important in terms of technology 32. IT is very important in terms of technology 33. IT is important in terms of technology 34. IT is very important in terms of technology 35. IT is important in terms of technology 36. IT is very important in terms of technology 37. IT is important in terms of technology 38. IT is very important in terms of technology 39. IT is important in terms of technology 40. IT is very important in terms of technology 41. IT is important in terms of technology 42. IT is very important in terms of technology 43. IT is important in terms of technology 44. IT is very important in terms

main: mem per token = 14434244 bytes
main:     load time =  2677.93 ms
main:   sample time =   287.36 ms
main:  predict time = 108363.11 ms / 212.06 ms per token
main:    total time = 112125.02 ms

I have also tried the 13B model: it is very slow, uses about 8 GB of RAM, and produced about 9 lines of output with -n 1048 and about 12 lines with -n 4048.

Khalilbz commented Mar 14 '23

Duplicate of #71

Depending on how much memory you have, you can increase the context size to get longer outputs. On a 64 GB machine I was able to run a 12k context with the 7B model and a 2k context with the 65B model. You can change it here.

Originally posted by @eous in https://github.com/ggerganov/llama.cpp/issues/71#issuecomment-1465496459
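
To put rough numbers on that claim: the context buffer (the attention key/value cache) grows linearly with the context size. Here is a back-of-the-envelope sketch, assuming the LLaMA-7B shape (32 layers, 4096-wide embeddings) and a 2-byte f16 cache; these constants are assumptions about the model, not values read from the code:

// Rough KV-cache estimate: one K and one V tensor per layer, each n_ctx * n_embd.
#include <cstdio>

int main() {
    const long long n_layer = 32;    // LLaMA-7B depth (assumed)
    const long long n_embd  = 4096;  // LLaMA-7B embedding width (assumed)
    const long long n_ctx   = 2048;  // context size under consideration
    const long long kv_bytes = 2 * n_layer * n_ctx * n_embd * 2;  // 2 bytes per f16 value
    printf("approx. KV cache: %.2f GB\n", kv_bytes / 1e9);  // ~1.07 GB at n_ctx = 2048
    return 0;
}

By this estimate each doubling of the context roughly doubles the buffer, which is the memory/context trade-off described above.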

AndrewKeYanzhe commented Mar 14 '23

Problem solved thanks to @AndrewKeYanzhe's help; here is the solution:

In main.cpp, change this line, which passes a hardcoded context size of 512 to llama_model_load (and explains the ~500-token limit): https://github.com/ggerganov/llama.cpp/blob/460c48254098b28d422382a2bbff6a0b3d7f7e17/main.cpp#L794

to the following (the number can be adjusted according to your RAM):

if (!llama_model_load(params.model, model, vocab, 2048)) {  // TODO: set context from user input ??

The program should then be able to keep a context of 2048 tokens.

then run in the terminal:

cd llama.cpp
make

to re-compile
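
For reference, here is a minimal sketch of making the context size a command-line option instead of a hardcoded constant, which is what the TODO comment hints at. The -c/--ctx_size flag and the n_ctx field are assumptions (a flag along these lines was added to llama.cpp later), not code present at this commit:

// 1. In the params struct (utils.h), add a field:
int32_t n_ctx = 512;  // context window; default matches the old hardcoded value

// 2. In the argument loop of gpt_params_parse() (utils.cpp):
if (arg == "-c" || arg == "--ctx_size") {
    params.n_ctx = std::stoi(argv[++i]);
}

// 3. In main.cpp, pass the field through instead of the literal:
if (!llama_model_load(params.model, model, vocab, params.n_ctx)) {

With that in place, the command could simply pass -c 2048 rather than requiring a rebuild for every context size.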

HeMuling commented Mar 14 '23