llama.cpp
LLM inference in C/C++
I've been testing your code from 1 to 8 threads and the output is always different. The speed does not depend on the number of threads. On the contrary, 4...
In this case the llama.cpp and the llama tokenizers produce different output: ``` main: prompt: 'This is 🦙.cpp' main: number of tokens in prompt = 10 1 -> '' 4013...
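One way to get a reference tokenization to compare against is to run the original tokenizer.model through the SentencePiece C++ library. The sketch below is not part of llama.cpp; it assumes that library is installed and that the model file sits at models/tokenizer.model (adjust the path as needed), and it does not add the leading BOS token that llama.cpp prepends.

```cpp
// Print the reference tokenization of a prompt using the original tokenizer.model,
// via the SentencePiece C++ library (https://github.com/google/sentencepiece).
// Illustration only, for comparing against llama.cpp's own tokenizer output.
#include <cstdio>
#include <string>
#include <vector>
#include <sentencepiece_processor.h>

int main() {
    sentencepiece::SentencePieceProcessor sp;
    if (!sp.Load("models/tokenizer.model").ok()) {   // path is an assumption
        fprintf(stderr, "failed to load tokenizer.model\n");
        return 1;
    }

    // Encode the same prompt used in the report; no BOS is added here,
    // so the id list may be offset by one relative to llama.cpp's output.
    const std::vector<int> ids = sp.EncodeAsIds("This is 🦙.cpp");

    printf("number of tokens = %zu\n", ids.size());
    for (const int id : ids) {
        printf("%d -> '%s'\n", id, sp.IdToPiece(id).c_str());
    }
    return 0;
}
```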
It's really annoying that I have to restart the program every time it quits on **[end of text]** or exceeds the context limit, as I need to reload the model, which is...
Hi team, I was playing with interactive mode for a couple of hours, pretty impressive besides what's mentioned in #145. It might not be too far off to plug this into an endpoint / functional...
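To illustrate what such an endpoint could look like, here is a minimal sketch that is not part of the repository: it assumes the single-header cpp-httplib library and simply shells out to the existing ./main binary, with the model path and flags as placeholders.

```cpp
// Minimal HTTP wrapper sketch around the ./main binary.
// Assumes cpp-httplib (https://github.com/yhirose/cpp-httplib); paths/flags are placeholders.
#include <cstdio>
#include <string>
#include "httplib.h"

static std::string run_main(const std::string & prompt) {
    // NOTE: naive quoting, illustration only -- do not expose this to untrusted input.
    std::string cmd = "./main -m ./models/7B/ggml-model-q4_0.bin -n 128 -p \"" + prompt + "\"";
    std::string out;
    if (FILE * pipe = popen(cmd.c_str(), "r")) {
        char buf[4096];
        while (fgets(buf, sizeof(buf), pipe)) {
            out += buf;
        }
        pclose(pipe);
    }
    return out;
}

int main() {
    httplib::Server svr;

    // POST the prompt as the raw request body; the generated text comes back as plain text.
    svr.Post("/completion", [](const httplib::Request & req, httplib::Response & res) {
        res.set_content(run_main(req.body), "text/plain");
    });

    svr.listen("127.0.0.1", 8080);
}
```

A proper integration would of course link against the inference code directly instead of re-loading the model on every request, which ties back to the reload complaint above.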
Thank you very much for making it possible to run the model on my MacBook Air M1. I've been testing various parameters and I'm happy even with the 7B model. However,...
As suggested in #146, we can save a lot of memory by using float16 instead of float32. I implemented the suggested changes and tested with the 7B and 13B...
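For a rough sense of why the change matters, here is a small standalone sketch (not the actual patch) that only computes the weight-storage footprint of the 7B and 13B models at float32 versus float16; the parameter counts are approximate.

```cpp
// Back-of-the-envelope weight storage: float32 (4 bytes/param) vs float16 (2 bytes/param).
// Parameter counts are approximate; this illustrates the saving, it is not the conversion code.
#include <cstdio>

int main() {
    const struct { const char * name; double params; } models[] = {
        { "7B",  7.0e9 },
        { "13B", 13.0e9 },
    };

    for (const auto & m : models) {
        const double gib_f32 = m.params * 4.0 / (1024.0 * 1024.0 * 1024.0);
        const double gib_f16 = m.params * 2.0 / (1024.0 * 1024.0 * 1024.0);
        printf("%-4s  f32: %6.1f GiB   f16: %6.1f GiB   saved: %6.1f GiB\n",
               m.name, gib_f32, gib_f16, gib_f32 - gib_f16);
    }
    return 0;
}
```

Halving the bytes per weight halves the resident size of the tensors, which is where the reported memory savings come from.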
I have no clue about this, but I saw that chatglm-6b was published, which should run on a CPU with 16 GB of RAM, albeit very slowly: https://huggingface.co/THUDM/chatglm-6b/tree/main Would it be possible to...
Hi, what models do I really need? I have these: is only the 7B folder necessary, for example? Does each model give different results? I don't understand if I need only one...