private-gpt
Time from prompt to response is too long! Very Slow?
So I set up on 128GB RAM and 32 cores. I also used Wizard Vicuna for the LLM. I noticed that no matter the parameter size of the model (7B, 13B, 30B, etc.), the prompt takes too long to generate a reply. I ingested a 4,000KB txt book, which took 6 minutes to ingest into the Chroma DB, so I was very happy that ingestion does not take too long anymore. Unfortunately, when I enter a prompt I have to wait almost a minute before I get a reply. My PC is very powerful, with plenty of free RAM and CPU, because nothing else is running.
How can we speed this up? There needs to be an option to allow more PC resources to be used to improve prompt speed!
The slow speed during interaction is mostly caused by the LLM. I can see that the default number of threads (param n_threads) for the LLM is 4.
match model_type:
    case "LlamaCpp":
        llm = LlamaCpp(model_path=model_path, n_ctx=model_n_ctx, callbacks=callbacks, verbose=False)
    case "GPT4All":
        llm = GPT4All(model=model_path, n_ctx=model_n_ctx, backend='gptj', callbacks=callbacks, verbose=False)
    case _default:
        print(f"Model {model_type} not supported!")
        exit()
You can increase the speed of your LLM by setting n_threads=16, or whatever higher value suits your machine, to speed up inferencing:
case "LlamaCpp":
    llm = LlamaCpp(model_path=model_path, n_ctx=model_n_ctx, callbacks=callbacks, verbose=False, n_threads=16)
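The same parameter can be passed in the GPT4All branch as well. A minimal sketch, assuming the langchain GPT4All wrapper in your installed version accepts n_threads (older langchain releases ignored it, as noted further down this thread):
case "GPT4All":
    llm = GPT4All(model=model_path, n_ctx=model_n_ctx, backend='gptj', callbacks=callbacks, verbose=False, n_threads=16)  # n_threads added here too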
Hi,
Could you let me know where I can download the correct version to run privateGPT?
Using embedded DuckDB with persistence: data will be stored in: db
gptj_model_load: loading model from 'models/Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_0.bin' - please wait ...
gptj_model_load: invalid model file 'models/Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_0.bin' (bad magic)
I trust I downloaded the wrong version.
Have you tried the vicuna model on the gpt4all page yet? https://gpt4all.io/index.html
Using embedded DuckDB with persistence: data will be stored in: db
gptj_model_load: loading model from 'models/ggml-vicuna-13b-1.1-q4_2.bin' - please wait ...
gptj_model_load: invalid model file 'models/ggml-vicuna-13b-1.1-q4_2.bin' (bad magic)
NO luck ...
@laihenyi See https://github.com/imartinez/privateGPT/issues/276#issuecomment-1554262627
Thanks ... Getting closer ... However, something still went wrong.
Using embedded DuckDB with persistence: data will be stored in: db
llama_model_load: loading model from 'models/ggml-vicuna-13b-1.1-q4_2.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 1000
llama_model_load: n_embd = 5120
llama_model_load: n_mult = 256
llama_model_load: n_head = 40
llama_model_load: n_layer = 40
llama_model_load: n_rot = 128
llama_model_load: f16 = 5
llama_model_load: n_ff = 13824
llama_model_load: n_parts = 2
llama_model_load: type = 2
llama_model_load: invalid model file 'models/ggml-vicuna-13b-1.1-q4_2.bin' (bad f16 value 5)
llama_init_from_file: failed to load model
Segmentation fault: 11
Any idea?
Sorry, not sure. I don't have that model to test, but the 13b-q5 works with MODEL_TYPE=LlamaCpp. It would also be good if you specified exactly what you did to get that output, and whether other models work for you.
Ingest is lightning fast now. Answering questions is much slower; it took about 4-5 minutes to answer a question. The answer is total nonsense: it starts by looking into my DND pdf but ends up talking about heart disease. WTH
PERSIST_DIRECTORY=db
MODEL_TYPE=LlamaCpp
MODEL_PATH=models/koala-7b.ggml.unquantized.pr613.bin
EMBEDDINGS_MODEL_NAME=all-MiniLM-L6-v2
MODEL_N_CTX=1000
Use n_threads and set it to 30 or something in the functions that have "embeddings"...
Can you explain in more detail? Where am I making these changes? Perhaps a screenshot or copy paste an example?
I haven't found any model that works with MODEL_TYPE=LlamaCpp on https://gpt4all.io/index.html. Can anyone suggest where to find a model that works with LlamaCpp?
My issue is resolved by following: https://github.com/imartinez/privateGPT/issues/220
I've tried it on an Intel(R) Xeon(R) E-2236 CPU @ 3.40GHz with 64G of RAM; it was faster than my initial testing on an I5 with 32G RAM. When it does answer, it answers well based on the union training text.
While OpenChatKit will run on a 4GB GPU (slowly!) and performs better on a 12GB GPU, I don't have the resources to train it on 8 x A100 GPUs. So I love the idea of this bot and how it can be easily trained from private data with low resources. I don't really care how long it takes to train, but I would like snappier answer times. I'm new to AI and Python, so I cannot contribute anything of real value yet, but I'm working on it!
It would be nice to train on CPU and run inference on GPU. Anyway, thanks for creating this!
Cheers
@jcrsantiago to add threads just change it in the privateGPT.py file:
llm = LlamaCpp(model_path=model_path, n_ctx=model_n_ctx, callbacks=callbacks, verbose=False, n_threads=8)
See n_threads= added at the end.
I got a not permitted error after adding n_threads
Find other LLM models here: https://huggingface.co/TheBloke/wizard-vicuna-13B-GGML/tree/main
I tested on a host with 72G RAM and 36 cores (AWS EC2 c5.9xlarge); ingesting is super fast. On inference, I did not get the dreaded unknown-token messages, so I assume those messages are the result of not enough memory. The .env file:
PERSIST_DIRECTORY=db
MODEL_TYPE=GPT4All
MODEL_PATH=models/ggml-gpt4all-j-v1.3-groovy.bin
EMBEDDINGS_MODEL_NAME=all-MiniLM-L6-v2
The first test was against the provided text in the repo with no other alterations using the default model. The second test was run setting the n_threads to 32.
llm = GPT4All(model=model_path, n_ctx=model_n_ctx, backend='gptj', callbacks=callbacks, verbose=False, n_threads=32)
The question for both tests was:
"how will inflation be handled?"
Test 1 time: 1 minute 57 seconds. Test 2 time: 1 minute 58 seconds.
The same question on an I5 with 32G of RAM: 1 minute 24 seconds
Performance seemed slower with more resources; not sure if that's because the larger host was a VM? The larger host streamed the answer faster than the I5 test but took longer to 'think' about it.
Is there some kind of internal constraint that could be adjusted?
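For comparisons like the ones above it helps to time each answer the same way. A rough sketch of a timing wrapper, my own addition: it assumes the qa chain and query variables from privateGPT.py's main loop, and the names are only illustrative:

import time

start = time.time()
res = qa(query)  # 'qa' and 'query' as used in privateGPT.py's prompt loop
answer = res['result']
print(f"Answered in {time.time() - start:.1f} s")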
Update to the latest version of langchain and the n_threads will be respected.
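If you are not sure which langchain your environment actually picked up after updating, a quick check of the package version (nothing privateGPT-specific, just standard package metadata):

# Print the installed langchain version; if it predates n_threads support,
# upgrade the package and re-run privateGPT.
import langchain
print(langchain.__version__)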
I can confirm similar behavior. I have a T14s with a Ryzen 7 (16 threads). By default, privateGPT utilizes 4 threads, and queries are answered in 180 s on average. With 8 threads they are answered in 90 s. With 12/16 threads it slows down by circa 20 seconds.
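That pattern (fastest around the physical core count, slower once the hyperthreads are used too) is common with llama.cpp-style CPU inference, because the extra logical threads mostly add contention. A small sketch for picking a starting value for n_threads; psutil is an optional extra dependency, and this heuristic is my assumption, not something privateGPT does itself:

# Suggest n_threads = number of physical cores, falling back to the logical count.
import os

try:
    import psutil  # optional; not a privateGPT dependency
    suggested = psutil.cpu_count(logical=False) or os.cpu_count()
except ImportError:
    suggested = os.cpu_count()

print(f"Suggested n_threads: {suggested}")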
How do you update langchain? Apparently I see no improvement by increasing the threads...
just do git pull
I can confirm the same. I'm running it on a physical server (Intel® Xeon(R) CPU E5-2620 0 @ 2.00GHz × 24, 16 GB RAM). It's very slow; it took 317.0 s to answer.
I'm attaching the terminal output in case it helps.
python3 privateGPT.py
Using embedded DuckDB with persistence: data will be stored in: db
Found model file at models/ggml-gpt4all-j-v1.3-groovy.bin
gptj_model_load: loading model from 'models/ggml-gpt4all-j-v1.3-groovy.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx = 2048
gptj_model_load: n_embd = 4096
gptj_model_load: n_head = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot = 64
gptj_model_load: f16 = 2
gptj_model_load: ggml ctx size = 5401.45 MB
gptj_model_load: kv self size = 896.00 MB
gptj_model_load: ................................... done
gptj_model_load: model size = 3609.38 MB / num tensors = 285
Interesting! Did you try to use various n_threads? As I mentioned above, for 8 threads it's 2x faster than default, but more threads are slowing down everything more and more...
I did use n_threads=20; here is my code:
match model_type:
    case "LlamaCpp":
        llm = LlamaCpp(model_path=model_path, n_ctx=model_n_ctx, n_batch=model_n_batch, callbacks=callbacks, verbose=False, n_threads=8)
    case "GPT4All":
        llm = GPT4All(model=model_path, n_ctx=model_n_ctx, backend='gptj', n_batch=model_n_batch, callbacks=callbacks, verbose=False, n_threads=8)
    case _default:
        # raise exception if model_type is not supported
        raise Exception(f"Model type {model_type} is not supported. Please choose one of the following: LlamaCpp, GPT4All")
After your comment I changed it to 8: the first answer took 219 s, the second answer took 677.69 s!
It seems that if n_threads > 8, the response time increases. I have the same issue here.
After adding n_threads to privateGPT.py, it seems I'm only able to get the response to my question after hitting Ctrl C.
I'm using a Mac.
This is my .env file.
By the way, I've tried using nous-hermes-llama2-13b.ggmlv3.q3_K_M.bin. After installing it, I couldn't get it to work. Any form of assistance would be really appreciated. 🙏🏾
From my research, I believe using nous-hermes-llama2-13b.ggmlv3.q3_K_M.bin would get me a quicker response, but I cannot figure out how to make it work. I'm referring to the .env file. A nudge in the right direction will be appreciated.
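Not a definitive answer, but based on the earlier comments in this thread (the bad-magic errors under GPT4All, and the note that llama-family GGML files load with MODEL_TYPE=LlamaCpp), a sketch of what that .env might look like; whether the q3_K_M quantization actually loads depends on the llama-cpp-python version you have installed:

PERSIST_DIRECTORY=db
MODEL_TYPE=LlamaCpp
MODEL_PATH=models/nous-hermes-llama2-13b.ggmlv3.q3_K_M.bin
EMBEDDINGS_MODEL_NAME=all-MiniLM-L6-v2
MODEL_N_CTX=1000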