private-gpt
CPU utilization
CPU utilization appears to be capped at 20%. Is there a way to increase CPU utilization and thereby enhance performance?
Hello, I assure you that you are not alone in this. It hardly uses any CPU or GPU, which is frustrating because it makes the process slow. It would be worth looking into whether there is a way to give the script more processing power.
It is not capped at 20%. I am successfully running it at 1600%. Please provide more information and I may be able to help.
Hello, to put it simply, when I ask a question to GPT, it takes ages to respond, and the processor or GPU is not being utilized at 100% or even 50%.
By default, the process will only use 4 threads. Try setting n_threads.
llm = GPT4All(model=model_path, n_threads=16, n_ctx=model_n_ctx, backend='llama', verbose=False)
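If you're not sure what number to pass, one option is to derive it from the machine itself. A minimal sketch, assuming the same langchain GPT4All wrapper this repo uses; model_path and model_n_ctx below are placeholder values standing in for what privateGPT.py normally reads from .env:
import os
from langchain.llms import GPT4All

model_path = "models/ggml-gpt4all-j-v1.3-groovy.bin"  # placeholder; normally read from .env
model_n_ctx = 1000                                     # placeholder for MODEL_N_CTX

n_threads = os.cpu_count() or 4  # logical cores reported by the OS; fall back to 4
llm = GPT4All(model=model_path, n_threads=n_threads, n_ctx=model_n_ctx, backend='llama', verbose=False)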
I have CPU @ 99% and 1600% in top.
Is this configuration located in the .env file?
In the query.py.
There is no such file in this repo. I've got the same issue with CPU utilisation.
In this repo it's privateGPT.py, line 38. Also, set the number to the number of cores you have.
I'm an end user trying to help. So chill habib. I'm doing this out of kindness. You can also search your repo...
Thanks for your contribution Jason - greatly appreciated.
In my case line 36 reads:
llm = GPT4All(model=model_path, n_ctx=model_n_ctx, backend='gptj', callbacks=callbacks, verbose=False)
So, I added the n_threads=12 parameter to line 36 (my CPU has 12 physical / 24 logical cores), and it now reads:
llm = GPT4All(model=model_path, n_threads=12, n_ctx=model_n_ctx, backend='gptj', verbose=False)
No complaints on startup:
Using embedded DuckDB with persistence: data will be stored in: db
Found model file.
gptj_model_load: loading model from 'models/ggml-gpt4all-j-v1.3-groovy.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx = 2048
gptj_model_load: n_embd = 4096
gptj_model_load: n_head = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot = 64
gptj_model_load: f16 = 2
gptj_model_load: ggml ctx size = 5401.45 MB
gptj_model_load: kv self size = 896.00 MB
gptj_model_load: ................................... done
gptj_model_load: model size = 3609.38 MB / num tensors = 285
Enter a query:
However, CPU utilization curiously remained the same at 20%.
Upon further investigation, using Resource Monitor, I noticed that 6 of the 24 logical cores are actually working very hard, whilst the others occasionally blip. Increasing or decreasing the n_threads value does not reflect any change to the number of cores showing activity.
It's as though the repo I'm using is ignoring the n_threads parameter altogether.
Have I implemented it incorrectly?
What is your version of langchain? Are you up to date on the repo? v 0.179 does not use all of the threads.
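One quick way to check which langchain you actually have installed:
pip show langchain
# or
python -c "import langchain; print(langchain.__version__)"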
The langchain version was 0.0.177, so I updated to the latest repo and in the process got langchain v0.197.
Ingest.py utilized 100% CPU but queries were still capped at 20% (6 virtual cores in my case).
However, when I added n_threads=24 to line 39 of privateGPT.py, CPU utilization shot up to 100% with all 24 virtual cores working :)
Line 39 now reads: llm = GPT4All(model=model_path, n_threads=24, n_ctx=model_n_ctx, backend='gptj', n_batch=model_n_batch, callbacks=callbacks, verbose=False)
Thanks for your help Jason :)
So, I've tested n_threads on AWS EC2, and so far the optimal value is 48. I don't understand why, but with 72 CPUs and 96 CPUs the response speed slowed down instead of increasing, even though CPU utilization can go up to 7000% and 9000% ... Any insights @SH1436?
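In case anyone wants to reproduce that kind of sweep, here is a rough sketch; it assumes the same langchain GPT4All wrapper, and the model path, prompt, and thread counts are just example values. One plausible (not verified) explanation for the slowdown is that beyond a certain point the run becomes memory-bandwidth bound rather than compute bound, so extra threads mostly add synchronization overhead.
import time
from langchain.llms import GPT4All

model_path = "models/ggml-gpt4all-j-v1.3-groovy.bin"  # example path; adjust to your model

# Time the same prompt at several thread counts and print the wall-clock time.
for n_threads in (12, 24, 48, 72, 96):
    llm = GPT4All(model=model_path, n_threads=n_threads, backend='gptj', verbose=False)
    start = time.time()
    llm("Summarise the ingested documents in one sentence.")
    print(f"n_threads={n_threads}: {time.time() - start:.1f}s")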
@sshu2017 Can you tell me what the average response time to a question is with this? Mine is close to 20-45 seconds on an N-series Azure VM. Also, the accuracy doesn't seem to be good. I'm on Windows 10; all the libraries and the setup worked as expected, no issues there. Am I missing something?
Hi @abhishekrai43, may I ask how you measured the accuracy? And does yours generate a full response?
Thanks in advance
@samanemami I got 5 people to ask it 50 questions. It came out to be close to 50-60%. No, it can cut the answer off whenever it wants. It prints context about three times the size of the answer, so I think the context is what eats up the token limit and cuts off the answer. Sometimes it takes 97 seconds to answer on a 16GB Windows machine. Is that to be expected?
Thanks @abhishekrai43
Yes, it is ~90 seconds, and I managed to reduce it to ~45 seconds with more threads.
I wanted to reduce the time further by trying various batch sizes, but changing the batch size terminates the process every time!
About the answer being cut off:
I did not understand; could you please explain it a bit more?
@samanemami The answer gets truncated (cut off before it finishes).
So have you found any approach to generate a full answer?
@samanemami Nopes
Hi @abhishekrai43, sorry for the late reply. With more threads, I can now get a response in ~30 seconds. It was ~150 seconds with everything at the default values. So it's a big improvement, but still not good enough.
llm = GPT4All(model=model_path, n_threads=24, n_ctx=model_n_ctx, backend='gptj', n_batch=model_n_batch, callbacks=callbacks, verbose=False)
In the line above, what values are you using for n_ctx and n_batch?
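For anyone unsure where those two values come from: they are read from the .env file in privateGPT.py, roughly as in the sketch below. This is not the exact repo code, and the 1000 and 8 fallbacks are assumptions on my part, so check your own example.env:
import os
from dotenv import load_dotenv

load_dotenv()  # picks up the .env file in the working directory
model_n_ctx = int(os.environ.get('MODEL_N_CTX', 1000))   # context window handed to GPT4All
model_n_batch = int(os.environ.get('MODEL_N_BATCH', 8))  # prompt-processing batch size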
Hello, to put it simply, when I ask a question to GPT, it takes ages to respond, and the processor or GPU is not being utilized at 100% or even 50%.
Hey, can I ask how you set up this CLI monitoring? I want to get that going on my PC.
Try increasing the value of the n_threads parameter. For example, if you have 8 cores and 2 threads per core, you can set it as high as 8*2=16 threads. Just don't give it all of them.
llm = GPT4All
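If you'd rather compute that limit on the box itself than count cores by hand, here is a small sketch; psutil is an extra dependency, and plain os.cpu_count() only gives the logical count:
import os
import psutil

physical = psutil.cpu_count(logical=False)                # e.g. 8 physical cores
logical = psutil.cpu_count(logical=True)                  # e.g. 16 with 2 threads per core
n_threads = max(1, (logical or os.cpu_count() or 4) - 2)  # leave a little headroom for the OS
print(f"physical={physical} logical={logical} -> n_threads={n_threads}")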
that's the spirit! Nice!
https://github.com/imartinez/privateGPT/pull/1589