
CPU utilization

Open SH1436 opened this issue 1 year ago • 11 comments

CPU utilization appears to be capped at 20%. Is there a way to increase CPU utilization and thereby enhance performance?

SH1436 avatar Jun 11 '23 06:06 SH1436

Hello, I assure you that you are not alone in this. This thing barely uses any CPU or GPU, which is frustrating because it makes the process slow. It would be worth checking whether there is a way to increase the resources the script uses.

Univers4craft avatar Jun 11 '23 13:06 Univers4craft

It is not capped at 20%. I am successfully running it at 1600%. Please provide more information and I may be able to help.

JasonMaggard avatar Jun 12 '23 14:06 JasonMaggard

Hello, to put it simply, when I ask GPT a question, it takes ages to respond, and the processor or GPU is not being utilized at 100% or even 50%.

[Screenshot: Capture d’écran du 2023-06-12 20-00-14]

Univers4craft avatar Jun 12 '23 18:06 Univers4craft

By default, the process will only use 4 threads. Try setting n_threads.

llm = GPT4All(model=model_path, n_threads=16, n_ctx=model_n_ctx, backend='llama', verbose=False)

I have CPU @ 99% and 1600% in top.

[Screenshot: 2023-06-12 at 2 46 53 PM]

JasonMaggard avatar Jun 12 '23 18:06 JasonMaggard
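Jason's fix, as a minimal sketch: pick a thread count from the logical cores the OS reports and pass it as `n_threads`. The model path, context size, and backend below are illustrative placeholders standing in for the thread's values, not verified against any particular langchain version, so the actual `GPT4All(...)` call is left commented out.

```python
import os

# GPT4All's default of 4 threads is what leaves most cores idle;
# os.cpu_count() reports logical cores (hyperthreads included).
n_threads = os.cpu_count() or 4

# Keyword arguments mirroring the line above; model path and n_ctx
# are placeholders, not values confirmed by this thread.
llm_kwargs = {
    "model": "models/ggml-gpt4all-j-v1.3-groovy.bin",
    "n_threads": n_threads,
    "n_ctx": 2048,
    "backend": "llama",
    "verbose": False,
}
# llm = GPT4All(**llm_kwargs)  # requires the model file to be present
print(llm_kwargs["n_threads"])
```

Leaving one or two cores unused (e.g. `n_threads = (os.cpu_count() or 4) - 2`) keeps the rest of the system responsive while a query runs.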

Is this configuration located in the .env file?

Univers4craft avatar Jun 12 '23 18:06 Univers4craft

In the query.py.

JasonMaggard avatar Jun 12 '23 19:06 JasonMaggard

> In the query.py.

There is no such file in this repo. I've got the same issue with CPU utilisation.

habib-the-sweet avatar Jun 12 '23 19:06 habib-the-sweet

In this repo it's privateGPT.py, line 38. Also, set the number to the number of cores you have.

I'm an end user trying to help. So chill, habib. I'm doing this out of kindness. You can also search your repo...

JasonMaggard avatar Jun 12 '23 20:06 JasonMaggard

Thanks for your contribution Jason - greatly appreciated.

In my case line 36 reads:

llm = GPT4All(model=model_path, n_ctx=model_n_ctx, backend='gptj', callbacks=callbacks, verbose=False)

So, I added the n_threads=12 parameter (12 physical and 24 virtual cores) to line 36 and it now reads:

llm = GPT4All(model=model_path, n_threads=12, n_ctx=model_n_ctx, backend='gptj', verbose=False)

No complaints on startup:

Using embedded DuckDB with persistence: data will be stored in: db
Found model file.
gptj_model_load: loading model from 'models/ggml-gpt4all-j-v1.3-groovy.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx = 2048
gptj_model_load: n_embd = 4096
gptj_model_load: n_head = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot = 64
gptj_model_load: f16 = 2
gptj_model_load: ggml ctx size = 5401.45 MB
gptj_model_load: kv self size = 896.00 MB
gptj_model_load: ................................... done
gptj_model_load: model size = 3609.38 MB / num tensors = 285

Enter a query:

However, CPU utilization curiously remained the same at 20%.

Upon further investigation, using Resource Monitor, I noticed that 6 of the 24 logical cores are actually working very hard, whilst the others occasionally blip. Increasing or decreasing the n_threads value does not reflect any change to the number of cores showing activity.

It's as though the repo I'm using is ignoring the n_threads parameter altogether.

Have I implemented it incorrectly?

SH1436 avatar Jun 13 '23 04:06 SH1436

What is your version of langchain? Are you up to date on the repo? v 0.179 does not use all of the threads.

JasonMaggard avatar Jun 13 '23 13:06 JasonMaggard

Langchain version was 0.0.177 so updated to the latest repo and in the process got langchain v 0.197.

Ingest.py utilized 100% CPU but queries were still capped at 20% (6 virtual cores in my case).

However, when I added n_threads=24, to line 39 of privateGPT.py CPU utilization shot up to 100% with all 24 virtual cores working :)

Line 39 now reads: llm = GPT4All(model=model_path, n_threads=24, n_ctx=model_n_ctx, backend='gptj', n_batch=model_n_batch, callbacks=callbacks, verbose=False)

Thanks for your help Jason :)

SH1436 avatar Jun 13 '23 20:06 SH1436
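For reference, privateGPT reads settings like `model_n_ctx` and `model_n_batch` from a .env file via python-dotenv. A sketch of the same idea using plain environment variables with fallback defaults; the names `MODEL_N_CTX` and `MODEL_N_BATCH` mirror privateGPT's example.env, while `N_THREADS` is a hypothetical addition for this fix:

```python
import os

# Fallback defaults are used when the variables are unset; the values
# here are illustrative, not prescriptive.
model_n_ctx = int(os.environ.get("MODEL_N_CTX", "2048"))
model_n_batch = int(os.environ.get("MODEL_N_BATCH", "8"))
n_threads = int(os.environ.get("N_THREADS", str(os.cpu_count() or 4)))

print(model_n_ctx, model_n_batch, n_threads)
```

Wiring `n_threads` through the environment like this avoids hard-coding a machine-specific value into privateGPT.py.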

So, I've tested n_threads on AWS EC2, and so far the optimal value is 48. I don't understand why, but with 72 CPUs and 96 CPUs the response speed slowed down instead of increasing, even though CPU utilization can go to 7000% and 9000%... Any insights @SH1436?

sshu2017 avatar Jun 24 '23 07:06 sshu2017
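sshu2017's observation that throughput peaks at 48 threads and then regresses is consistent with thread oversubscription (cache thrashing, NUMA traffic). One way to find the sweet spot empirically is a small timing harness; `make_llm` below is a hypothetical factory (n_threads in, an object with an `invoke(prompt)` method out), not an API from this thread:

```python
import time

def benchmark_thread_counts(make_llm, prompts, thread_counts):
    """Return mean seconds per prompt for each candidate thread count.

    make_llm is a hypothetical factory: n_threads -> LLM with .invoke(prompt).
    More threads is not always faster; past the hardware's sweet spot,
    contention makes each token slower.
    """
    means = {}
    for n in thread_counts:
        llm = make_llm(n)
        start = time.perf_counter()
        for prompt in prompts:
            llm.invoke(prompt)
        means[n] = (time.perf_counter() - start) / len(prompts)
    return means

def best_thread_count(means):
    # Smallest mean latency wins.
    return min(means, key=means.get)
```

Running this once with, say, `thread_counts=[24, 48, 72, 96]` and a couple of representative prompts would pin down the optimum for a given instance type.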

@sshu2017 Can you tell me what the average response time to a question is with this? I get close to 20-45 seconds on an N-series Azure VM. Also, the accuracy doesn't seem to be good. I'm on Windows 10; all the libraries and setup worked as expected, no issues there. Am I missing something?

abhishekrai43 avatar Jun 27 '23 08:06 abhishekrai43

Hi @abhishekrai43 May I ask how you measured the accuracy? And does yours generate a full response?

Thanks in advance

samanemami avatar Jun 27 '23 12:06 samanemami

@samanemami I got 5 people to ask it 50 questions. Accuracy came out to be close to 50-60%. No, it can cut the answer off whenever it wants. It prints context three times the size of the answer, so I think the context is what eats up the token limit and cuts off the answer. Sometimes it takes 97 seconds to answer on a 16GB Windows machine. Is that to be expected?

abhishekrai43 avatar Jun 27 '23 12:06 abhishekrai43

Thanks @abhishekrai43

Yes, it is about ~90 seconds, and I managed to reduce it to ~45 sec with more threads. I wanted to reduce the time further by varying the batch size, but changing the batch size terminates the process every time! I did not understand the part about the answer being cut off; could you please explain it a bit more?

samanemami avatar Jun 28 '23 07:06 samanemami

@samanemami truncated.

abhishekrai43 avatar Jun 28 '23 08:06 abhishekrai43

> @samanemami truncated.

So have you found any approach to generate a full answer?

samanemami avatar Jun 28 '23 10:06 samanemami

@samanemami Nopes

abhishekrai43 avatar Jun 28 '23 12:06 abhishekrai43

Hi @abhishekrai43 , sorry for the late reply. With more threads, now I can get a response in ~30 seconds. It was ~150 seconds with everything to the default values. So it's a big improvement but still not good enough.

sshu2017 avatar Jul 06 '23 17:07 sshu2017

llm = GPT4All(model=model_path, n_threads=24, n_ctx=model_n_ctx, backend='gptj', n_batch=model_n_batch, callbacks=callbacks, verbose=False)

In the above line what are the values for n_ctx and n_batch that you guys are using?

nishanth-k-10 avatar Aug 28 '23 05:08 nishanth-k-10

> Hello, to put it simply, when I ask a question to GPT, it takes ages to respond, and the processor or GPU is not being utilized at 100% or even 50%. [Screenshot: Capture d’écran du 2023-06-12 20-00-14]

Hey, can I ask how you are getting this CLI monitoring setup? I want to get that going on my PC.

mattehicks avatar Aug 28 '23 16:08 mattehicks
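The thread never names the tool, but the per-core view in those screenshots looks like htop (or plain top with the per-core display toggled), so that is an assumption. On Linux, the same per-core counters those tools render can be read directly:

```shell
# htop shows one bar per logical core, plus per-process CPU%; a busy
# multi-threaded process can report far more than 100% (e.g. 1600% = 16 cores).
#   sudo apt-get install -y htop && htop     # Debian/Ubuntu
# Plain top works too: run `top`, then press `1` for the per-core view.

# The raw per-core counters these tools read (Linux only):
grep '^cpu' /proc/stat
```

Each `cpuN` line in /proc/stat is one logical core's cumulative time in user/system/idle jiffies; monitors compute utilization from the deltas between samples.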

Try increasing the value of the n_threads parameter. For example, if you have 8 cores and 2 threads per core, you can set it as high as 8 * 2 = 16 threads. Just don't give it all of them.

nishanth-k-10 avatar Aug 30 '23 04:08 nishanth-k-10
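The arithmetic above (cores times threads per core) is exactly what `os.cpu_count()` reports, since it counts logical cores; the 8-core figure below is just the example from the comment, not a detected value:

```python
import os

# Example numbers from the comment above: 8 physical cores, SMT x2.
physical_cores = 8
threads_per_core = 2
max_threads = physical_cores * threads_per_core
print(max_threads)  # 16

# On the machine you are actually running on, the OS reports the
# logical-core count (cores x threads per core) directly:
print(os.cpu_count())
```

So rather than computing it by hand, `n_threads=os.cpu_count() - 2` (keeping a couple of cores free, per the "just don't give it all of them" advice) is a reasonable starting point.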

> llm = GPT4All

that's the spirit! Nice!

CRPrinzler avatar Jan 16 '24 09:01 CRPrinzler

https://github.com/imartinez/privateGPT/pull/1589

lolo9538 avatar Feb 13 '24 19:02 lolo9538