
Time from prompt to response is too long! Very Slow?

Open ekolawole opened this issue 1 year ago • 19 comments

So I set up on 128 GB RAM and 32 cores, and used Wizard Vicuna for the LLM. I noticed that no matter the parameter size of the model (7B, 13B, 30B, etc.), the prompt takes too long to generate a reply. I ingested a 4,000 KB txt book, which took 6 minutes to ingest into Chroma DB, so I was very happy that ingestion no longer takes too long. Unfortunately, when I enter a prompt I have to wait almost a minute before I get a reply. My PC is very powerful and has plenty of free RAM and CPU, because nothing else is running.

How can we speed this up? There needs to be an option to allow more PC resources to be used to improve prompt speed!

ekolawole avatar May 19 '23 21:05 ekolawole

The slow speed during interaction is mostly caused by the LLM. I can see that the default number of threads (param n_threads) for the LLM is 4.

match model_type:
    case "LlamaCpp":
        llm = LlamaCpp(model_path=model_path, n_ctx=model_n_ctx, callbacks=callbacks, verbose=False)
    case "GPT4All":
        llm = GPT4All(model=model_path, n_ctx=model_n_ctx, backend='gptj', callbacks=callbacks, verbose=False)
    case _default:
        # unsupported MODEL_TYPE: report it and stop
        print(f"Model {model_type} not supported!")
        exit()

You can increase the speed of your LLM by passing n_threads=16, or however many threads you want to use for inference:

 case "LlamaCpp":
        llm = LlamaCpp(model_path=model_path, n_ctx=model_n_ctx, callbacks=callbacks, verbose=False, n_threads=16)

tanhm12 avatar May 20 '23 04:05 tanhm12
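As a variation on hard-coding the value, here is a minimal sketch (not from the repo) that derives the thread count from the machine; it assumes the same model_path, model_n_ctx and callbacks variables that privateGPT.py already defines:

import os

from langchain.llms import LlamaCpp

# Physical cores usually matter more than hyper-threaded logical cores for
# llama.cpp-style inference, so half of os.cpu_count() is a reasonable starting point.
n_threads = max(1, (os.cpu_count() or 8) // 2)
llm = LlamaCpp(model_path=model_path, n_ctx=model_n_ctx, callbacks=callbacks,
               verbose=False, n_threads=n_threads)

From there, benchmark a couple of values around that number, since (as later comments in this thread show) more threads is not always faster.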


Hi,

Could you let me know where I can download the correct version to run privateGPT?

Using embedded DuckDB with persistence: data will be stored in: db
gptj_model_load: loading model from 'models/Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_0.bin' - please wait ...
gptj_model_load: invalid model file 'models/Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_0.bin' (bad magic)

I suspect I downloaded the wrong version.

laihenyi avatar May 20 '23 07:05 laihenyi


Have you tried the vicuna model on the gpt4all page yet? https://gpt4all.io/index.html

tanhm12 avatar May 20 '23 08:05 tanhm12

> Have you tried the vicuna model on the gpt4all page yet? https://gpt4all.io/index.html

Using embedded DuckDB with persistence: data will be stored in: db
gptj_model_load: loading model from 'models/ggml-vicuna-13b-1.1-q4_2.bin' - please wait ...
gptj_model_load: invalid model file 'models/ggml-vicuna-13b-1.1-q4_2.bin' (bad magic)

NO luck ...

laihenyi avatar May 20 '23 09:05 laihenyi

@laihenyi See https://github.com/imartinez/privateGPT/issues/276#issuecomment-1554262627

PulpCattel avatar May 20 '23 09:05 PulpCattel
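For context on the error above: gptj_model_load is the GPT-J loader that runs when MODEL_TYPE=GPT4All with backend='gptj', and "bad magic" just means the file header is not what that loader expects. Wizard-Vicuna is a llama-family GGML file, so it presumably needs MODEL_TYPE=LlamaCpp (and a model file in a format the installed llama.cpp build understands) rather than the GPT-J path.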

> @laihenyi See #276 (comment)

Thanks ... getting closer. However, something still went wrong.

Using embedded DuckDB with persistence: data will be stored in: db
llama_model_load: loading model from 'models/ggml-vicuna-13b-1.1-q4_2.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 1000
llama_model_load: n_embd  = 5120
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 40
llama_model_load: n_layer = 40
llama_model_load: n_rot   = 128
llama_model_load: f16     = 5
llama_model_load: n_ff    = 13824
llama_model_load: n_parts = 2
llama_model_load: type    = 2
llama_model_load: invalid model file 'models/ggml-vicuna-13b-1.1-q4_2.bin' (bad f16 value 5)
llama_init_from_file: failed to load model
Segmentation fault: 11

Any idea?

laihenyi avatar May 20 '23 11:05 laihenyi

Sorry, not sure. I don't have that model to test, but the 13b-q5 works with MODEL_TYPE=LlamaCpp. It would also help if you specified exactly what you did to get that output, and whether other models work for you.

PulpCattel avatar May 20 '23 11:05 PulpCattel

Ingest is lightning fast now, but answering questions is much slower; it took about 4-5 minutes to answer a question. The answer is total nonsense: it starts by looking into my DnD PDF but ends up talking about heart disease. WTH

PERSIST_DIRECTORY=db
MODEL_TYPE=LlamaCpp
MODEL_PATH=models/koala-7b.ggml.unquantized.pr613.bin
EMBEDDINGS_MODEL_NAME=all-MiniLM-L6-v2
MODEL_N_CTX=1000

jcrsantiago avatar May 20 '23 20:05 jcrsantiago
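For readers wondering where those settings are consumed: a rough sketch of how the old privateGPT.py picks up the .env values (variable names follow the repo's conventions but are assumed here, not quoted):

import os
from dotenv import load_dotenv

load_dotenv()  # reads PERSIST_DIRECTORY, MODEL_TYPE, MODEL_PATH, ... into the environment
model_type = os.environ.get("MODEL_TYPE")
model_path = os.environ.get("MODEL_PATH")
model_n_ctx = int(os.environ.get("MODEL_N_CTX", "1000"))
embeddings_model_name = os.environ.get("EMBEDDINGS_MODEL_NAME")

So switching models or context sizes is a matter of editing .env, while the thread count has to be changed in the LLM constructor call itself.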

Use n_threads and set it to 30 or something in the functions that have "embeddings"...

maozdemir avatar May 21 '23 11:05 maozdemir

> Use n_threads and set it to 30 or something in the functions that have "embeddings"...

Can you explain in more detail? Where do I make these changes? Perhaps a screenshot, or copy-paste an example?

jcrsantiago avatar May 21 '23 16:05 jcrsantiago


I haven't found any model on https://gpt4all.io/index.html that works with MODEL_TYPE=LlamaCpp. Can anyone suggest where to find a model that works with LlamaCpp?

lilinwang avatar May 22 '23 05:05 lilinwang


My issue was resolved by following https://github.com/imartinez/privateGPT/issues/220

lilinwang avatar May 22 '23 05:05 lilinwang

I've tried it on an Intel(R) Xeon(R) E-2236 CPU @ 3.40GHz with 64 GB of RAM; it was faster than my initial testing on an i5 with 32 GB of RAM. When it does answer, it answers well based on the union training text.

While OpenChatKit will run on a 4 GB GPU (slowly!) and performs better on a 12 GB GPU, I don't have the resources to train it on 8x A100 GPUs. So I love the idea of this bot and how easily it can be trained on private data with low resources. I don't really care how long it takes to train, but I would like snappier answer times. I'm only new to AI and Python, so I can't contribute anything of real value yet, but I'm working on it!

It would be nice to train on CPU and run inference on GPU. Anyway, thanks for creating this!

cheers

darrinh avatar May 22 '23 07:05 darrinh

@jcrsantiago to add threads, just change it in the privateGPT.py file:

llm = LlamaCpp(model_path=model_path, n_ctx=model_n_ctx, callbacks=callbacks, verbose=False, n_threads=8)

See n_threads= added at the end.

PulpCattel avatar May 22 '23 08:05 PulpCattel
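The GPT4All branch of the same match block accepts the keyword too, as later comments in this thread show; a sketch using the same placeholder variables:

llm = GPT4All(model=model_path, n_ctx=model_n_ctx, backend='gptj',
              callbacks=callbacks, verbose=False, n_threads=8)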


I got a "not permitted" error after adding n_threads.

zaramal avatar May 22 '23 11:05 zaramal

https://huggingface.co/TheBloke/wizard-vicuna-13B-GGML/tree/main

Find other LLM models here

ekolawole avatar May 23 '23 04:05 ekolawole

I tested on a host (72 GB RAM, 36 cores, AWS EC2 c5.9xlarge); ingesting is super fast. On inference, I did not get the dreaded unknown-token messages, so I assume those messages are the result of not having enough memory. The .env file:

PERSIST_DIRECTORY=db
MODEL_TYPE=GPT4All
MODEL_PATH=models/ggml-gpt4all-j-v1.3-groovy.bin
EMBEDDINGS_MODEL_NAME=all-MiniLM-L6-v2

The first test was against the provided text in the repo, with no other alterations, using the default model. The second test was run with n_threads set to 32:

llm = GPT4All(model=model_path, n_ctx=model_n_ctx, backend='gptj', callbacks=callbacks, verbose=False,n_threads=32)

The question for both tests was:

"how will inflation be handled?"

Test 1 time: 1 minute 57 seconds
Test 2 time: 1 minute 58 seconds

The same question on an i5 with 32 GB of RAM: 1 minute 24 seconds.

Performance seemed slower with more resources; I'm not sure if it's because the larger host was a VM. The larger host streamed the answer faster than the i5 but took longer to 'think' about it.

Is there some kind of internal constraint that could be adjusted?

darrinh avatar May 24 '23 03:05 darrinh

Update to the latest version of langchain and the n_threads will be respected.

JasonMaggard avatar Jun 07 '23 18:06 JasonMaggard

I can confirm similar behavior. I have a T14s with a Ryzen 7 (16 threads). By default, privateGPT utilizes 4 threads, and queries are answered in 180 s on average. With 8 threads they are answered in 90 s. With 12/16 threads it slows down by circa 20 seconds.

jaceksan avatar Jun 13 '23 13:06 jaceksan
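A plausible explanation for that pattern (an assumption, not something measured in this thread): a T14s Ryzen 7 exposes 16 threads on 8 physical cores, and ggml inference is largely memory-bandwidth-bound, so threads beyond the physical core count mostly contend for the same bandwidth and add synchronization overhead instead of speed.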

> Update to the latest version of langchain and the n_threads will be respected.

How do you update it? Apparently I get no improvement from increasing the threads...

massigarg avatar Jun 30 '23 18:06 massigarg

> How do you update it?

just do git pull

maozdemir avatar Jun 30 '23 20:06 maozdemir
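Note that git pull only updates the privateGPT code itself; to actually pick up a newer langchain you presumably also need to reinstall the pinned dependencies afterwards (for example pip install -r requirements.txt in the same environment), or upgrade langchain directly.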

I confirm the same. I'm running it on a physical server (Intel Xeon CPU E5-2620 0 @ 2.00 GHz x 24, 16 GB RAM). It's very slow: it took 317.0 s to answer.

I'm attaching the terminal output in case it helps.

python3 privateGPT.py
Using embedded DuckDB with persistence: data will be stored in: db
Found model file at models/ggml-gpt4all-j-v1.3-groovy.bin
gptj_model_load: loading model from 'models/ggml-gpt4all-j-v1.3-groovy.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx   = 2048
gptj_model_load: n_embd  = 4096
gptj_model_load: n_head  = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot   = 64
gptj_model_load: f16     = 2
gptj_model_load: ggml ctx size = 5401.45 MB
gptj_model_load: kv self size  = 896.00 MB
gptj_model_load: ................................... done
gptj_model_load: model size = 3609.38 MB / num tensors = 285


msatti avatar Jul 12 '23 05:07 msatti


Interesting! Did you try various n_threads values? As I mentioned above, with 8 threads it's 2x faster than the default, but more threads slow everything down more and more...

jaceksan avatar Jul 12 '23 08:07 jaceksan


I did use n_threads=20; here is my code:

match model_type:
    case "LlamaCpp":
        llm = LlamaCpp(model_path=model_path, n_ctx=model_n_ctx, n_batch=model_n_batch, callbacks=callbacks, verbose=False, n_threads=8)
    case "GPT4All":
        llm = GPT4All(model=model_path, n_ctx=model_n_ctx, backend='gptj', n_batch=model_n_batch, callbacks=callbacks, verbose=False, n_threads=8)
    case _default:
        # raise exception if model_type is not supported
        raise Exception(f"Model type {model_type} is not supported. Please choose one of the following: LlamaCpp, GPT4All")

After your comment I changed it to 8: the first answer took 219 s, and the second answer took 677.69 s!

msatti avatar Jul 12 '23 11:07 msatti

It seems that if n_threads > 8, the response time increases. I have the same issue here.

bxdoan avatar Aug 15 '23 12:08 bxdoan

After adding n_threads to privateGPT.py, it seems I'm only able to get the response to my question after hitting Ctrl+C. I'm using a Mac. (Screenshot attached.)

This is my .env file (screenshot attached).

By the way, I've tried using nous-hermes-llama2-13b.ggmlv3.q3_K_M.bin. After installing it, I couldn't get it to work. Any form of assistance would be really appreciated.🙏🏾

oluwabunmifife avatar Apr 02 '24 17:04 oluwabunmifife

From my research, I believe using nous-hermes-llama2-13b.ggmlv3.q3_K_M.bin would get me a quicker response, but I can't figure out how to make it work; I'm referring to the .env file. A nudge in the right direction would be appreciated.

oluwabunmifife avatar Apr 02 '24 17:04 oluwabunmifife
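For what it's worth, a hedged sketch of what the .env could look like for that model, assuming the GGML file sits under models/, that MODEL_TYPE=LlamaCpp is the right loader for a llama-2-family GGMLv3 file, and that the installed llama-cpp-python is old enough to still read GGML (newer builds expect GGUF):

PERSIST_DIRECTORY=db
MODEL_TYPE=LlamaCpp
MODEL_PATH=models/nous-hermes-llama2-13b.ggmlv3.q3_K_M.bin
EMBEDDINGS_MODEL_NAME=all-MiniLM-L6-v2
MODEL_N_CTX=1000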