private-gpt
VERY BIG performance improvement and beautiful features
- Fixed an issue that made evaluation of the user's input prompt extremely slow; this brings a huge performance increase, roughly 5-6 times faster.
- Added a script to install CUDA-accelerated requirements
- Added the OpenAI model (it may fall outside the scope of this repository, so I can remove it if necessary)
- Added some additional flags in the .env
- Changed the embedder to a better-performing one
- Bumped some versions, like llama-cpp-python and langchain (WARNING: with the new version of llama.cpp, models are more capable, but old model files are now completely incompatible; if necessary, you can downgrade)
- Removed the state-of-the-union example file, because someone who doesn't notice it might leave it there and it would interfere with their queries
- Added auto translation (perhaps it should be removed, because it uses an Internet connection)
How does adding n_gpu_layers and use_mlock help performance?
llm = LlamaCpp(model_path=model_path, n_ctx=model_n_ctx, callbacks=callbacks, verbose=False, n_gpu_layers=n_gpu_layers, use_mlock=use_mlock, top_p=0.9, n_batch=1024)
If the user has an Nvidia GPU, part of the model is offloaded to the GPU, which speeds things up. use_mlock locks the model in RAM so it is never swapped out to disk, which also helps performance. I added them as options in the .env precisely to let the user choose whether they want these improvements or not.
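For illustration, here is a minimal sketch of how those two options might be read from the .env and passed through to LlamaCpp. The N_GPU_LAYERS and USE_MLOCK variable names (and the default model path) are assumptions for the example, not necessarily the exact names used in this PR:

import os
from dotenv import load_dotenv
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.llms import LlamaCpp

load_dotenv()
model_path = os.environ.get("MODEL_PATH", "models/ggml-vicuna-13B-1.1-q5_1.bin")
model_n_ctx = int(os.environ.get("MODEL_N_CTX", "1000"))
# assumed .env entries for this sketch:
#   N_GPU_LAYERS=12   -> number of transformer layers to offload to the GPU (0 = CPU only)
#   USE_MLOCK=true    -> lock the model in RAM so it is never swapped out to disk
n_gpu_layers = int(os.environ.get("N_GPU_LAYERS", "0"))
use_mlock = os.environ.get("USE_MLOCK", "false").lower() == "true"

callbacks = [StreamingStdOutCallbackHandler()]
llm = LlamaCpp(model_path=model_path, n_ctx=model_n_ctx, callbacks=callbacks, verbose=False,
               n_gpu_layers=n_gpu_layers, use_mlock=use_mlock, top_p=0.9, n_batch=1024)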
Thanks for the contribution! Looking forward to merging this! Some comments/questions:
- Could you elaborate on the "issue that made the prompt evaluation slow" that you fixed?
- I'd leave OpenAI out of this repo for the moment. It doesn't fit well with its very purpose.
- I guess some updates to the readme would be required.
Thanks again!
Could you elaborate on the "issue that made the prompt evaluation slow" that you fixed?
I had also opened an issue about it (#493). Basically, after a while of debugging, I ran the code in debug mode and analysed line by line where the program got stuck for so long, and after quite a while I discovered that when the langchain library made the call to LlamaCpp, it passed it the parameter n_batch=8.
For those who don't know, n_batch indicates how many tokens at a time are processed by the llama context. With it set so low, a 1000-token context required a loop of 1000/8 = 125 iterations, which is extremely slow and heavy. In fact, the default value used in the main llama.cpp repository is 512. After some experimentation, I noticed that 1024 seems a good value.
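To put numbers on that (just arithmetic, following the explanation above that the prompt is ingested in chunks of n_batch tokens, each chunk carrying a fixed per-call overhead):

import math

# Rough count of evaluation calls needed to ingest a prompt: each call
# processes at most n_batch tokens.
prompt_tokens = 1000
for n_batch in (8, 512, 1024):
    calls = math.ceil(prompt_tokens / n_batch)
    print(f"n_batch={n_batch:4d} -> ~{calls} calls to process a {prompt_tokens}-token prompt")
# n_batch=   8 -> ~125 calls
# n_batch= 512 -> ~2 calls
# n_batch=1024 -> ~1 call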
I'd leave OpenAI out of this repo for the moment. It doesn't fit well with its very purpose.
Yes, this is perfectly fine; in fact I even wrote that above. I simply left it in because I was getting poor performance with llama 13b and wanted to test whether GPT-3.5 was better.
In the end I solved it by using vicuna 13b 1.1 q5_1.
I guess some updates to the readme would be required.
Yes, I think so too. If you want, I can do it, but my English is not so good haha.
@imartinez since I understand that many people may not have sufficient computing power to run this code, if you want I can create a new branch in my fork and leave OpenAI enabled. What do you think?
for reference, after running the requirements, I still had to install the following (on a clean environment):
- python -m pip install python-dotenv
- pip install tqdm
- pip install langchain
- pip install chromadb
- pip install sentence_transformers
- pip install llama-cpp-python
the last resulted in:
nvcc fatal : Value 'native' is not defined for option 'gpu-architecture'
Running an i5 / 32GB RAM / Nvidia Titan 12GB VRAM.
nvcc --list-gpu-arch:
compute_35 compute_37 compute_50 compute_52 compute_53 compute_60 compute_61 compute_62 compute_70 compute_72 compute_75 compute_80 compute_86 compute_87
Did you use the bash script?
If so, before starting the script you must execute:
source ./venv/bin/activate
to activate the local environment.
Some may use conda so it would go:
#!/bin/bash
export LLAMA_CUBLAS=1
source ~/anaconda3/bin/activate
# check if venv virtual env exists
if conda info --envs | grep -q "venv"
then
    echo "env already exists"
    conda activate /usr/local/anaconda3/envs/venv
else
    conda create -y -n "venv"
    conda activate /usr/local/anaconda3/envs/venv
    pip3 install -r requirements.txt
fi
echo "Done! Active envs:"
conda info --envs
@DanielusG I added n_batch=2000 and the performance increase was phenomenal! You are right and I am blown away. The prompt eval time moved from this:
llama_print_timings: prompt eval time = 24568.61 ms / 650 tokens ( 37.80 ms per token)
to this:
llama_print_timings: prompt eval time = 3349.03 ms / 651 tokens ( 5.14 ms per token)
If I am not mistaken, that's about an 87% reduction in time. It moved from 24 seconds to 3 seconds.
Just to clarify, for those who may be interested in performance relative to the actual graphics card: I am using a 12GB 3060.
@imartinez This is definitely something that should be looked at. This makes it so much more usable, at a different level.
If I am not mistaken, that's about an 87% reduction in time. It moved from 24 seconds to 3 seconds.
Yes, you are right, but put that way it doesn't quite convey the scale: it's not just an 87.5% reduction, it's about 8 times faster, meaning it takes roughly one eighth of the time it did before. That gives a better sense of it 😄
Haha, fair enough. It's true, the percentage doesn't give the real feel of it; it really is that drastic a difference. I also realized, as you say, that it doesn't need to be as high as I set it: I achieved a similar speed with n_batch=1024 vs n_batch=2000.
Oh, this also makes it possible to use 13B quantized models. It is slightly slower, by a second or two, than the 7B I was using before, but of course it gives even better answers.
Thanks for the detailed info @DanielusG! I'll be running some more tests before merging, feel free to keep it as a branch on your repo and evolve it further, we'll definitely be using your contributions down the line! My Mac M1 crashes with n_batch > 16... so the limitation for certain computers is quite real. I will be making the readme more informational so different users can optimize for their use cases and machines.
@imartinez is it possible that it crashes due to the small amount of RAM available? I guess the M1 has 8, or at most 16, GB of RAM if it's the laptop. On my laptop (Arch Linux), vicuna 13b 1.1 q5_1 uses 21GB of RAM.
And as I also wrote in the description, increasing that value can result in high resource usage.
In any case, is the translator I integrated OK? I guess it goes against the purpose of this repository, since it uses Google Translate and therefore a connection (consider that the user can choose to disable this feature from the .env), so if I have to revert it, I'll do it without any problem!
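For context, an optional translation step of this kind could look roughly like the sketch below. The deep-translator library and the TRANSLATE flag name are assumptions for illustration, not necessarily what the PR actually uses:

import os
from deep_translator import GoogleTranslator

def maybe_translate(text: str, target: str) -> str:
    # Only go online when the (assumed) TRANSLATE flag is enabled in the .env;
    # otherwise return the text untouched and stay fully offline.
    if os.environ.get("TRANSLATE", "false").lower() != "true":
        return text
    return GoogleTranslator(source="auto", target=target).translate(text)

user_query = "Come si combatte l'inflazione?"
query_for_llm = maybe_translate(user_query, "en")    # prompt translated to English
answer = "One way to fight inflation is ..."         # placeholder for the LLM's answer
answer_for_user = maybe_translate(answer, "it")      # answer translated back to the user's language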
Did you use the bash script?
If so, before starting the script you must execute:
source ./venv/bin/activate to activate the local environment.
I didn't use the script directly, my steps were:
git clone https://github.com/imartinez/privateGPT.git
export LLAMA_CUBLAS=1
python -m venv create privateGPT
source privateGPT/bin/activate
then I needed to do: pip install llama-cpp-python, which resulted in: nvcc fatal : Value 'native' is not defined for option 'gpu-architecture'
I should say it's on Ubuntu 22.04.
thanks
#!/bin/bash
export LLAMA_CUBLAS=1
source ~/anaconda3/bin/activate
# check if venv virtual env exists
if conda info --envs | grep -q "venv"
then
    echo "env already exists"
    conda activate /usr/local/anaconda3/envs/venv
else
    conda create -y -n "venv"
    conda activate /usr/local/anaconda3/envs/venv
    pip3 install -r requirements.txt
fi
echo "Done! Active envs:"
conda info --envs
Both the above and running ./install_cuda.sh result in the same error. I still get "nvcc fatal : Value 'native' is not defined for option 'gpu-architecture'", so there must be some missing library or similar causing the issue in my environment?
full trace -> https://pastebin.com/iBaAhX7n
i5/32GB/Nvidia Titan 12GB, thanks
this is where it fails:
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -c llama.cpp -o llama.o
cc -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -c ggml.c -o ggml.o
nvcc --forward-unknown-to-host-compiler -arch=native -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_DMMV_Y=1 -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -Wno-pedantic -c ggml-cuda.cu -o ggml-cuda.o
nvcc fatal : Value 'native' is not defined for option 'gpu-architecture'
make: *** [Makefile:147: ggml-cuda.o] Error 1
ninja: build stopped: subcommand failed.
OK, installing the latest Nvidia toolkit (12.1) allowed llama-cpp-python to build correctly; it seems the Ubuntu packages are somewhat out of date.
I also had to edit /etc/security/limits.conf to raise the memlock limit.
OK, got it working with n_batch 2000. Not as fast as a previous poster, but better than before:
Using embedded DuckDB with persistence: data will be stored in: db
llama.cpp: loading model from models/ggml-vicuna-13B-1.1-q5_1.bin.3
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 4096
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 9 (mostly Q5_1)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.09 MB
llama_model_load_internal: mem required = 8636.08 MB (+ 1608.00 MB per state)
llama_model_load_internal: [cublas] offloading 12 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 2722 MB
....................................................................................................
llama_init_from_file: kv self size = 3200.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
Enter a query: how do we fight inflation?
One way to fight inflation is by building a better America through investments in infrastructure, education, and job training programs that increase the productive capacity of our economy, as discussed in the given text. Another way is by cutting costs and making it easier for businesses to produce goods and services efficiently, as mentioned in the plan proposed by the
llama_print_timings: load time = 17124.50 ms
llama_print_timings: sample time = 40.61 ms / 73 runs ( 0.56 ms per token)
llama_print_timings: prompt eval time = 17124.31 ms / 1031 tokens ( 16.61 ms per token)
llama_print_timings: eval time = 47569.89 ms / 72 runs ( 660.69 ms per token)
llama_print_timings: total time = 67293.34 ms
where the previous result had:
llama_model_load_internal: [cublas] offloading 12 layers to GPU llama_model_load_internal: [cublas] total VRAM used: 2722 MB
Is it GPU dependent? The card used here has 12GB; would a larger GPU help? Can it run across multiple GPUs?
Glad to hear you managed to get it working. Why don't you increase the number of offloaded layers? With 12 GB of VRAM you can load most of the model into VRAM, and it should give you more performance.
As for using multiple GPUs, I don't think it's feasible for now; you should have a look at the llama.cpp project and see if there are any updates there. llama-cpp-python is just the Python binding; the rest is done by llama.cpp.
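As a rough guide, the logs above show about 2722 MB of VRAM for 12 offloaded layers of this 13B Q5_1 model (roughly 227 MB per layer), so you can sketch a layer count from a VRAM budget. This is only an estimate under those assumptions; real usage also depends on the model and on scratch/KV buffers:

def estimate_gpu_layers(vram_free_mb: int,
                        mb_per_layer: float = 2722 / 12,  # ratio observed in the logs above
                        total_layers: int = 40,           # a 13B llama model has 40 layers (see n_layer in the log)
                        headroom_mb: int = 1500) -> int:
    # Leave some headroom for scratch buffers, then offload as many layers as fit.
    usable = max(vram_free_mb - headroom_mb, 0)
    return min(total_layers, int(usable // mb_per_layer))

print(estimate_gpu_layers(12000))  # -> 40: the whole 13B Q5_1 model fits on a 12 GB card
print(estimate_gpu_layers(6000))   # -> 19 layers on a 6 GB card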
thanks DanielusG, I tried increasing the layers, the timings didn't change much
24 layers
llama_print_timings: load time = 16568.45 ms
llama_print_timings: sample time = 36.19 ms / 64 runs ( 0.57 ms per token)
llama_print_timings: prompt eval time = 16568.28 ms / 1031 tokens ( 16.07 ms per token)
llama_print_timings: eval time = 26956.10 ms / 63 runs ( 427.87 ms per token)
llama_print_timings: total time = 45821.06 ms
40 layers
llama_print_timings: load time = 16183.76 ms
llama_print_timings: sample time = 53.52 ms / 91 runs ( 0.59 ms per token)
llama_print_timings: prompt eval time = 16183.58 ms / 1031 tokens ( 15.70 ms per token)
llama_print_timings: eval time = 33001.97 ms / 90 runs ( 366.69 ms per token)
llama_print_timings: total time = 52487.18 ms
If I try more than 40, it seems to default back to 40 layers.
Using embedded DuckDB with persistence: data will be stored in: db
llama.cpp: loading model from models/ggml-vicuna-13B-1.1-q5_1.bin.3
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 4096
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 9 (mostly Q5_1)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.09 MB
llama_model_load_internal: mem required = 2282.48 MB (+ 1608.00 MB per state)
llama_model_load_internal: [cublas] offloading 40 layers to GPU
llama_model_load_internal: [cublas] offloading output layer to GPU
llama_model_load_internal: [cublas] total VRAM used: 9076 MB
....................................................................................................
llama_init_from_file: kv self size = 3200.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
Maybe my card is just old and slow? I might check the BIOS settings; it could also just be that the motherboard isn't great.
No, it's not your video card being slow. llama.cpp was born to run models on CPUs and only recently introduced GPU acceleration, but that just speeds up certain types of computation; it is not running the whole model on the GPU, so llama.cpp is still CPU-intensive. Your results will most likely be limited by your CPU.
In my fork of this project I created a branch with HuggingFace hard-coded, that is, I quickly wrote an implementation that loads models in their native HuggingFace format rather than ggml (so you would have to download a model again if you don't already have one). I don't remember whether I committed the change to load them in 4-bit using double quantization. In any case, be careful: I did not create that branch with the public in mind, but only to see what my 6GB of VRAM could do. Above all, I haven't had time to fix it, since today I have my physics exam at university and I was studying 😬 As soon as I have some time I'll fix that branch.
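For anyone curious, loading a native HuggingFace model in 4-bit with double quantization generally looks like the sketch below (a generic transformers + bitsandbytes example; the model id is just a placeholder, and this is not necessarily what that branch does):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "your-org/vicuna-13b-hf"  # placeholder model id

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                  # quantize the weights to 4 bit while loading
    bnb_4bit_use_double_quant=True,     # double quantization: also quantize the quantization constants
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             quantization_config=bnb_config,
                                             device_map="auto")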
Haven't tried this one, but I was able to run the original privateGPT without problems on my Mac M1 with 8GB. Question: what would the configuration be for running this on my Mac M1 with only 8GB?
It appears that the way llama.cpp loads and processes the model on M1 processors is different from how it does on other processors, so privateGPT without this pull request should still work fine for you. Unfortunately I don't have a MacBook available to give you more information, sorry.
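Not tested on an M1 here, but as a starting point you could keep the new parameters conservative, along these lines (the model path is a placeholder and the values are assumptions to tune for your machine):

from langchain.llms import LlamaCpp

llm = LlamaCpp(
    model_path="models/your-7b-model-q4_0.bin",  # placeholder: a small quantized model fits 8GB better
    n_ctx=1000,
    n_batch=8,         # imartinez reported crashes above n_batch=16 on an M1
    n_gpu_layers=0,    # the cuBLAS offloading discussed above doesn't apply here
    use_mlock=False,   # don't pin several GB of model into 8GB of RAM
    verbose=False,
)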
Any plans to run CUDA inside Docker? I heard somewhere that it's possible:
Nvidia CUDA in a Docker container:
1. run nvidia-smi on the host; it needs to run successfully
2. install nvidia-container-toolkit
3. restart the Docker process
4. run a test container like so:
docker run --gpus all nvidia/cuda:12.1.1-base-ubi8 nvidia-smi
The output should be the same as in step 1.
Once it works for you, you can:
- pull this PR (not merged, but working): https://github.com/ggerganov/llama.cpp/pull/1461
- download a model from here (I tried the smallest one): https://huggingface.co/gotzmann/LLaMA-GGML-v2/tree/main
- create a Docker image locally; there is a description of how to do it in the PR
- start the process like this:
docker run --rm --gpus all -v ~/Development/LLM/Models/:/models local/llama.cpp:light-cuda -m /models/llama-7b-ggml-v2-q4_0.bin -p " Here's a haiku about a rotten banana" -n 512 --n-gpu-layers 1
(with the path changed to your models directory, of course)
https://docs.docker.com/config/containers/resource_constraints/
Hi @DanielusG, I'm interested in whether you are keeping an OpenAI branch to try out. The readme on that branch would need to be updated to point out MODEL_TYPE=OpenAI and anything else. Thanks.
@sime2408
Any plans to run CUDA inside Docker? I heard somewhere that it's possible
This is definitely possible. I've used the tech you mention to deploy instant-ngp in a restricted environment that ran an older OS. Performance was great, and forwarding the UI out of the container was also possible if you have a need for a GUI.
Interestingly, it's also used by cog to streamline the deployment of ML models via docker containers. It attempts to make the packaging of the dependencies less of a headache. Not sure if that system would suit the needs of this repo, but could be worth a look as well.
I'm not a maintainer but I think it would be super helpful if you separate out all your changes and create separate PRs. That'll make it easier to test/evaluate in isolation and speed up merging!
For example, one PR just for performance improvements. One PR for translation. One PR for removing the example text. etc. etc.
I am not yet good with GitHub; what would you suggest I do? Close this PR and open several with individual features? How do I remove the merged changes from my master branch? Thanks for your patience 🥲
@DanielusG you can create multiple pull requests (PRs) to the original repository from different branches of your forked repository. Each branch of your forked repository can have its own PR to the original repository
Creating branches is super easy:
After cloning, navigate to the repository's directory using the command cd REPO-NAME and then create a new branch using the command git checkout -b BRANCH-NAME.
Commit, push, create PR