private-gpt
VERY BIG performance improvement and beautiful features
- Fixed an issue that made evaluation of the user's input prompt extremely slow; this brings a huge performance increase, roughly 5-6 times faster.
- Added a script to install CUDA-accelerated requirements
- Added the OpenAI model (it may fall outside the scope of this repository, so I can remove it if necessary)
- Added some additional flags in the .env
- Changed the embedder to a better-performing one
- Bumped some versions, like llama-cpp-python and langchain (WARNING: with the new version of llama.cpp, models are more capable, but old model files are now completely incompatible; if necessary, you can downgrade)
- Removed the state-of-the-union example file, because someone who doesn't notice it might leave it there and it would interfere with their queries
- Added auto translation (perhaps it should be removed, because it uses an Internet connection)
How does adding n_gpu_layers and use_mlock help performance?
llm = LlamaCpp(model_path=model_path, n_ctx=model_n_ctx, callbacks=callbacks, verbose=False, n_gpu_layers=n_gpu_layers, use_mlock=use_mlock, top_p=0.9, n_batch=1024)
If the user has an Nvidia GPU, part of the model is offloaded to the GPU, which speeds things up. use_mlock locks the model in RAM so it is never swapped out to disk, which also helps performance. I added them as options in the .env precisely to let the user choose whether they want these improvements or not.
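For illustration, here is a minimal sketch of how those two options might be read from the .env and passed through to LlamaCpp. The N_GPU_LAYERS and USE_MLOCK variable names (and the default model path) are assumptions for the example, not necessarily the exact names used in this PR:

import os
from dotenv import load_dotenv
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.llms import LlamaCpp

load_dotenv()
model_path = os.environ.get("MODEL_PATH", "models/ggml-vicuna-13B-1.1-q5_1.bin")
model_n_ctx = int(os.environ.get("MODEL_N_CTX", "1000"))
# assumed .env entries for this sketch:
#   N_GPU_LAYERS=12   -> number of transformer layers to offload to the GPU (0 = CPU only)
#   USE_MLOCK=true    -> lock the model in RAM so it is never swapped out to disk
n_gpu_layers = int(os.environ.get("N_GPU_LAYERS", "0"))
use_mlock = os.environ.get("USE_MLOCK", "false").lower() == "true"

callbacks = [StreamingStdOutCallbackHandler()]
llm = LlamaCpp(model_path=model_path, n_ctx=model_n_ctx, callbacks=callbacks, verbose=False,
               n_gpu_layers=n_gpu_layers, use_mlock=use_mlock, top_p=0.9, n_batch=1024)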
Thanks for the contribution! Looking forward to merging this! Some comments/questions:
- Could you elaborate on the "issue that made the prompt evaluation slow" that you fixed?
- I'd leave OpenAI out of this repo for the moment. It doesn't fit well with its very purpose.
- I guess some updates to the readme would be required.
Thanks again!
Could you elaborate on the "issue that made the prompt evaluation slow" that you fixed?
I had also opened an issue about it (#493). Basically, after a while of debugging, I ran the code in debug mode and analysed line by line where the program got stuck for so long, and after quite a while I discovered that when the langchain library made the call to LlamaCpp, it passed it the parameter n_batch=8.
For those who don't know, n_batch indicates how many tokens at a time are processed by the llama context. With it set so low, a 1000-token context required a loop of 1000/8 = 125 iterations, which is extremely slow and heavy. In fact, the default value used in the main llama.cpp repository is 512. After some experimentation, I noticed that 1024 seems a good value.
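To put numbers on that (just arithmetic, following the explanation above that the prompt is ingested in chunks of n_batch tokens, each chunk carrying a fixed per-call overhead):

import math

# Rough count of evaluation calls needed to ingest a prompt: each call
# processes at most n_batch tokens.
prompt_tokens = 1000
for n_batch in (8, 512, 1024):
    calls = math.ceil(prompt_tokens / n_batch)
    print(f"n_batch={n_batch:4d} -> ~{calls} calls to process a {prompt_tokens}-token prompt")
# n_batch=   8 -> ~125 calls
# n_batch= 512 -> ~2 calls
# n_batch=1024 -> ~1 call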
I'd leave OpenAI out of this repo for the moment. It doesn't fit well with its very purpose.
Yes, this is perfectly fine; in fact I even wrote that above. I simply left it in because I was getting poor performance with llama 13b and wanted to test whether GPT-3.5 was better.
In the end I solved it by using vicuna 13b 1.1 q5_1.
I guess some updates to the readme would be required.
Yes, I think so too. If you want, I can do it, but my English is not so good haha.
@imartinez since I understand that many people may not have sufficient computing power to run this code, if you want I can create a new branch in my fork and leave OpenAI enabled. What do you think?
for reference, after running the requirements, I still had to install the following (on a clean environment):
- python -m pip install python-dotenv
- pip install tqdm
- pip install langchain
- pip install chromadb
- pip install sentence_transformers
- pip install llama-cpp-python
the last resulted in:
nvcc fatal : Value 'native' is not defined for option 'gpu-architecture'
Running an i5 / 32GB RAM / Nvidia Titan 12GB VRAM.
nvcc --list-gpu-arch:
compute_35 compute_37 compute_50 compute_52 compute_53 compute_60 compute_61 compute_62 compute_70 compute_72 compute_75 compute_80 compute_86 compute_87
Did you use the bash script?
If so, before starting the script you must execute:
source ./venv/bin/activate
to activate the local environment.
Some may use conda so it would go:
#!/bin/bash
export LLAMA_CUBLAS=1
source ~/anaconda3/bin/activate
# check if venv virtual env exists
if conda info --envs | grep -q "venv"
then
    echo "env already exists"
    conda activate /usr/local/anaconda3/envs/venv
else
    conda create -y -n "venv"
    conda activate /usr/local/anaconda3/envs/venv
    pip3 install -r requirements.txt
fi
echo "Done! Active envs:"
conda info --envs
@DanielusG I added n_batch=2000 and the performance increase was phenomenal! You are right and I am blown away. The prompt eval time moved from this:
llama_print_timings: prompt eval time = 24568.61 ms / 650 tokens ( 37.80 ms per token)
to this:
llama_print_timings: prompt eval time = 3349.03 ms / 651 tokens ( 5.14 ms per token)
If I am not mistaken, that's about an 87% reduction in time. It moved from 24 seconds to 3 seconds.
Just to clarify, for those who may be interested in performance relative to the actual graphics card: I am using a 12GB 3060.
@imartinez This is definitely something that should be looked at. This makes it so much more usable, at a different level.
If I am not mistaken, that's about an 87% reduction in time. It moved from 24 seconds to 3 seconds.
Yes, you are right, but put that way it doesn't quite convey the scale: it's not just an 87.5% reduction, it's about 8 times faster, meaning it takes roughly one eighth of the time it did before. That gives a better sense of it 😄
Haha, fair enough. It's true, the percentage doesn't give the real feel of it; it really is that drastic a difference. I also realized, as you say, that it doesn't need to be as high as I set it: I achieved a similar speed with n_batch=1024 vs n_batch=2000.
Oh, this also makes it possible to use 13B quantized models. It is slightly slower, by a second or two, than the 7B I was using before, but of course it gives even better answers.
Thanks for the detailed info @DanielusG! I'll be running some more tests before merging, feel free to keep it as a branch on your repo and evolve it further, we'll definitely be using your contributions down the line! My Mac M1 crashes with n_batch > 16... so the limitation for certain computers is quite real. I will be making the readme more informational so different users can optimize for their use cases and machines.
@imartinez is it possible that it crashes due to the small amount of RAM available? I guess the M1 has 8, or at most 16, GB of RAM if it's the laptop. On my laptop (Arch Linux), vicuna 13b 1.1 q5_1 uses 21GB of RAM.
And as I also wrote in the description, increasing that value can result in high resource usage.
In any case, is the translator I integrated OK? I guess it goes against the purpose of this repository, since it uses Google Translate and therefore a connection (consider that the user can choose to disable this feature from the .env), so if I have to revert it, I'll do it without any problem!
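For context, an optional translation step of this kind could look roughly like the sketch below. The deep-translator library and the TRANSLATE flag name are assumptions for illustration, not necessarily what the PR actually uses:

import os
from deep_translator import GoogleTranslator

def maybe_translate(text: str, target: str) -> str:
    # Only go online when the (assumed) TRANSLATE flag is enabled in the .env;
    # otherwise return the text untouched and stay fully offline.
    if os.environ.get("TRANSLATE", "false").lower() != "true":
        return text
    return GoogleTranslator(source="auto", target=target).translate(text)

user_query = "Come si combatte l'inflazione?"
query_for_llm = maybe_translate(user_query, "en")    # prompt translated to English
answer = "One way to fight inflation is ..."         # placeholder for the LLM's answer
answer_for_user = maybe_translate(answer, "it")      # answer translated back to the user's language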
Did you use the bash script?
If so, before starting the script you must execute:
source ./venv/bin/activate to activate the local environment.
I didn't use the script directly, my steps were:
git clone https://github.com/imartinez/privateGPT.git
export LLAMA_CUBLAS=1
python -m venv create privateGPT
source privateGPT/bin/activate
then I needed to do: pip install llama-cpp-python, which resulted in: nvcc fatal : Value 'native' is not defined for option 'gpu-architecture'
I should say it's on Ubuntu 22.04.
thanks
#!/bin/bash
export LLAMA_CUBLAS=1
source ~/anaconda3/bin/activate
# check if venv virtual env exists
if conda info --envs | grep -q "venv"
then
    echo "env already exists"
    conda activate /usr/local/anaconda3/envs/venv
else
    conda create -y -n "venv"
    conda activate /usr/local/anaconda3/envs/venv
    pip3 install -r requirements.txt
fi
echo "Done! Active envs:"
conda info --envs
Both the above and running ./install_cuda.sh result in the same error. I still get "nvcc fatal : Value 'native' is not defined for option 'gpu-architecture'", so there must be some missing library or similar causing the issue in my environment?
full trace -> https://pastebin.com/iBaAhX7n
i5/32GB/Nvidia Titan 12GB, thanks
this is where it fails:
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -c llama.cpp -o llama.o
cc -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -c ggml.c -o ggml.o
nvcc --forward-unknown-to-host-compiler -arch=native -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_DMMV_Y=1 -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -Wno-pedantic -c ggml-cuda.cu -o ggml-cuda.o
nvcc fatal : Value 'native' is not defined for option 'gpu-architecture'
make: *** [Makefile:147: ggml-cuda.o] Error 1
ninja: build stopped: subcommand failed.
OK, installing the latest Nvidia toolkit (12.1) allowed llama-cpp-python to build correctly; it seems the Ubuntu packages are somewhat out of date.
I also had to edit /etc/security/limits.conf to raise the memlock limit.
OK, got it working with n_batch 2000. Not as fast as a previous poster, but better than before:
Using embedded DuckDB with persistence: data will be stored in: db
llama.cpp: loading model from models/ggml-vicuna-13B-1.1-q5_1.bin.3
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 4096
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 9 (mostly Q5_1)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.09 MB
llama_model_load_internal: mem required = 8636.08 MB (+ 1608.00 MB per state)
llama_model_load_internal: [cublas] offloading 12 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 2722 MB
....................................................................................................
llama_init_from_file: kv self size = 3200.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
Enter a query: how do we fight inflation?
One way to fight inflation is by building a better America through investments in infrastructure, education, and job training programs that increase the productive capacity of our economy, as discussed in the given text. Another way is by cutting costs and making it easier for businesses to produce goods and services efficiently, as mentioned in the plan proposed by the
llama_print_timings: load time = 17124.50 ms
llama_print_timings: sample time = 40.61 ms / 73 runs ( 0.56 ms per token)
llama_print_timings: prompt eval time = 17124.31 ms / 1031 tokens ( 16.61 ms per token)
llama_print_timings: eval time = 47569.89 ms / 72 runs ( 660.69 ms per token)
llama_print_timings: total time = 67293.34 ms
where the previous result had:
llama_model_load_internal: [cublas] offloading 12 layers to GPU llama_model_load_internal: [cublas] total VRAM used: 2722 MB
Is it GPU dependent? The card used here has 12GB; would a larger GPU help? Can it run across multiple GPUs?
Glad to hear you managed to get it working. Why don't you increase the number of offloaded layers? With 12 GB of VRAM you can load most of the model into VRAM, and it should give you more performance.
As for using multiple GPUs, I don't think it's feasible for now; you should have a look at the llama.cpp project and see if there are any updates there. llama-cpp-python is just the Python binding; the rest is done by llama.cpp.
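As a rough guide, the logs above show about 2722 MB of VRAM for 12 offloaded layers of this 13B Q5_1 model (roughly 227 MB per layer), so you can sketch a layer count from a VRAM budget. This is only an estimate under those assumptions; real usage also depends on the model and on scratch/KV buffers:

def estimate_gpu_layers(vram_free_mb: int,
                        mb_per_layer: float = 2722 / 12,  # ratio observed in the logs above
                        total_layers: int = 40,           # a 13B llama model has 40 layers (see n_layer in the log)
                        headroom_mb: int = 1500) -> int:
    # Leave some headroom for scratch buffers, then offload as many layers as fit.
    usable = max(vram_free_mb - headroom_mb, 0)
    return min(total_layers, int(usable // mb_per_layer))

print(estimate_gpu_layers(12000))  # -> 40: the whole 13B Q5_1 model fits on a 12 GB card
print(estimate_gpu_layers(6000))   # -> 19 layers on a 6 GB card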
thanks DanielusG, I tried increasing the layers, the timings didn't change much
24 layers
llama_print_timings: load time = 16568.45 ms
llama_print_timings: sample time = 36.19 ms / 64 runs ( 0.57 ms per token)
llama_print_timings: prompt eval time = 16568.28 ms / 1031 tokens ( 16.07 ms per token)
llama_print_timings: eval time = 26956.10 ms / 63 runs ( 427.87 ms per token)
llama_print_timings: total time = 45821.06 ms
40 layers
llama_print_timings: load time = 16183.76 ms
llama_print_timings: sample time = 53.52 ms / 91 runs ( 0.59 ms per token)
llama_print_timings: prompt eval time = 16183.58 ms / 1031 tokens ( 15.70 ms per token)
llama_print_timings: eval time = 33001.97 ms / 90 runs ( 366.69 ms per token)
llama_print_timings: total time = 52487.18 ms
If I try more than 40, it seems to default back to 40 layers.
Using embedded DuckDB with persistence: data will be stored in: db
llama.cpp: loading model from models/ggml-vicuna-13B-1.1-q5_1.bin.3
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 4096
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 9 (mostly Q5_1)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.09 MB
llama_model_load_internal: mem required = 2282.48 MB (+ 1608.00 MB per state)
llama_model_load_internal: [cublas] offloading 40 layers to GPU
llama_model_load_internal: [cublas] offloading output layer to GPU
llama_model_load_internal: [cublas] total VRAM used: 9076 MB
....................................................................................................
llama_init_from_file: kv self size = 3200.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
Maybe my card is just old and slow? I might check the BIOS settings; it could also just be that the motherboard isn't great.
No, it's not your video card being slow. llama.cpp was born to run models on CPUs and only recently introduced GPU acceleration, but that just speeds up certain types of computation; it is not running the whole model on the GPU, so llama.cpp is still CPU-intensive. Your results will most likely be limited by your CPU.
In my fork of this project I created a branch with HuggingFace hard-coded, that is, I quickly wrote an implementation that loads models in their native HuggingFace format rather than ggml (so you would have to download a model again if you don't already have one). I don't remember whether I committed the change to load them in 4-bit using double quantization. In any case, be careful: I did not create that branch with the public in mind, but only to see what my 6GB of VRAM could do. Above all, I haven't had time to fix it, since today I have my physics exam at university and I was studying 😬 As soon as I have some time I'll fix that branch.
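For anyone curious, loading a native HuggingFace model in 4-bit with double quantization generally looks like the sketch below (a generic transformers + bitsandbytes example; the model id is just a placeholder, and this is not necessarily what that branch does):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "your-org/vicuna-13b-hf"  # placeholder model id

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                  # quantize the weights to 4 bit while loading
    bnb_4bit_use_double_quant=True,     # double quantization: also quantize the quantization constants
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             quantization_config=bnb_config,
                                             device_map="auto")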
Haven't tried this one, but I was able to run the original privateGPT without problems on my Mac M1 with 8GB. Question: what would the configuration be for running this on my Mac M1 with only 8GB?
It appears that the way llama.cpp loads and processes the model on M1 processors is different from how it does on other processors, so privateGPT without this pull request should still work fine for you. Unfortunately I don't have a MacBook available to give you more information, sorry.
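Not tested on an M1 here, but as a starting point you could keep the new parameters conservative, along these lines (the model path is a placeholder and the values are assumptions to tune for your machine):

from langchain.llms import LlamaCpp

llm = LlamaCpp(
    model_path="models/your-7b-model-q4_0.bin",  # placeholder: a small quantized model fits 8GB better
    n_ctx=1000,
    n_batch=8,         # imartinez reported crashes above n_batch=16 on an M1
    n_gpu_layers=0,    # the cuBLAS offloading discussed above doesn't apply here
    use_mlock=False,   # don't pin several GB of model into 8GB of RAM
    verbose=False,
)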
Any plans to run CUDA inside Docker? I heard somewhere that it's possible:
Nvidia CUDA in a Docker container:
1. run nvidia-smi on the host; it needs to run successfully
2. install nvidia-container-toolkit
3. restart the Docker process
4. run a test container like so:
docker run --gpus all nvidia/cuda:12.1.1-base-ubi8 nvidia-smi
The output should be the same as in step 1.
Once it works for you, you can:
- pull this PR (not merged, but working): https://github.com/ggerganov/llama.cpp/pull/1461
- download a model from here (I tried the smallest one): https://huggingface.co/gotzmann/LLaMA-GGML-v2/tree/main
- create a Docker image locally; there is a description of how to do it in the PR
- start the process like this:
docker run --rm --gpus all -v ~/Development/LLM/Models/:/models local/llama.cpp:light-cuda -m /models/llama-7b-ggml-v2-q4_0.bin -p " Here's a haiku about a rotten banana" -n 512 --n-gpu-layers 1
(with the path changed to your models directory, of course)
https://docs.docker.com/config/containers/resource_constraints/
Hi @DanielusG, I'm interested in whether you are keeping an OpenAI branch to try out. The readme on that branch would need to be updated to point out MODEL_TYPE=OpenAI and anything else. Thanks.
@sime2408
Any plans to run CUDA inside Docker? I heard somewhere that it's possible
This is definitely possible. I've used the tech you mention to deploy instant-ngp in a restricted environment that ran an older OS. Performance was great, and forwarding the UI out of the container was also possible if you have a need for a GUI.
Interestingly, it's also used by cog to streamline the deployment of ML models via docker containers. It attempts to make the packaging of the dependencies less of a headache. Not sure if that system would suit the needs of this repo, but could be worth a look as well.
I'm not a maintainer but I think it would be super helpful if you separate out all your changes and create separate PRs. That'll make it easier to test/evaluate in isolation and speed up merging!
For example, one PR just for performance improvements. One PR for translation. One PR for removing the example text. etc. etc.
I am not yet good with GitHub; what would you suggest I do? Close this PR and open several with individual features? How do I remove the merged changes from my master branch? Thanks for your patience 🥲
@DanielusG you can create multiple pull requests (PRs) to the original repository from different branches of your forked repository. Each branch of your forked repository can have its own PR to the original repository
Creating branches is super easy:
After cloning, navigate to the repository's directory using the command cd REPO-NAME and then create a new branch using the command git checkout -b BRANCH-NAME.
Commit, push, create PR