EmbedAI
How to use GPU instead of CPU
Can we use the GPU to get responses faster than with the CPU?
GPT4All doesn't support GPU acceleration. Will add support for models like Llama, which can do this.
I was able to get the GPU working with this Llama model, ggml-vic13b-q5_1.bin, using a manual workaround.
# Download the ggml-vic13b-q5_1.bin model and place it in privateGPT/server/models/
# Edit privateGPT.py: comment out the GPT4All model and add the LlamaCpp model
# Set n_gpu_layers based on your Nvidia GPU's VRAM (max is 40 for this model; 40 layers use about 9GB VRAM)
def load_model():
    filename = 'ggml-vic13b-q5_1.bin'  # name of the downloaded model file
    models_folder = 'models'  # folder inside the Flask app root
    file_path = f'{models_folder}/{filename}'
    if os.path.exists(file_path):
        global llm
        callbacks = [StreamingStdOutCallbackHandler()]
        # model_path and model_n_ctx come from the .env settings shown below
        #llm = GPT4All(model=model_path, n_ctx=model_n_ctx, backend='gptj', callbacks=callbacks, verbose=False)
        llm = LlamaCpp(model_path=model_path, n_ctx=model_n_ctx, n_gpu_layers=40, callbacks=callbacks, verbose=False)
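As a quick sanity check after this edit, you can call the model directly (a minimal sketch, assuming the global llm and the langchain imports already present in privateGPT.py; older langchain LLM objects are callable with a prompt string):

# hypothetical sanity check, not part of the privateGPT source
load_model()
print(llm('What is a llama?'))  # should stream tokens via the stdout callback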
# Edit privateGPT/server/.env and update it as follows
PERSIST_DIRECTORY=db
MODEL_TYPE=LlamaCpp
MODEL_PATH=models/ggml-vic13b-q5_1.bin
EMBEDDINGS_MODEL_NAME=all-MiniLM-L6-v2
MODEL_N_CTX=1000
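For reference, the server reads these settings at startup with python-dotenv. A rough sketch of how the values above end up in the code (variable names are assumptions, not taken from the actual source):

# sketch: how the .env values are read at startup (assumes python-dotenv)
import os
from dotenv import load_dotenv

load_dotenv()  # reads privateGPT/server/.env
model_type = os.environ.get('MODEL_TYPE')         # 'LlamaCpp'
model_path = os.environ.get('MODEL_PATH')         # 'models/ggml-vic13b-q5_1.bin'
model_n_ctx = int(os.environ.get('MODEL_N_CTX'))  # 1000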
# If using a conda environment
conda install -c "nvidia/label/cuda-12.1.1" cuda-toolkit
# Remove and reinstall llama-cpp-python with ENV variables set
# Linux uses "export" not "set" like Windows for setting environment variables
pip uninstall llama-cpp-python
export CMAKE_ARGS="-DLLAMA_CUBLAS=on"
export FORCE_CMAKE=1
pip install llama-cpp-python --no-cache-dir
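Before starting the server, you can verify the rebuilt wheel actually has cuBLAS compiled in. A quick check, assuming your llama-cpp-python version exposes the low-level llama_print_system_info() binding:

# prints the same capability line the server logs at startup
import llama_cpp
info = llama_cpp.llama_print_system_info().decode()
print(info)  # look for 'BLAS = 1'; 'BLAS = 0' means a CPU-only build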
Run python privateGPT.py from the privateGPT/server/ directory.
You should see the following lines in the output as the model loads:
llama_model_load_internal: [cublas] offloading 40 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 9076 MB
Hi, thanks for your info. But when I followed your steps on Windows, I got this error: Could not load Llama model from path: D:/code/privateGPT/server/models/ggml-vic13b-q5_1.bin. Received error (type=value_error). Any idea about this? Thanks.
@bradsec
Hi,
I followed the instructions but it looks like it's still using the CPU:
(venPrivateGPT) (base) alp2080@alp2080:~/data/dProjects/privateGPT/server$ python privateGPT.py
/data/dProjects/privateGPT/server/privateGPT.py:1: DeprecationWarning: 'flask.Markup' is deprecated and will be removed in Flask 2.4. Import 'markupsafe.Markup' instead.
  from flask import Flask, jsonify, render_template, flash, redirect, url_for, Markup, request
llama.cpp: loading model from models/ggml-vic13b-q5_1.bin
llama_model_load_internal: format = ggjt v2 (pre #1508)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 1000
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 9 (mostly Q5_1)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.09 MB
llama_model_load_internal: mem required = 11359.05 MB (+ 1608.00 MB per state)
llama_new_context_with_model: kv self size = 781.25 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
LLM0 LlamaCpp Params: {'model_path': 'models/ggml-vic13b-q5_1.bin', 'suffix': None, 'max_tokens': 256, 'temperature': 0.8, 'top_p': 0.95, 'logprobs': None, 'echo': False, 'stop_sequences': [], 'repeat_penalty': 1.1, 'top_k': 40}
- Serving Flask app 'privateGPT'
- Debug mode: off
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
- Running on all addresses (0.0.0.0)
- Running on http://127.0.0.1:5000
- Running on http://192.168.5.110:5000
Press CTRL+C to quit
Loading documents from source_documents
I tried this as well and it looks like it's still using the CPU. Interesting. If anyone can suggest why it's not working with the GPU, please let me know.
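One way to narrow this down: in the log above, BLAS = 0 indicates llama-cpp-python was built without cuBLAS, so the reinstall step likely didn't take effect in that environment. A minimal reproduction outside Flask can confirm it (a sketch using llama-cpp-python's high-level API; run it from privateGPT/server/ so the relative model path resolves):

from llama_cpp import Llama

# loading prints the same startup log as the server; a cuBLAS build shows
# 'BLAS = 1' and 'offloading 40 layers to GPU', while 'BLAS = 0' means the
# wheel must be rebuilt with CMAKE_ARGS="-DLLAMA_CUBLAS=on" as described above
llm = Llama(model_path='models/ggml-vic13b-q5_1.bin', n_ctx=1000, n_gpu_layers=40)
out = llm('Q: What is the capital of France? A:', max_tokens=16)
print(out['choices'][0]['text'])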