
How to use GPU instead of CPU

Open mnofrizal opened this issue 1 year ago • 5 comments

Can we use the GPU to get responses faster than with the CPU?

mnofrizal avatar May 26 '23 15:05 mnofrizal

GPT4All doesn't support GPU acceleration. Will add support for models like Llama, which can do this.

Anil-matcha avatar May 27 '23 01:05 Anil-matcha

I was able to get the GPU working with this Llama model, ggml-vic13b-q5_1.bin, using a manual workaround.

# Download the ggml-vic13b-q5_1.bin model and place it in privateGPT/server/models/
# Edit privateGPT.py: comment out the GPT4All model and add the LlamaCpp model
# Adjust n_gpu_layers=40 to suit your Nvidia GPU (40 is the max for this model). Uses about 9 GB of VRAM.

def load_model():
    filename = 'ggml-vic13b-q5_1.bin'  # Specify the name for the downloaded file
    models_folder = 'models'  # Specify the name of the folder inside the Flask app root
    file_path = f'{models_folder}/{filename}'
    if os.path.exists(file_path):
        global llm
        callbacks = [StreamingStdOutCallbackHandler()]
        # model_path and model_n_ctx are read from the .env settings shown below
        #llm = GPT4All(model=model_path, n_ctx=model_n_ctx, backend='gptj', callbacks=callbacks, verbose=False)
        llm = LlamaCpp(model_path=model_path, n_ctx=model_n_ctx, n_gpu_layers=40, callbacks=callbacks, verbose=False)
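For a quick standalone sanity check of the same LlamaCpp call outside privateGPT, a minimal sketch like the one below should work (the prompt is just an example; it assumes you run it from privateGPT/server/ so the relative model path resolves):

import os
from langchain.llms import LlamaCpp

model_path = 'models/ggml-vic13b-q5_1.bin'  # same relative path used in the .env below
assert os.path.exists(model_path), f'model not found: {model_path}'

# n_gpu_layers=40 offloads all 40 layers of this 13B model to the GPU
llm = LlamaCpp(model_path=model_path, n_ctx=1000, n_gpu_layers=40, verbose=False)
print(llm('Q: Name one planet in the solar system. A:'))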

# Edit privateGPT/server/.env and update it as follows
PERSIST_DIRECTORY=db
MODEL_TYPE=LlamaCpp
MODEL_PATH=models/ggml-vic13b-q5_1.bin
EMBEDDINGS_MODEL_NAME=all-MiniLM-L6-v2
MODEL_N_CTX=1000
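
For context, these keys are presumably loaded with python-dotenv inside privateGPT.py, roughly like the sketch below (the exact code in the repo may differ):

import os
from dotenv import load_dotenv

load_dotenv()  # reads privateGPT/server/.env

persist_directory = os.environ.get('PERSIST_DIRECTORY')          # db
model_type = os.environ.get('MODEL_TYPE')                        # LlamaCpp
model_path = os.environ.get('MODEL_PATH')                        # models/ggml-vic13b-q5_1.bin
embeddings_model_name = os.environ.get('EMBEDDINGS_MODEL_NAME')  # all-MiniLM-L6-v2
model_n_ctx = int(os.environ.get('MODEL_N_CTX'))                 # 1000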

# If using a conda environment
conda install -c "nvidia/label/cuda-12.1.1" cuda-toolkit

# Remove and reinstall llama-cpp-python with ENV variables set
# Linux uses "export" to set environment variables, whereas Windows uses "set"

pip uninstall llama-cpp-python
export CMAKE_ARGS="-DLLAMA_CUBLAS=on"
export FORCE_CMAKE=1
pip install llama-cpp-python --no-cache-dir

Run python privateGPT.py from the privateGPT/server/ directory. You should see the following lines in the output as the model loads:

llama_model_load_internal: [cublas] offloading 40 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 9076 MB
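
To confirm that the rebuilt llama-cpp-python itself offloads to the GPU, independent of privateGPT, a small test script like the following (a sketch; the prompt and max_tokens are arbitrary) should print BLAS = 1 in the system info line plus the cublas offload messages above. If it still prints BLAS = 0, the wheel was not built with cuBLAS:

from llama_cpp import Llama

# verbose=True lets llama.cpp print its system info line (including BLAS = 0/1)
# and the "[cublas] offloading ... layers to GPU" messages on a CUDA build
llm = Llama(model_path='models/ggml-vic13b-q5_1.bin', n_ctx=1000,
            n_gpu_layers=40, verbose=True)
out = llm('Q: What is 2 + 2? A:', max_tokens=16)
print(out['choices'][0]['text'])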

bradsec avatar May 30 '23 01:05 bradsec

Hi, thanks for your info. But when I followed your steps on my Windows machine, I got this error: Could not load Llama model from path: D:/code/privateGPT/server/models/ggml-vic13b-q5_1.bin. Received error (type=value_error). Any idea about this? Thanks.

jackyoung022 avatar May 30 '23 09:05 jackyoung022

@bradsec

Hi,

I followed the instructions, but it looks like it is still using the CPU:

(venPrivateGPT) (base) alp2080@alp2080:~/data/dProjects/privateGPT/server$ python privateGPT.py
/data/dProjects/privateGPT/server/privateGPT.py:1: DeprecationWarning: 'flask.Markup' is deprecated and will be removed in Flask 2.4. Import 'markupsafe.Markup' instead.
  from flask import Flask,jsonify, render_template, flash, redirect, url_for, Markup, request
llama.cpp: loading model from models/ggml-vic13b-q5_1.bin
llama_model_load_internal: format = ggjt v2 (pre #1508)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 1000
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 9 (mostly Q5_1)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.09 MB
llama_model_load_internal: mem required = 11359.05 MB (+ 1608.00 MB per state)
llama_new_context_with_model: kv self size = 781.25 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
LLM0 LlamaCpp Params: {'model_path': 'models/ggml-vic13b-q5_1.bin', 'suffix': None, 'max_tokens': 256, 'temperature': 0.8, 'top_p': 0.95, 'logprobs': None, 'echo': False, 'stop_sequences': [], 'repeat_penalty': 1.1, 'top_k': 40}

 * Serving Flask app 'privateGPT'
 * Debug mode: off
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:5000
 * Running on http://192.168.5.110:5000
Press CTRL+C to quit
Loading documents from source_documents

MyraBaba avatar Jun 29 '23 11:06 MyraBaba

I tried this as well and it looks like it's still using the CPU. Interesting. If anyone can suggest why it's not working with the GPU, please let me know.

Musty1 avatar Jul 19 '23 10:07 Musty1