localGPT
Problem Running Model with GPU
I am trying to get the prompt QA route working for my fork of this repo on an EC2 instance. It runs well enough on the CPU of my M1 laptop (with a different model, of course), but it's slow, so I decided to run it on a cloud machine with a GPU. However, the prompt route is now so slow that it never responds at all; the server request just gets stuck in pending.
I'm a bit confused about what's going on. The GPU seems to be working hard, but nothing ever comes back. I thought I was using the lightest model available and that my machine had plenty of resources, but something is still amiss.
Model: TheBloke/wizardLM-7B-GPTQ
Model Base Name: model.safetensors
Instance Type: g4dn.2xlarge
GPUs: 1
CPUs: 8
Memory: 32 GiB
Docker log:
INFO:root:Running on: cuda
INFO:root:Display Source Documents set to: False
INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: hkunlp/instructor-large
Downloading (…)c7233/.gitattributes: 100%|██████████| 1.48k/1.48k [00:00<00:00, 10.9MB/s]
Downloading (…)_Pooling/config.json: 100%|██████████| 270/270 [00:00<00:00, 2.94MB/s]
Downloading (…)/2_Dense/config.json: 100%|██████████| 116/116 [00:00<00:00, 1.10MB/s]
Downloading pytorch_model.bin: 100%|██████████| 3.15M/3.15M [00:00<00:00, 81.7MB/s]
Downloading (…)9fb15c7233/README.md: 100%|██████████| 66.3k/66.3k [00:00<00:00, 84.1MB/s]
Downloading (…)b15c7233/config.json: 100%|██████████| 1.53k/1.53k [00:00<00:00, 15.0MB/s]
Downloading (…)ce_transformers.json: 100%|██████████| 122/122 [00:00<00:00, 1.24MB/s]
Downloading pytorch_model.bin: 100%|██████████| 1.34G/1.34G [00:05<00:00, 259MB/s]
Downloading (…)nce_bert_config.json: 100%|██████████| 53.0/53.0 [00:00<00:00, 324kB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 2.20k/2.20k [00:00<00:00, 13.0MB/s]
Downloading spiece.model: 100%|██████████| 792k/792k [00:00<00:00, 340MB/s]
Downloading (…)c7233/tokenizer.json: 100%|██████████| 2.42M/2.42M [00:00<00:00, 91.8MB/s]
Downloading (…)okenizer_config.json: 100%|██████████| 2.41k/2.41k [00:00<00:00, 14.7MB/s]
Downloading (…)15c7233/modules.json: 100%|██████████| 461/461 [00:00<00:00, 3.17MB/s]
INFO:root:Loading Model: TheBloke/wizardLM-7B-GPTQ, on: cuda
INFO:root:This action can take a few minutes!
INFO:root:Using AutoGPTQForCausalLM for quantized models
load INSTRUCTOR_Transformer
max_seq_length 512
The directory does not exist
Loading Model: TheBloke/wizardLM-7B-GPTQ, on: cuda
Downloading (…)okenizer_config.json: 100%|██████████| 727/727 [00:00<00:00, 5.18MB/s]
Downloading tokenizer.model: 100%|██████████| 500k/500k [00:00<00:00, 263MB/s]
Downloading (…)/main/tokenizer.json: 100%|██████████| 1.84M/1.84M [00:00<00:00, 13.0MB/s]
Downloading (…)in/added_tokens.json: 100%|██████████| 21.0/21.0 [00:00<00:00, 146kB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 435/435 [00:00<00:00, 2.92MB/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO:root:Tokenizer loaded
Downloading (…)lve/main/config.json: 100%|██████████| 809/809 [00:00<00:00, 4.95MB/s]
Downloading (…)quantize_config.json: 100%|██████████| 158/158 [00:00<00:00, 1.01MB/s]
Downloading model.safetensors: 100%|██████████| 4.52G/4.52G [00:17<00:00, 263MB/s]]
INFO:auto_gptq.modeling._base:lm_head not been quantized, will be ignored when make_quant.
WARNING:auto_gptq.nn_modules.qlinear_old:CUDA extension not installed.
INFO:accelerate.utils.modeling:We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
WARNING:auto_gptq.nn_modules.fused_llama_mlp:skip module injection for FusedLlamaMLPForQuantizedModel not support integrate without triton yet.
Downloading (…)neration_config.json: 100%|██████████| 132/132 [00:00<00:00, 1.03MB/s]
The model 'LlamaGPTQForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'LlamaForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MistralForCausalLM', 'MptForCausalLM', 'MusicgenForCausalLM', 'MvpForCausalLM', 'OpenLlamaForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PersimmonForCausalLM', 'PLBartForCausalLM', 'ProphetNetForCausalLM', 'QDQBertLMHeadModel', 'ReformerModelWithLMHead', 'RemBertForCausalLM', 'RobertaForCausalLM', 'RobertaPreLayerNormForCausalLM', 'RoCBertForCausalLM', 'RoFormerForCausalLM', 'RwkvForCausalLM', 'Speech2Text2ForCausalLM', 'TransfoXLLMHeadModel', 'TrOCRForCausalLM', 'XGLMForCausalLM', 'XLMWithLMHeadModel', 'XLMProphetNetForCausalLM', 'XLMRobertaForCausalLM', 'XLMRobertaXLForCausalLM', 'XLNetLMHeadModel', 'XmodForCausalLM'].
INFO:root:Local LLM Loaded
got model HuggingFacePipeline
Params: {'model_id': 'gpt2', 'model_kwargs': None, 'pipeline_kwargs': None}
* Serving Flask app 'run_localGPT_API'
* Debug mode: off
INFO:werkzeug:WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
* Running on all addresses (0.0.0.0)
* Running on http://127.0.0.1:5110
* Running on http://172.17.0.2:5110
INFO:werkzeug:Press CTRL+C to quit
INFO:werkzeug:137.83.113.253 - - [12/Oct/2023 01:47:32] "GET /api/test HTTP/1.1" 200 -
/usr/local/lib/python3.10/dist-packages/transformers/generation/configuration_utils.py:362: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.2` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
warnings.warn(
Here's what happens when I run nvidia-smi:
Thu Oct 12 01:58:44 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06 Driver Version: 525.125.06 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 72C P0 69W / 70W | 7651MiB / 15360MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 4088 C python 7648MiB |
+-----------------------------------------------------------------------------+
Hi, did you have any luck with this? I am also having issues using the GPU.
skip module injection for FusedLlamaMLPForQuantizedModel not support integrate without triton yet. The model 'LlamaGPTQForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MusicgenForCausalLM', 'MvpForCausalLM', 'OpenLlamaForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PLBartForCausalLM', 'ProphetNetForCausalLM', 'QDQBertLMHeadModel', 'ReformerModelWithLMHead', 'RemBertForCausalLM', 'RobertaForCausalLM', 'RobertaPreLayerNormForCausalLM', 'RoCBertForCausalLM', 'RoFormerForCausalLM', 'RwkvForCausalLM', 'Speech2Text2ForCausalLM', 'TransfoXLLMHeadModel', 'TrOCRForCausalLM', 'XGLMForCausalLM', 'XLMWithLMHeadModel', 'XLMProphetNetForCausalLM', 'XLMRobertaForCausalLM', 'XLMRobertaXLForCausalLM', 'XLNetLMHeadModel', 'XmodForCausalLM'].
Why am I seeing this error again and again?
Hello, I got the GPU to work for this program with a GPTQ model. I think you should check your auto-gptq version; mine is auto_gptq-0.3.0+cu118 (a quick way to check is in the sketch below). These are the steps and library versions I used to get it to work.
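A minimal sketch (assuming auto-gptq and torch are already installed in your environment) for checking which auto-gptq build you ended up with; a CPU-only build is one likely cause of the "CUDA extension not installed" warning in the log above, which forces a very slow fallback path:

import auto_gptq
print(auto_gptq.__version__)  # expect a CUDA suffix, e.g. "0.3.0+cu118"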
Download and install Anaconda.
Download and install Nvidia CUDA.
Double-check the CUDA installation using nvcc -V.
Create a virtual environment using conda and verify the Python installation:
conda create -n localGPT python=3.10 -c conda-forge -y
conda activate localGPT
python --version
Install CUDA toolkit 11.7 (optional):
conda install -c conda-forge cudatoolkit=11.7 -y
set CUDA_HOME=%CONDA_PREFIX%
Git clone localGPT and install the required libraries (install PyTorch with CUDA 11.7 support):
git clone https://github.com/PromtEngineer/localGPT.git
cd localGPT
Edit the requirements.txt file inside the folder: comment out the existing bitsandbytes and bitsandbytes-windows entries, then pin/add the following entries:
transformers==4.35.0
sentence-transformers==2.2.2
datasets==2.14.6
qdrant_client
psycopg2
pgvector
bitsandbytes @ https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.40.1.post1-py3-none-win_amd64.whl
auto-gptq @ https://github.com/PanQiWei/AutoGPTQ/releases/download/v0.3.0/auto_gptq-0.3.0+cu118-cp310-cp310-win_amd64.whl
torch @ https://download.pytorch.org/whl/cu117/torch-2.0.1%2Bcu117-cp310-cp310-win_amd64.whl
torchvision @ https://download.pytorch.org/whl/cu117/torchvision-0.15.2%2Bcu117-cp310-cp310-win_amd64.whl
torchaudio @ https://download.pytorch.org/whl/cu117/torchaudio-2.0.2%2Bcu117-cp310-cp310-win_amd64.whl
pip install -r requirements.txt
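After the install, a quick sanity check (a sketch, assuming the CUDA wheels above were used) that PyTorch actually sees the GPU before starting the API:

import torch
print(torch.__version__)              # expect a "+cu117" suffix with the wheels above
print(torch.cuda.is_available())      # expect True
print(torch.cuda.get_device_name(0))  # e.g. "Tesla T4" on a g4dn instance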
Open constants.py and configure the MODEL_ID and MODEL_BASENAME:
MODEL_ID = "TheBloke/Llama-2-7b-Chat-GPTQ"
MODEL_BASENAME = "model.safetensors"
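For reference, a rough sketch (not the exact localGPT code) of how such a GPTQ checkpoint gets loaded with auto-gptq 0.3.x; note that from_quantized takes the basename without the file extension:

from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-7b-Chat-GPTQ",
    model_basename="model",   # "model.safetensors" minus the extension
    use_safetensors=True,
    device="cuda:0",
    use_triton=False,         # matches the "without triton" warning in the log
)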
Run run_localGPT.py and observe the GPU usage from Task Manager under Performance:
python run_localGPT.py
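Besides Task Manager or nvidia-smi, a tiny sketch for confirming programmatically (from inside the running process, after the model has loaded) that the weights actually landed on the GPU:

import torch
print(f"{torch.cuda.memory_allocated() / 1024**3:.2f} GiB allocated on GPU 0")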