localGPT
Problem Running Model with GPU
I am trying to get the prompt QA route working for my fork of this repo on an EC2 instance. It runs well enough on the CPU of my M1 laptop (with a different model, of course), but it's slow, so I decided to run it on a cloud machine with a GPU. However, the prompt route is now so slow that it never responds at all; the server request just gets stuck in pending.
I'm a bit confused about what's going on. The GPU seems to be working hard, but nothing ever comes back. I thought I was using the lightest model available and that my machine had plenty of resources, but something is still amiss.
Model: TheBloke/wizardLM-7B-GPTQ
Model Base Name: model.safetensors
Instance Type: g4dn.2xlarge
GPUs: 1
CPUs: 8
Memory: 32 GiB
Docker log:
INFO:root:Running on: cuda
INFO:root:Display Source Documents set to: False
INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: hkunlp/instructor-large
Downloading (…)c7233/.gitattributes: 100%|██████████| 1.48k/1.48k [00:00<00:00, 10.9MB/s]
Downloading (…)_Pooling/config.json: 100%|██████████| 270/270 [00:00<00:00, 2.94MB/s]
Downloading (…)/2_Dense/config.json: 100%|██████████| 116/116 [00:00<00:00, 1.10MB/s]
Downloading pytorch_model.bin: 100%|██████████| 3.15M/3.15M [00:00<00:00, 81.7MB/s]
Downloading (…)9fb15c7233/README.md: 100%|██████████| 66.3k/66.3k [00:00<00:00, 84.1MB/s]
Downloading (…)b15c7233/config.json: 100%|██████████| 1.53k/1.53k [00:00<00:00, 15.0MB/s]
Downloading (…)ce_transformers.json: 100%|██████████| 122/122 [00:00<00:00, 1.24MB/s]
Downloading pytorch_model.bin: 100%|██████████| 1.34G/1.34G [00:05<00:00, 259MB/s]
Downloading (…)nce_bert_config.json: 100%|██████████| 53.0/53.0 [00:00<00:00, 324kB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 2.20k/2.20k [00:00<00:00, 13.0MB/s]
Downloading spiece.model: 100%|██████████| 792k/792k [00:00<00:00, 340MB/s]
Downloading (…)c7233/tokenizer.json: 100%|██████████| 2.42M/2.42M [00:00<00:00, 91.8MB/s]
Downloading (…)okenizer_config.json: 100%|██████████| 2.41k/2.41k [00:00<00:00, 14.7MB/s]
Downloading (…)15c7233/modules.json: 100%|██████████| 461/461 [00:00<00:00, 3.17MB/s]
INFO:root:Loading Model: TheBloke/wizardLM-7B-GPTQ, on: cuda
INFO:root:This action can take a few minutes!
INFO:root:Using AutoGPTQForCausalLM for quantized models
load INSTRUCTOR_Transformer
max_seq_length 512
The directory does not exist
Loading Model: TheBloke/wizardLM-7B-GPTQ, on: cuda
Downloading (…)okenizer_config.json: 100%|██████████| 727/727 [00:00<00:00, 5.18MB/s]
Downloading tokenizer.model: 100%|██████████| 500k/500k [00:00<00:00, 263MB/s]
Downloading (…)/main/tokenizer.json: 100%|██████████| 1.84M/1.84M [00:00<00:00, 13.0MB/s]
Downloading (…)in/added_tokens.json: 100%|██████████| 21.0/21.0 [00:00<00:00, 146kB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 435/435 [00:00<00:00, 2.92MB/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO:root:Tokenizer loaded
Downloading (…)lve/main/config.json: 100%|██████████| 809/809 [00:00<00:00, 4.95MB/s]
Downloading (…)quantize_config.json: 100%|██████████| 158/158 [00:00<00:00, 1.01MB/s]
Downloading model.safetensors: 100%|██████████| 4.52G/4.52G [00:17<00:00, 263MB/s]]
INFO:auto_gptq.modeling._base:lm_head not been quantized, will be ignored when make_quant.
WARNING:auto_gptq.nn_modules.qlinear_old:CUDA extension not installed.
INFO:accelerate.utils.modeling:We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
WARNING:auto_gptq.nn_modules.fused_llama_mlp:skip module injection for FusedLlamaMLPForQuantizedModel not support integrate without triton yet.
Downloading (…)neration_config.json: 100%|██████████| 132/132 [00:00<00:00, 1.03MB/s]
The model 'LlamaGPTQForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'LlamaForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MistralForCausalLM', 'MptForCausalLM', 'MusicgenForCausalLM', 'MvpForCausalLM', 'OpenLlamaForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PersimmonForCausalLM', 'PLBartForCausalLM', 'ProphetNetForCausalLM', 'QDQBertLMHeadModel', 'ReformerModelWithLMHead', 'RemBertForCausalLM', 'RobertaForCausalLM', 'RobertaPreLayerNormForCausalLM', 'RoCBertForCausalLM', 'RoFormerForCausalLM', 'RwkvForCausalLM', 'Speech2Text2ForCausalLM', 'TransfoXLLMHeadModel', 'TrOCRForCausalLM', 'XGLMForCausalLM', 'XLMWithLMHeadModel', 'XLMProphetNetForCausalLM', 'XLMRobertaForCausalLM', 'XLMRobertaXLForCausalLM', 'XLNetLMHeadModel', 'XmodForCausalLM'].
INFO:root:Local LLM Loaded
got model HuggingFacePipeline
Params: {'model_id': 'gpt2', 'model_kwargs': None, 'pipeline_kwargs': None}
* Serving Flask app 'run_localGPT_API'
* Debug mode: off
INFO:werkzeug:WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
* Running on all addresses (0.0.0.0)
* Running on http://127.0.0.1:5110
* Running on http://172.17.0.2:5110
INFO:werkzeug:Press CTRL+C to quit
INFO:werkzeug:137.83.113.253 - - [12/Oct/2023 01:47:32] "GET /api/test HTTP/1.1" 200 -
/usr/local/lib/python3.10/dist-packages/transformers/generation/configuration_utils.py:362: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.2` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
warnings.warn(
Here's what happens when I run nvidia-smi:
Thu Oct 12 01:58:44 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06 Driver Version: 525.125.06 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 72C P0 69W / 70W | 7651MiB / 15360MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 4088 C python 7648MiB |
+-----------------------------------------------------------------------------+
Hi, did you have any luck with this? I am also having issues using the GPU.
skip module injection for FusedLlamaMLPForQuantizedModel not support integrate without triton yet. The model 'LlamaGPTQForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MusicgenForCausalLM', 'MvpForCausalLM', 'OpenLlamaForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PLBartForCausalLM', 'ProphetNetForCausalLM', 'QDQBertLMHeadModel', 'ReformerModelWithLMHead', 'RemBertForCausalLM', 'RobertaForCausalLM', 'RobertaPreLayerNormForCausalLM', 'RoCBertForCausalLM', 'RoFormerForCausalLM', 'RwkvForCausalLM', 'Speech2Text2ForCausalLM', 'TransfoXLLMHeadModel', 'TrOCRForCausalLM', 'XGLMForCausalLM', 'XLMWithLMHeadModel', 'XLMProphetNetForCausalLM', 'XLMRobertaForCausalLM', 'XLMRobertaXLForCausalLM', 'XLNetLMHeadModel', 'XmodForCausalLM'].
Why am I seeing this error again and again?
Hello, I got the GPU to work for this program with a GPTQ model. I think you should check your auto-gptq version; mine is auto_gptq-0.3.0+cu118 (a quick way to check is in the sketch below). These are the steps and library versions I used to get it to work.
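A minimal sketch (assuming auto-gptq and torch are already installed in your environment) for checking which auto-gptq build you ended up with; a CPU-only build is one likely cause of the "CUDA extension not installed" warning in the log above, which forces a very slow fallback path:

import auto_gptq
print(auto_gptq.__version__)  # expect a CUDA suffix, e.g. "0.3.0+cu118"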
Download and install Anaconda.
Download and install Nvidia CUDA.
Double-check the CUDA installation using nvcc -V.
Create a virtual environment using conda and verify the Python installation:
conda create -n localGPT python=3.10 -c conda-forge -y
conda activate localGPT
python --version
Install CUDA toolkit 11.7 (optional):
conda install -c conda-forge cudatoolkit=11.7 -y
set CUDA_HOME=%CONDA_PREFIX%
Git clone localGPT and install the required libraries (install PyTorch with CUDA 11.7 support):
git clone https://github.com/PromtEngineer/localGPT.git
cd localGPT
Edit the requirements.txt file inside the folder: comment out the existing bitsandbytes and bitsandbytes-windows entries, then pin/add the following entries:
transformers==4.35.0
sentence-transformers==2.2.2
datasets==2.14.6
qdrant_client
psycopg2
pgvector
bitsandbytes @ https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.40.1.post1-py3-none-win_amd64.whl
auto-gptq @ https://github.com/PanQiWei/AutoGPTQ/releases/download/v0.3.0/auto_gptq-0.3.0+cu118-cp310-cp310-win_amd64.whl
torch @ https://download.pytorch.org/whl/cu117/torch-2.0.1%2Bcu117-cp310-cp310-win_amd64.whl
torchvision @ https://download.pytorch.org/whl/cu117/torchvision-0.15.2%2Bcu117-cp310-cp310-win_amd64.whl
torchaudio @ https://download.pytorch.org/whl/cu117/torchaudio-2.0.2%2Bcu117-cp310-cp310-win_amd64.whl
pip install -r requirements.txt
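After the install, a quick sanity check (a sketch, assuming the CUDA wheels above were used) that PyTorch actually sees the GPU before starting the API:

import torch
print(torch.__version__)              # expect a "+cu117" suffix with the wheels above
print(torch.cuda.is_available())      # expect True
print(torch.cuda.get_device_name(0))  # e.g. "Tesla T4" on a g4dn instance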
Open constants.py and configure the MODEL_ID and MODEL_BASENAME:
MODEL_ID = "TheBloke/Llama-2-7b-Chat-GPTQ"
MODEL_BASENAME = "model.safetensors"
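For reference, a rough sketch (not the exact localGPT code) of how such a GPTQ checkpoint gets loaded with auto-gptq 0.3.x; note that from_quantized takes the basename without the file extension:

from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-7b-Chat-GPTQ",
    model_basename="model",   # "model.safetensors" minus the extension
    use_safetensors=True,
    device="cuda:0",
    use_triton=False,         # matches the "without triton" warning in the log
)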
Run run_localGPT.py and observe the GPU usage from Task Manager under Performance:
python run_localGPT.py
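Besides Task Manager or nvidia-smi, a tiny sketch for confirming programmatically (from inside the running process, after the model has loaded) that the weights actually landed on the GPU:

import torch
print(f"{torch.cuda.memory_allocated() / 1024**3:.2f} GiB allocated on GPU 0")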