
very slow GPU compared with CPU

Open EISMANN-DEV opened this issue 2 years ago • 15 comments

Hi all! The model is working great! I am trying to use my 8GB 4060 Ti with MODEL_ID = "TheBloke/vicuna-7B-v1.5-GPTQ" and MODEL_BASENAME = "model.safetensors".

I changed the GPU today; the previous one was old.

But it takes a few minutes to get a result. However, I notice I am now getting these messages while running the model:

2023-09-06 19:16:07,759 - INFO - _base.py:727 - lm_head not been quantized, will be ignored when make_quant.
2023-09-06 19:16:07,760 - WARNING - qlinear_old.py:16 - **CUDA extension not installed.**
2023-09-06 19:16:12,071 - WARNING - fused_llama_mlp.py:306 - skip module injection for FusedLlamaMLPForQuantizedModel not support integrate without triton yet.
C:\Users\a\anaconda3\Lib\site-packages\transformers\generation\configuration_utils.py:362: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
C:\Users\a\anaconda3\Lib\site-packages\transformers\generation\configuration_utils.py:367: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
The model 'LlamaGPTQForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MptForCausalLM', 'MusicgenForCausalLM', 'MvpForCausalLM', 'OpenLlamaForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PLBartForCausalLM', 'ProphetNetForCausalLM', 'QDQBertLMHeadModel', 'ReformerModelWithLMHead', 'RemBertForCausalLM', 'RobertaForCausalLM', 'RobertaPreLayerNormForCausalLM', 'RoCBertForCausalLM', 'RoFormerForCausalLM', 'RwkvForCausalLM', 'Speech2Text2ForCausalLM', 'TransfoXLLMHeadModel', 'TrOCRForCausalLM', 'XGLMForCausalLM', 'XLMWithLMHeadModel', 'XLMProphetNetForCausalLM', 'XLMRobertaForCausalLM', 'XLMRobertaXLForCausalLM', 'XLNetLMHeadModel', 'XmodForCausalLM'].
2023-09-06 19:16:12,184 - INFO - run_localGPT.py:127 - Local LLM Loaded

Can someone tell me what is going on?

EISMANN-DEV avatar Sep 06 '23 17:09 EISMANN-DEV

If you use the nvidia-smi command, what is your VRAM usage?
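
Alternatively, here is a quick sanity check from Python that PyTorch actually sees the card (a minimal sketch using plain PyTorch calls, nothing localGPT-specific; if the first line prints False, everything is running on the CPU no matter what the config says):

```python
# Quick sanity check: does PyTorch see the GPU at all?
import torch

print(torch.cuda.is_available())               # False -> everything runs on the CPU
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))       # e.g. "NVIDIA GeForce RTX 4060 Ti"
    print(torch.version.cuda)                  # CUDA version PyTorch was built against
    free, total = torch.cuda.mem_get_info()    # free/total VRAM in bytes
    print(f"{free / 2**20:.0f} MiB free of {total / 2**20:.0f} MiB")
```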

LeafmanZ avatar Sep 06 '23 20:09 LeafmanZ

Hi! While running it:

```
Wed Sep  6 22:39:05 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 537.13                 Driver Version: 537.13       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4060 Ti   WDDM  | 00000000:01:00.0  On |                  N/A |
| 30%   42C    P2              46W / 160W |   7870MiB /  8188MiB |     98%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2884    C+G   ...72.0_x64__8wekyb3d8bbwe\GameBar.exe      N/A    |
|    0   N/A  N/A      3512    C+G   ...__8wekyb3d8bbwe\WindowsTerminal.exe      N/A    |
|    0   N/A  N/A      7636    C+G   ...siveControlPanel\SystemSettings.exe      N/A    |
|    0   N/A  N/A      8560    C     ...\anaconda3\envs\localGPT\python.exe      N/A    |
|    0   N/A  N/A      9952    C+G   C:\Windows\explorer.exe                     N/A    |
|    0   N/A  N/A     10800    C+G   ...2txyewy\StartMenuExperienceHost.exe      N/A    |
|    0   N/A  N/A     11008    C+G   ...les (x86)\Battle.net\Battle.net.exe      N/A    |
|    0   N/A  N/A     12016    C+G   ...les\Microsoft OneDrive\OneDrive.exe      N/A    |
|    0   N/A  N/A     12176    C+G   ...t.LockApp_cw5n1h2txyewy\LockApp.exe      N/A    |
|    0   N/A  N/A     13348    C+G   ...GeForce Experience\NVIDIA Share.exe      N/A    |
|    0   N/A  N/A     13636    C+G   ...air\Corsair iCUE5 Software\iCUE.exe      N/A    |
|    0   N/A  N/A     14600    C+G   ...CBS_cw5n1h2txyewy\TextInputHost.exe      N/A    |
|    0   N/A  N/A     15100    C+G   ...inaries\Win64\EpicGamesLauncher.exe      N/A    |
|    0   N/A  N/A     15384    C+G   C:\Program Files\NZXT CAM\NZXT CAM.exe      N/A    |
|    0   N/A  N/A     15672    C+G   ...ne\Binaries\Win64\EpicWebHelper.exe      N/A    |
|    0   N/A  N/A     16296    C+G   C:\Program Files\NZXT CAM\NZXT CAM.exe      N/A    |
|    0   N/A  N/A     18364    C+G   ...Programs\Microsoft VS Code\Code.exe      N/A    |
|    0   N/A  N/A     19188    C+G   ...5n1h2txyewy\ShellExperienceHost.exe      N/A    |
|    0   N/A  N/A     19720    C+G   ...nt.CBS_cw5n1h2txyewy\SearchHost.exe      N/A    |
|    0   N/A  N/A     20388    C+G   ...41.0_x64__zpdnekdrzrea0\Spotify.exe      N/A    |
|    0   N/A  N/A     23468    C+G   ...oogle\Chrome\Application\chrome.exe      N/A    |
+---------------------------------------------------------------------------------------+
```

I'm pretty sure something is interfering with this card, since other computers at my work run it well with more or less the same specs. They also give the CUDA message I posted above, but the model is still okay; I can see in Task Manager that the card is being used while generating text.

EISMANN-DEV avatar Sep 06 '23 20:09 EISMANN-DEV

Can an oobabooga installation have an effect on this?

EISMANN-DEV avatar Sep 06 '23 21:09 EISMANN-DEV

So I managed to fix it: first I reinstalled oobabooga with CUDA support (I don't know if it influenced localGPT), then completely reinstalled localGPT and its environment.

EDIT: I read somewhere that there is a problem with memory allocation in the newer NVIDIA drivers. I am currently on 537.13 but have to use 532.03 for it to work. The post I read said the 531 drivers were safe to use, but my 4060 Ti only goes back to 532.03, because the card was released after 531.

EISMANN-DEV avatar Sep 07 '23 07:09 EISMANN-DEV

I'm running Docker on Windows to use a GPTQ model. The response is slow even though it is using a 12GB GPU. What could be the reason, and how do I handle it? Google Colab also uses a 12GB GPU and it is fast. Model: Llama 2 7B Chat GPTQ

Saman28Khan avatar Sep 10 '23 04:09 Saman28Khan

> I'm running Docker on Windows to use a GPTQ model. The response is slow even though it is using a 12GB GPU. What could be the reason, and how do I handle it? Google Colab also uses a 12GB GPU and it is fast. Model: Llama 2 7B Chat GPTQ

Hi,

Have you managed to run this on Google Colab? Can you please share the details of the runtime and the notebook if possible? I am trying to run it in Colab on a T4 GPU with 12GB of CPU RAM and 15GB of GPU RAM, but it keeps crashing after entering the prompt with the following error:

```
Enter a query: how to elect american president
ggml_allocr_alloc: not enough space in the buffer (needed 143278592, largest block available 17334272)
GGML_ASSERT: ggml-alloc.c:139: !"not enough space in the buffer"
```

shishir332 avatar Sep 10 '23 22:09 shishir332

```
!pip install --upgrade tensorrt
!git clone https://github.com/PromtEngineer/localGPT.git
%cd localGPT
!pip install -r requirements.txt
!python ingest.py --device_type cuda
!python run_localGPT.py --device_type cuda
```

Saman28Khan avatar Sep 11 '23 17:09 Saman28Khan

> !pip install --upgrade tensorrt
> !git clone https://github.com/PromtEngineer/localGPT.git
> %cd localGPT
> !pip install -r requirements.txt
> !python ingest.py --device_type cuda
> !python run_localGPT.py --device_type cuda

Thanks, but that doesn't work anymore on a T4 GPU. I tried upgrading to a better GPU on Colab Pro, but to no avail. 👎

shishir332 avatar Sep 11 '23 18:09 shishir332

In the constants.py file, change MODEL_ID to TheBloke/Llama-2-7b-Chat-GPTQ and MODEL_BASENAME to model.safetensors.
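
For reference, the relevant lines in constants.py would then look something like this (a minimal sketch; every other setting in localGPT's constants.py stays as it is):

```python
# constants.py -- only the two settings discussed above are shown.
MODEL_ID = "TheBloke/Llama-2-7b-Chat-GPTQ"  # Hugging Face repo of the GPTQ model
MODEL_BASENAME = "model.safetensors"        # quantized weights file inside that repo
```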

Saman28Khan avatar Sep 12 '23 03:09 Saman28Khan

So I ditched my RTX 4060 Ti and moved to an RTX 4070: 8GB vs. 12GB of VRAM.

I don't get any answer from this model; it just hangs:
MODEL_ID = "TheBloke/Llama-2-13B-GPTQ"
MODEL_BASENAME = "model.safetensors"

And this model:
MODEL_ID = "TheBloke/vicuna-7B-v1.5-GPTQ"
MODEL_BASENAME = "model.safetensors"

just gives a blank answer. Does anyone know what is happening?

EISMANN-DEV avatar Sep 13 '23 15:09 EISMANN-DEV

So I can confirm the models stopped working only because I am now using run_localGPT_v2.py; when going back to run_localGPT.py, it works again. Something for you, @PromtEngineer? Thanks for the effort.

EISMANN-DEV avatar Sep 13 '23 15:09 EISMANN-DEV

@N1h1lv5 I hope with this new update, the issue is solved. Can you please confirm?

PromtEngineer avatar Sep 18 '23 07:09 PromtEngineer

> @N1h1lv5 I hope with this new update, the issue is solved. Can you please confirm?

The new run_localGPT.py is working, but some models still give empty answers, as you know.

EISMANN-DEV avatar Sep 18 '23 08:09 EISMANN-DEV

I tried this Dockerfile with CUDA 11.7 and am observing this error:

  • NVIDIA driver on your system is too old ---> alternatively, go to a PyTorch version built for your driver

@PromtEngineer - any suggestion would be highly appreciated. Thanks in advance.

WIIN-AI avatar Oct 10 '23 05:10 WIIN-AI

I don't know if anyone has tried it, but if you use GPTQ, there is a warning that says to remove the temperature. So I tried removing it, and everything works great.

run_localGPT.py
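
For anyone else hitting this, here is a minimal sketch of what the warnings at the top of this thread are asking for, using the Hugging Face GenerationConfig API (standard transformers arguments; the exact call site in run_localGPT.py may differ):

```python
from transformers import GenerationConfig

# Option A: greedy decoding -- drop the sampling-only knobs entirely,
# which is the "remove the temperature" fix described above.
generation_config = GenerationConfig(max_new_tokens=512)

# Option B: keep temperature/top_p, but actually enable sampling so
# the `do_sample` warnings go away.
generation_config = GenerationConfig(
    do_sample=True,
    temperature=0.9,
    top_p=0.6,
    max_new_tokens=512,
)
```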

Bhavya031 avatar Apr 17 '24 05:04 Bhavya031