
Fallback to CPU with OOM even though GPU *should* have more than enough

Open kasperske opened this issue 9 months ago • 24 comments

System Info

version: 1.0.12
platform: Windows
python: 3.11.4
graphics card: NVIDIA RTX 4090 24 GB

Information

  • [ ] The official example notebooks/scripts
  • [X] My own modified scripts

Reproduction

run the following code

from gpt4all import GPT4All
model = GPT4All("wizardlm-13b-v1.1-superhot-8k.ggmlv3.q4_0", device='gpu') # device='amd', device='intel'
output = model.generate("Write a Tetris game in python scripts", max_tokens=4096)
print(output)

Expected behavior

Found model file at C:\Users\earne\.cache\gpt4all\wizardlm-13b-v1.1-superhot-8k.ggmlv3.q4_0.bin
llama.cpp: loading model from C:\Users\earne\.cache\gpt4all\wizardlm-13b-v1.1-superhot-8k.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32001
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_head_kv  = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 6983.73 MB
Error allocating memory ErrorOutOfDeviceMemory
error loading model: Error allocating vulkan memory.
llama_load_model_from_file: failed to load model
LLAMA ERROR: failed to load model from C:\Users\earne\.cache\gpt4all\wizardlm-13b-v1.1-superhot-8k.ggmlv3.q4_0.bin
LLaMA ERROR: prompt won't work with an unloaded model!

kasperske avatar Oct 20 '23 16:10 kasperske
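
For anyone reproducing this, a minimal sketch of an explicit GPU-then-CPU fallback is below. It assumes the GPT4All constructor raises an exception when the Vulkan allocation fails; depending on the binding version it may instead fall back to CPU silently, as the issue title describes.

from gpt4all import GPT4All

MODEL = "wizardlm-13b-v1.1-superhot-8k.ggmlv3.q4_0"  # model from the report above

def load_with_fallback(model_name: str) -> GPT4All:
    """Try the Vulkan GPU backend first; retry on CPU if the load fails."""
    try:
        # device='gpu' asks GPT4All to place the whole model in VRAM
        return GPT4All(model_name, device="gpu")
    except Exception as err:  # exact exception type depends on the binding version
        print(f"GPU load failed ({err}); retrying on CPU")
        return GPT4All(model_name, device="cpu")

model = load_with_fallback(MODEL)
print(model.generate("Write a Tetris game in python scripts", max_tokens=512))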

This is because you don't have enough VRAM available to load the model. Yes, I know your GPU has a lot of VRAM, but you probably have this GPU set as the primary GPU in your BIOS, which means Windows is using some of it for the desktop. I believe the issue is that although you have a lot of memory available, it isn't contiguous because of fragmentation caused by Windows.

manyoso avatar Oct 28 '23 22:10 manyoso
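
A quick way to check how much VRAM is actually free before loading is sketched below (NVIDIA only, assuming pynvml is installed). It reports the total free bytes, but note that it cannot show whether that free memory forms one contiguous region, which is the constraint described above.

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU
info = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"total: {info.total / 2**30:.1f} GiB")
print(f"used : {info.used / 2**30:.1f} GiB (desktop, browsers, other apps)")
print(f"free : {info.free / 2**30:.1f} GiB")
pynvml.nvmlShutdown()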

This is because you don't have enough VRAM available to load the model. Yes, I know your GPU has a lot of VRAM, but you probably have this GPU set as the primary GPU in your BIOS, which means Windows is using some of it for the desktop. I believe the issue is that although you have a lot of memory available, it isn't contiguous because of fragmentation caused by Windows.

Absolutely not the case. I have tried loading a model that takes at most 5-6 GB on my RTX 3090 and it doesn't work. I can load other machine learning applications and use 20 GB. There is definitely a problem here. Sitting on the desktop DOES NOT take 20+ GB of VRAM.

PHIL-GIBSON-1990 avatar Oct 29 '23 04:10 PHIL-GIBSON-1990

I believe what manyoso is saying is that our Vulkan backend currently requires a contiguous chunk of memory to be available, as it allocates one big chunk instead of smaller chunks like other machine learning frameworks do. This means it would probably work fine if you didn't have other things using small chunks in the middle of your VRAM. We still intend to fix this issue :)

cebtenzzre avatar Oct 29 '23 05:10 cebtenzzre

It seems that there is no way around this? I have dual 3090s, and specifically selecting either of them will throw this error. I'm not sure that the information about "contiguous blocks" in memory is useful, as there is generally no way to enable specific use of GPUs in the BIOS, and this really shouldn't be an issue as I understand it. Has anyone found a workaround?

BryceDrechselSmith avatar Nov 13 '23 22:11 BryceDrechselSmith

On my 16 GB RTX, only models smaller than 4 GB run on the GPU; such a model uses 5 GB of VRAM whether it is generating or not... I can log it with GPU-Z.

Another model, 8 GB in size, uses ~9 GB of VRAM and runs only on the CPU (it always says "out of VRAM").

-> So my conclusion is that it's a simple programming error, as the model doesn't use that much more VRAM than its actual size.

kalle07 avatar Nov 26 '23 16:11 kalle07

Models that run on the GPU on my 16 GB RTX (how well, I cannot say) ;)

nearly all TinyLlama models

and one German model, sauerkrautlm-3b-v1.Q4_1

and the built-in download versions of orca-2-7b.Q4_0.gguf and GPT4All Falcon

Often only the Q4 models work.

kalle07 avatar Nov 28 '23 20:11 kalle07

Often only the Q4 models work.

We only support GPU acceleration of Q4_0 and Q4_1 quantizations at the moment.

cebtenzzre avatar Nov 29 '23 04:11 cebtenzzre
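
A small sketch for picking a compatible model follows, assuming your gpt4all version exposes GPT4All.list_models() and that its entries carry a 'filename' field:

from gpt4all import GPT4All

# keep only the downloadable models whose quantization the Vulkan backend can accelerate
gpu_friendly = [
    m["filename"]
    for m in GPT4All.list_models()
    if "q4_0" in m["filename"].lower() or "q4_1" in m["filename"].lower()
]
print("\n".join(gpu_friendly))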

sauerkrautlm-7b-hero.Q5_K_M.gguf is a German model that runs on the CPU, but it runs very well, including with LocalDocs.

kalle07 avatar Dec 01 '23 16:12 kalle07

We only support GPU acceleration of Q4_0 and Q4_1 quantizations at the moment.

I can't load a Q4_0 into VRAM on either of my 4090s, each with 24 GB. I came to this issue after my duplicate issue was closed.

I can literally open the exact model downloaded by GPT4All, orca-2-13b.Q4_0.gguf, in textgen-webui, offload ALL layers to GPU, and see a speed increase. I can use the exact same model as GPTQ and see a HUGE speed increase even over the GGUF-when-fully-in-VRAM.

Why can't we use GPTQ? I don't understand why so many LLM apps are so limited, and so dead-set on slow, CPU generation. Why not just include the option for GPU by default and fall back to CPU for those that don't have it? Let's face it, not many people on PC are trying out local LLMs without GPUs.

Anyway, you say you support Q4_0 and Q4_1 for GPU, but that model WILL NOT load into VRAM when I have 24 GB, and a different LLM app WILL load the exact same file into VRAM. So the problem is clearly with GPT4All.

ewebgh33 avatar Dec 21 '23 02:12 ewebgh33

I can't load a Q4_0 into VRAM on either of my 4090s, each with 24 GB.

Just so you're aware, GPT4All uses a completely different GPU backend than the other LLM apps you're familiar with - it's an original implementation based on Vulkan. It's still in its early stages (because bugs like this need to be fixed before it can be considered mature), but the main benefit is that it's easy to support NVIDIA, AMD, and Intel all with the same code.

exllama2 is great if you have two 4090s - GPT4All in its current state probably isn't for you, as it definitely doesn't take full advantage of your hardware. But many of our users do not have access to such impressive GPUs (myself included) and benefit from features that llama.cpp makes it relatively easy to support, such as partial GPU offload - which we haven't implemented yet, but plan to.

cebtenzzre avatar Dec 21 '23 03:12 cebtenzzre

Thanks for the explanation! So basically I need to wait until this Vulkan thing is... better?

I appreciate that you want to support all of Mac, AMD, and Nvidia; that's a great goal. But I'd be more likely to make this my main app once full GPU support arrives.

The main reason I am looking at tools like GPT4All is that the more basic tools like textgen-webui or LMStudio don't have pipelines for RAG. GPT4All got a few recommendations in a Reddit post where I asked about various LLM+RAG pipelines, so I wanted to test it out.

I've tested a few now, and like GPT4All, they all end up CPU-bound with rough or no GPU support. Honestly, CPU speed is incredibly painful and I can't live with it being that slow! :)

ewebgh33 avatar Dec 21 '23 03:12 ewebgh33

What about adding the ability to connect to a different API? textgen-webui supports the OpenAI API standard, and in fact other LLM apps can connect to textgen-webui for GPU support. Would you consider adding that ability as a stopgap until Vulkan improves? It would keep your existing Mac/AMD compatibility but open up a whole new world to other GPU users.

ewebgh33 avatar Dec 21 '23 03:12 ewebgh33
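
A minimal sketch of that stopgap: point an OpenAI-compatible client at a local textgen-webui (or similar) server that already handles the GPU work. The base URL, port, and model name here are assumptions; use whatever your local server actually exposes.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="not-needed-locally")
resp = client.chat.completions.create(
    model="local-model",  # most local servers ignore or loosely match this field
    messages=[{"role": "user", "content": "Write a Tetris game in Python."}],
    max_tokens=512,
)
print(resp.choices[0].message.content)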

What about adding the ability to connect to a different API?

Since we already support connecting to ChatGPT, that would be a reasonable feature request - you should open an issue for it.

cebtenzzre avatar Dec 21 '23 03:12 cebtenzzre

Thanks, I'll look into opening one. Glad to hear you are receptive to this!

I have found in other LLM apps that it takes a dedicated GUI option somewhere, since a local API doesn't require authentication the way OpenAI does; it's simply an endpoint. Or something like that anyway!

Cheers and thanks again, Em

ewebgh33 avatar Dec 21 '23 04:12 ewebgh33

I had these issues and switched over to using transformers (HuggingFacePipeline), and now I can take advantage of dual 3090s.

BryceDrechselSmith avatar Dec 21 '23 17:12 BryceDrechselSmith
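
A rough sketch of that transformers route: device_map="auto" (via the accelerate package) shards the model's layers across both 3090s automatically. The model id is only an example, not necessarily what the commenter used.

from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="microsoft/Orca-2-13b",  # example model id
    device_map="auto",             # requires `pip install accelerate`; splits layers across GPUs
    torch_dtype="auto",
)
print(pipe("Write a Tetris game in Python.", max_new_tokens=256)[0]["generated_text"])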

This error has been around for two months now :D

kalle07 avatar Dec 21 '23 18:12 kalle07

Possible GPU usage ... tested with version 1.6.1 today!

My VRAM is 16 GB (RTX 4060); only models smaller than 3.8 GB work on my GPU (not counting LocalDocs, which always runs on the CPU).

So try models like: wizardlm-7b-v1.0-uncensored.Q3_K_M.gguf, open_llama_3b_code_instruct_0.1.q4_k_m.gguf, syzymon-long_llama_3b_instruct-Q4_K_M.gguf

or any model smaller than about 1/4 of your max VRAM.

THY me alone :)

kalle07 avatar Jan 20 '24 20:01 kalle07
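
A tiny sketch of that rule of thumb: only ask for the GPU if the model file is well under your VRAM budget. The 1/4 factor is the commenter's empirical observation, not an official GPT4All requirement, and the path and VRAM size here are assumptions.

import os

VRAM_BYTES = 16 * 2**30  # e.g. a 16 GB card
model_path = os.path.expanduser("~/.cache/gpt4all/wizardlm-7b-v1.0-uncensored.Q3_K_M.gguf")

size = os.path.getsize(model_path)
device = "gpu" if size < VRAM_BYTES / 4 else "cpu"
print(f"model is {size / 2**30:.1f} GiB -> using device={device}")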

I am in the same boat. I can load 8 GB models but cannot load 16 GB ones.

GPU processor:		NVIDIA RTX A2000 8GB Laptop GPU
Driver version:		532.09
Total available graphics memory:	24411 MB
Dedicated video memory:	8192 MB GDDR6
System video memory:	0 MB
Shared system memory:	16219 MB

There is 8+16 = 24 GB available in total. When loading 8 GB models I can see that about half goes to shared RAM and half to dedicated memory, so after loading an 8 GB model I still have about 4 GB of dedicated VRAM free. I can even run GPT4All twice and load two 8 GB models, filling the dedicated memory to 8 GB; however, with one 16 GB model I get "out of VRAM?".

fanoush avatar Feb 15 '24 15:02 fanoush

It's an error related to Vulkan, which they have believed in for five months now ^^

kalle07 avatar Feb 15 '24 16:02 kalle07

You should be able to use partial offloading now to load some number of the model's layers into VRAM, even for 16 GB models. I'm going to wait a bit for those who have experienced issues to comment and verify that they can use partial offloading, but in the absence of new comments this issue will be closed as fixed now that partial offloading is supported.

manyoso avatar Mar 06 '24 13:03 manyoso
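
For reference, a minimal sketch of partial offload from the Python side, assuming your gpt4all build exposes the ngl (number of GPU layers) constructor parameter; older releases may not have it, and the layer count is just an example.

from gpt4all import GPT4All

model = GPT4All(
    "orca-2-13b.Q4_0.gguf",
    device="gpu",
    ngl=36,  # offload only some of the model's layers; tune this to fit your VRAM
)
print(model.generate("Hello", max_tokens=64))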

in the absence of new comments this issue will be closed as fixed now that partial offloading is supported.

Although it may be possible for some users to mitigate this issue with partial offloading, it is still an issue - people should be able to fully offload models with GPT4All that they can fully offload with every other LLM app using the CUDA backend.

cebtenzzre avatar Mar 06 '24 17:03 cebtenzzre

Although it may be possible for some users to mitigate this issue with partial offloading, it is still an issue - people should be able to fully offload models with GPT4All that they can fully offload with every other LLM app using the CUDA backend.

If this is the case, and the issue is not one of contiguous regions, then we must be requiring some flag on the memory that others do not.

manyoso avatar Mar 06 '24 21:03 manyoso

Thanks for the tip. I can confirm that I can load the 16 GB Wizard 1.2 when reducing GPU layers to 36 on my NVIDIA RTX A2000 8GB; the allocation then looks like the attached screenshot.

fanoush avatar Mar 10 '24 07:03 fanoush

The problem with partial offloading is that it is difficult: if you work at a higher precision than GPT4All's standard Q4 (Q6 is recommended for professional work), you need about 48 GB or more (34B models need even more). P.S. Some AMD cards, like the R5 430, see (in the Windows Task Manager) the rest of the computer's RAM as reserve (and include it).

gtbu avatar Apr 19 '24 14:04 gtbu