jan
When using the GPU, is the model loaded into VRAM?
Discussed in https://github.com/janhq/jan/discussions/1808
Originally posted by Nord1cWarr1or on January 26, 2024: Can someone please explain to me how this works? I have 32 GB of RAM and 8 GB of VRAM. When I use GPU acceleration, I can't run large models, but when I don't use GPU acceleration, I can run them.
So the original poster wants to run a 13B GGUF model. A quantized ~13B GGUF should take around ~7.8 GB of memory according to the llama.cpp repo. Their system has 32 GB of RAM but only 8 GB of VRAM. If I understand llama.cpp correctly, a GGUF model runs out of system RAM when inference is on the CPU, so the 32 GB pool is what matters in that case.
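As a rough sanity check on the ~7.8 GB figure, model memory scales with parameter count times bits per weight. The sketch below uses hypothetical numbers (the bits-per-weight value approximates a 4-bit quantization; real files carry extra overhead for metadata and the KV cache):

```python
# Rough memory estimate for a quantized GGUF model. The bits-per-weight
# value is an assumption; actual size depends on the quantization scheme.
def gguf_size_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate model size in GB: params * bits / 8, converted to GB."""
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# A 13B model at ~4.8 bits/weight lands near the ~7.8 GB figure
# quoted from the llama.cpp repo:
print(f"{gguf_size_gb(13, 4.8):.1f} GB")  # ~7.8 GB
```

At that size the model fits comfortably in 32 GB of RAM but is a tight squeeze against 8 GB of VRAM once runtime overhead is counted.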
It seems that the Jan Hub recommendation checker checks VRAM instead of RAM when GPU acceleration is turned on. Since Jan is configured to use GGUF, the checker appears to apply a false "not recommended" tag based on VRAM whenever GPU acceleration is enabled.
We will modify the UI to highlight that the recommendation is based on:
- CPU => Not enough RAM
- GPU => Not enough VRAM
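The corrected logic above can be sketched as follows (function and tag names are hypothetical, not Jan's actual implementation; real checks would also reserve headroom for the KV cache and runtime overhead):

```python
# Sketch of a recommendation check that looks at the memory pool the
# model will actually use, depending on whether GPU acceleration is on.
def recommend(model_size_gb: float, ram_gb: float, vram_gb: float,
              gpu_acceleration: bool) -> str:
    if gpu_acceleration:
        # With acceleration on, offloaded layers must fit in VRAM.
        return "Recommended" if model_size_gb <= vram_gb else "Not enough VRAM"
    # CPU inference loads the weights into system RAM instead.
    return "Recommended" if model_size_gb <= ram_gb else "Not enough RAM"

# The original poster's case: a ~7.8 GB model, 32 GB RAM, 8 GB VRAM.
print(recommend(7.8, 32, 8, gpu_acceleration=False))  # Recommended
```

The earlier behavior amounted to always taking the VRAM branch once the GPU toggle was on, which is why a model that fit easily in 32 GB of RAM was still flagged.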
cc: @louis-jan
cc: @hiento09 for the investigation of Vulkan VRAM
@RookHyena, thank you for helping to lead the discussion. We've corrected the recommended tag to account for RAM, VRAM, and whether GPU acceleration is on or off.
There is also an ngl setting that configures the number of layers offloaded to the GPU. It can currently be set via model.json, but we will soon bring it to the GUI.
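As a sketch, an ngl override in model.json might look like the following (the model id and surrounding fields are illustrative; check the model.json shipped with your model for the exact schema):

```json
{
  "id": "example-13b-gguf",
  "settings": {
    "ctx_len": 4096,
    "ngl": 32
  }
}
```

Lowering ngl offloads fewer layers to the GPU, letting a model that overflows 8 GB of VRAM keep the remaining layers in system RAM.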