jan
When using the GPU, is the model loaded into VRAM?
Discussed in https://github.com/janhq/jan/discussions/1808
Originally posted by Nord1cWarr1or on January 26, 2024: Can someone please explain to me how this works? I have 32 GB of RAM and 8 GB of VRAM. When I use GPU acceleration, I can't run large models, but when I don't use GPU acceleration, I can run them.
So the original poster wants to run a 13B GGUF model. A quantized ~13B GGUF should take around ~7.8 GB of memory according to the llama.cpp repo. Their system has 32 GB of RAM but only 8 GB of VRAM. If I understand llama.cpp correctly, a GGUF model runs out of system RAM when inference is on the CPU, so the 32 GB pool is what matters in that case.
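As a rough sanity check on the ~7.8 GB figure, model memory scales with parameter count times bits per weight. The sketch below uses hypothetical numbers (the bits-per-weight value approximates a 4-bit quantization; real files carry extra overhead for metadata and the KV cache):

```python
# Rough memory estimate for a quantized GGUF model. The bits-per-weight
# value is an assumption; actual size depends on the quantization scheme.
def gguf_size_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate model size in GB: params * bits / 8, converted to GB."""
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# A 13B model at ~4.8 bits/weight lands near the ~7.8 GB figure
# quoted from the llama.cpp repo:
print(f"{gguf_size_gb(13, 4.8):.1f} GB")  # ~7.8 GB
```

At that size the model fits comfortably in 32 GB of RAM but is a tight squeeze against 8 GB of VRAM once runtime overhead is counted.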
It seems that the Jan Hub recommendation checker checks VRAM instead of RAM when GPU acceleration is turned on. Since Jan is configured to use GGUF, the checker appears to apply a false "not recommended" tag based on VRAM whenever GPU acceleration is enabled.
We will modify the UI to highlight that the recommendation is based on:
- CPU => Not enough RAM
- GPU => Not enough VRAM
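The corrected logic above can be sketched as follows (function and tag names are hypothetical, not Jan's actual implementation; real checks would also reserve headroom for the KV cache and runtime overhead):

```python
# Sketch of a recommendation check that looks at the memory pool the
# model will actually use, depending on whether GPU acceleration is on.
def recommend(model_size_gb: float, ram_gb: float, vram_gb: float,
              gpu_acceleration: bool) -> str:
    if gpu_acceleration:
        # With acceleration on, offloaded layers must fit in VRAM.
        return "Recommended" if model_size_gb <= vram_gb else "Not enough VRAM"
    # CPU inference loads the weights into system RAM instead.
    return "Recommended" if model_size_gb <= ram_gb else "Not enough RAM"

# The original poster's case: a ~7.8 GB model, 32 GB RAM, 8 GB VRAM.
print(recommend(7.8, 32, 8, gpu_acceleration=False))  # Recommended
```

The earlier behavior amounted to always taking the VRAM branch once the GPU toggle was on, which is why a model that fit easily in 32 GB of RAM was still flagged.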
cc: @louis-jan
cc: @hiento09 for the investigation of Vulkan VRAM
@RookHyena, thank you for helping to lead the discussion. We've corrected the recommended tag to account for RAM, VRAM, and whether GPU acceleration is on or off.
There is also an ngl setting that configures the number of layers offloaded to the GPU. It can currently be set via model.json, but we will soon bring it to the GUI.
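As a sketch, an ngl override in model.json might look like the following (the model id and surrounding fields are illustrative; check the model.json shipped with your model for the exact schema):

```json
{
  "id": "example-13b-gguf",
  "settings": {
    "ctx_len": 4096,
    "ngl": 32
  }
}
```

Lowering ngl offloads fewer layers to the GPU, letting a model that overflows 8 GB of VRAM keep the remaining layers in system RAM.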