Feature Request: Avoid loading GPU layers into RAM before moving them to VRAM. This should allow the use of --no-mmap with models that do not fit in RAM but fit in RAM+VRAM.
Prerequisites
- [X] I am running the latest code. Mention the version if possible as well.
- [X] I carefully followed the README.md.
- [X] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [X] I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
Hello,
I am currently working on running Llama 405B and DeepSeek Coder V2 on my setup, which includes 128GB of RAM and 24GB of VRAM.
To run these large models effectively, I need to avoid disk caching, as it severely impacts performance. This is why I am using the --no-mmap option.
The problem is that llama.cpp first loads the entire model into RAM before offloading some of its layers into VRAM.
Given this, the largest models I can run without dropping into painfully slow tokens-per-minute territory are limited by my RAM capacity.
It would be highly beneficial if the --no-mmap option could be applied only to the layers that remain in RAM, or if the offloaded layers could be loaded directly into VRAM.
With these modifications, we could load larger models and higher quants with minimal speed loss, avoiding disk caching whenever a model fits within combined RAM + VRAM but not in RAM alone. For example, the IQ3_XXS quant below is roughly 150 GB: it cannot be staged through 128 GB of RAM in one piece, but with about 24 GB of layers placed directly in VRAM, the remainder fits in RAM.
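To make the current behavior concrete, here is a minimal, hypothetical sketch of the load path described above. It is not the actual llama.cpp loader; the file path and the assumption that the first quarter of the file is offloaded are placeholders for illustration only:

```cpp
// Hypothetical illustration of the behavior described above (not the
// actual llama.cpp loader): with --no-mmap the whole model is first read
// into a malloc'd host buffer, and only then are the offloaded layers
// copied to VRAM, so peak host usage equals the full model size.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    FILE *f = fopen("model.gguf", "rb"); // placeholder path
    if (!f) return 1;
    fseek(f, 0, SEEK_END);
    size_t model_bytes = (size_t) ftell(f);
    fseek(f, 0, SEEK_SET);

    // Step 1: the entire model lands in RAM -- this is the bottleneck.
    void *host = malloc(model_bytes);
    if (fread(host, 1, model_bytes, f) != model_bytes) return 1;

    // Step 2: only now are the GPU layers uploaded (here, a made-up
    // first quarter of the file); their RAM was still needed in step 1.
    size_t gpu_bytes = model_bytes / 4;
    void *dev = nullptr;
    cudaMalloc(&dev, gpu_bytes);
    cudaMemcpy(dev, host, gpu_bytes, cudaMemcpyHostToDevice);

    cudaFree(dev);
    free(host);
    fclose(f);
    return 0;
}
```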
Here are the current speeds I achieve with Llama 3.1 405B Instruct, offloading the maximum number of layers for each:
| Model Quant | Size (MB) | Speed (tok/s) | --no-mmap |
|---|---|---|---|
| IQ2_S | 121,544 | 0.42 | Enabled |
| IQ2_M | 132,116 | 0.38 | Enabled |
| IQ3_XXS | 150,407 | Crash | Enabled |
| IQ3_XXS | 150,407 | 0.02 | Disabled |
Motivation
Being able to load larger models and higher quants without relying on disk caching would greatly improve the speed at which these models run.
Possible Implementation
Apply the --no-mmap behavior only to the layers that remain in RAM, and load the offloaded layers directly from disk into VRAM so that the full model never has to pass through host memory. A rough sketch of what that load path could look like follows.
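As an illustration of that idea, here is a hedged sketch of a streaming load path, not a definitive implementation: GPU-bound tensors are read from the file in fixed-size chunks through a pinned staging buffer and copied straight to VRAM, while CPU-resident tensors keep a plain malloc'd buffer. All names here (tensor_info, load_tensor_to_vram, STAGING_BYTES, the offsets) are made up for this example, and in practice the tensor layout would come from the GGUF metadata:

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

struct tensor_info {
    long   file_offset; // where the tensor's data starts in the file
    size_t size;        // bytes
    bool   on_gpu;      // true if this layer is offloaded (e.g. via -ngl)
};

static const size_t STAGING_BYTES = 64 * 1024 * 1024; // 64 MiB chunks

// Stream one tensor from `f` straight into device memory, never holding
// more than STAGING_BYTES of it in host RAM at a time.
static void *load_tensor_to_vram(FILE *f, const tensor_info &t, void *staging) {
    void *dev = nullptr;
    cudaMalloc(&dev, t.size);
    fseek(f, t.file_offset, SEEK_SET);
    for (size_t done = 0; done < t.size; ) {
        size_t chunk = t.size - done < STAGING_BYTES ? t.size - done : STAGING_BYTES;
        if (fread(staging, 1, chunk, f) != chunk) break; // truncated file
        cudaMemcpy((char *)dev + done, staging, chunk, cudaMemcpyHostToDevice);
        done += chunk;
    }
    return dev;
}

// CPU-resident tensors keep the existing --no-mmap idea: a plain malloc'd
// buffer, so the OS cannot drop it the way it can drop mmapped pages.
static void *load_tensor_to_ram(FILE *f, const tensor_info &t) {
    void *buf = malloc(t.size);
    fseek(f, t.file_offset, SEEK_SET);
    if (fread(buf, 1, t.size, f) != t.size) { /* truncated file */ }
    return buf;
}

int main() {
    FILE *f = fopen("model.gguf", "rb"); // placeholder path
    if (!f) return 1;

    // Pinned staging memory speeds up the host-to-device copies.
    void *staging = nullptr;
    cudaMallocHost(&staging, STAGING_BYTES);

    // In reality these offsets/sizes would come from the GGUF metadata;
    // the values here are made up.
    const tensor_info tensors[] = {
        {    4096, 8 * 1024 * 1024, true  }, // offloaded layer -> VRAM
        { 8392704, 8 * 1024 * 1024, false }, // CPU layer       -> RAM
    };
    for (const tensor_info &t : tensors) {
        if (t.on_gpu) load_tensor_to_vram(f, t, staging);
        else          load_tensor_to_ram(f, t);
    }

    cudaFreeHost(staging);
    fclose(f);
    return 0;
}
```

With a fixed-size staging buffer, peak host memory becomes the CPU-resident layers plus one chunk instead of the full model, which is exactly what would let a ~150 GB model run on a 128 GB RAM + 24 GB VRAM machine.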