
Feature Request: Avoid loading GPU layers into RAM before moving them to VRAM. This should allow the use of --no-mmap with models that do not fit in RAM but fit in RAM+VRAM.

Open · ThomasBaruzier opened this issue on Aug 16, 2024 · 5 comments

Prerequisites

  • [X] I am running the latest code. Mention the version if possible as well.
  • [X] I carefully followed the README.md.
  • [X] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [X] I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Hello,

I am currently working on running Llama 3.1 405B and DeepSeek Coder V2 on my setup, which has 128 GB of RAM and 24 GB of VRAM.

To run these large models effectively, I need to avoid disk caching, as it severely impacts performance. This is why I am using the --no-mmap option.
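For reference, the programmatic equivalent of these flags through llama.cpp's C API looks roughly like the sketch below; the model path and layer count are placeholders, and the field names are as found in recent versions of llama.h.

```cpp
#include "llama.h"

int main() {
    llama_backend_init();

    // Roughly equivalent to `-ngl 40 --no-mmap` on the command line.
    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 40;    // offload as many layers as fit in 24 GB of VRAM
    mparams.use_mmap     = false; // read the file instead of mmap-ing it

    llama_model * model = llama_load_model_from_file("llama-3.1-405b-iq2_s.gguf", mparams);
    if (model == NULL) {
        return 1;
    }

    // ... create a context and run inference as usual ...

    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```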

The problem is that, with --no-mmap, llama.cpp first loads the entire model into RAM and only then offloads some of its layers to VRAM.

Because of this, the largest models I can run without dropping into painfully slow tokens-per-minute territory are limited by my RAM capacity alone, rather than by the combined RAM + VRAM.

It would be highly beneficial if the --no-mmap option could be applied only to the layers that remain in RAM, or if the necessary layers could be directly loaded into VRAM.

With these modifications, we could load larger models and higher quantization levels with minimal speed loss, and avoid relying on disk caching when a model fits within the combined RAM + VRAM but not in RAM alone.

Here are the speeds I currently achieve with Llama 3.1 405B Instruct, offloading the maximum number of layers to VRAM in each case:

| Quant   | Size (MB) | Speed (tok/s) | --no-mmap |
|---------|-----------|---------------|-----------|
| IQ2_S   | 121,544   | 0.42          | Enabled   |
| IQ2_M   | 132,116   | 0.38          | Enabled   |
| IQ3_XXS | 150,407   | Crash         | Enabled   |
| IQ3_XXS | 150,407   | 0.02          | Disabled  |

Motivation

It would be very useful to be able to load larger models with higher quants without having to rely on disk caching, which would greatly improve their speed.

Possible Implementation

Either apply the --no-mmap behavior only to the layers that remain in RAM, or load the layers destined for the GPU directly from disk into VRAM, so the full model never has to pass through host memory.
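A minimal sketch of the second option, assuming a CUDA backend: instead of materializing every tensor in host RAM first, the loader could read each GPU-bound tensor from the GGUF file through a small pinned staging buffer and copy it straight into a device allocation. This is not how llama.cpp's loader is currently structured; the function name, offsets, and buffer size below are purely illustrative, and error checks are omitted for brevity.

```cpp
#include <cuda_runtime.h>
#include <algorithm>
#include <cstdio>
#include <fstream>

// Hypothetical helper: copy `size` bytes starting at `file_offset` of `path`
// into a freshly allocated device buffer, streaming through a fixed-size
// pinned staging buffer so host memory usage stays small and constant.
static void * load_tensor_to_vram(const char * path, size_t file_offset, size_t size) {
    const size_t staging_size = 64ull << 20; // 64 MiB staging buffer (tunable)

    void * d_buf = nullptr;
    cudaMalloc(&d_buf, size);

    void * h_staging = nullptr;
    cudaMallocHost(&h_staging, staging_size); // pinned memory for fast H2D copies

    std::ifstream f(path, std::ios::binary);
    f.seekg(static_cast<std::streamoff>(file_offset));

    size_t done = 0;
    while (done < size) {
        const size_t chunk = std::min(staging_size, size - done);
        f.read(static_cast<char *>(h_staging), static_cast<std::streamsize>(chunk));
        cudaMemcpy(static_cast<char *>(d_buf) + done, h_staging, chunk, cudaMemcpyHostToDevice);
        done += chunk;
    }

    cudaFreeHost(h_staging);
    return d_buf; // caller owns the device allocation
}

int main() {
    // Made-up tensor location inside a GGUF file, for illustration only.
    void * d_tensor = load_tensor_to_vram("model.gguf", /*file_offset=*/4096, /*size=*/256ull << 20);
    std::printf("uploaded tensor to VRAM at %p\n", d_tensor);
    cudaFree(d_tensor);
    return 0;
}
```

With an approach like this, layers that stay on the CPU would still be read into host memory as they are today, so peak RAM usage would be bounded by the CPU-resident layers plus one staging buffer rather than by the full model size.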

ThomasBaruzier · Aug 16, 2024 at 17:08