
llamamodel: prevent CUDA OOM crash by allocating VRAM early

Open cebtenzzre opened this pull request 1 month ago • 2 comments

This is a proposed fix for the issue where CUDA OOM can happen later than expected and crash GPT4All. The question is whether the benefit (falling back early instead of crashing later) is worth the load latency cost.

After a model is loaded onto a CUDA device, we run one full batch of (meaningless) input through it. Small batches don't use as much VRAM, and llama.cpp seems to allocate the full KV cache for the context regardless of where in context the input lies, so n_batch matters a lot but n_past seems not to matter at all.
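For reference, the core of the dummy-batch pass over the public llama.cpp API looks roughly like the sketch below. The helper name testModel() matches the call mentioned in this PR, but the body shown here (BOS tokens as filler, clearing the KV cache afterwards) is an illustrative approximation rather than the exact patch.

```cpp
#include "llama.h"

// Run one full batch of throwaway tokens through the model so that llama.cpp
// allocates its CUDA compute buffers up front. Returns false if decoding
// fails (e.g. out of VRAM), so the caller can fall back to CPU before the
// user ever sends a real prompt.
static bool testModel(llama_context *ctx, const llama_model *model, int n_batch) {
    llama_batch batch = llama_batch_init(n_batch, /*embd*/ 0, /*n_seq_max*/ 1);
    for (int i = 0; i < n_batch; i++) {
        batch.token   [i] = llama_token_bos(model); // meaningless filler input
        batch.pos     [i] = i;
        batch.n_seq_id[i] = 1;
        batch.seq_id  [i][0] = 0;
        batch.logits  [i] = (i == n_batch - 1);     // only request one output
    }
    batch.n_tokens = n_batch;

    // Decoding a full-sized batch exercises the same allocations as real
    // inference; the KV cache is already sized for the whole context at load.
    bool ok = llama_decode(ctx, batch) == 0;

    llama_batch_free(batch);
    llama_kv_cache_clear(ctx); // discard the dummy tokens before real use
    return ok;
}
```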

The call to testModel() is visible in the UI as the progress bar hovering near 100% before the load completes. With 24 layers of Llama 3 8B offloaded, this takes about 2 seconds on my GTX 970 and 0.3 seconds on my Tesla P40. The worst case I have measured, under high memory pressure with a batch size of 512 (which I had to patch in, since the upper limit is normally 128), is about 11.2 seconds; at a batch size of 128 I have seen it take as long as 7.6 seconds.

Testing

You can test this PR by choosing a model that does not fit in your card's VRAM and finding a number of layers to offload that just barely doesn't fit. On the main branch, GPT4All can crash either during load or later, when you send it input. With this PR, an exception is logged to the console during testModel() and GPT4All falls back to CPU, as it already does for Kompute.
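The fallback itself happens on the caller side. Conceptually it is something like the sketch below, where loadOnDevice(), loadOnCpu(), unloadModel(), and loadWithFallback() are hypothetical placeholders rather than GPT4All's actual load API.

```cpp
#include <exception>
#include <iostream>

// Hypothetical wrappers standing in for GPT4All's real load path.
bool loadOnDevice(const char *device, int nGpuLayers); // load with nGpuLayers offloaded
void unloadModel();                                    // free the partially loaded model
bool loadOnCpu();                                      // CPU-only reload
bool testModel();                                      // dummy-batch VRAM check from above

// Try CUDA first; if the dummy batch throws or fails, fall back to CPU,
// mirroring the existing Kompute fallback behavior.
bool loadWithFallback(int nGpuLayers) {
    try {
        if (loadOnDevice("cuda", nGpuLayers) && testModel())
            return true;
        std::cerr << "warning: model test failed, falling back to CPU\n";
    } catch (const std::exception &e) {
        std::cerr << "warning: " << e.what() << ", falling back to CPU\n";
    }
    unloadModel();
    return loadOnCpu();
}
```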

cebtenzzre · May 30 '24 22:05