
`Chat` consumes more VRAM than `Generate`

**Open** · Andrei-Aksionov opened this issue 7 months ago · 7 comments

Bug description

Hi there 👋

While working on the integration of Gemma 2 (9B variant), I noticed that while the regular generate script barely fits on an L4 with 24 GB, the chat script throws an OOM error. It doesn't look like a Gemma 2 9B-specific problem.

I tried a couple of other models, in both regular and quantized (bnb.nf4) form, with a single prompt:

> What is the distance between Earth and the Moon?

Machine: Lightning Studio with 1xL4.

| Model | Chat | Generate | Δ | Chat (bnb.nf4) | Generate (bnb.nf4) | Δ |
|---|---|---|---|---|---|---|
| Phi-3 | 9.29 | 7.78 | 1.51 | 4.33 | 2.83 | 1.50 |
| TinyLlama (chat) | 2.60 | 2.30 | 0.30 | 1.38 | 1.07 | 0.31 |
| Gemma 1 7b-it | 20.58 | 18.83 | 1.75 | 11.45 | 9.69 | 1.76 |
| Gemma 2 9b-it | OOM | 20.58 | - | 16.48 | 10.98 | 5.50 |

\* memory is in GB

Chat is essentially the generate script running in a loop, so it should not consume more memory, at least when only a single prompt is provided. Since the Chat-vs-Generate gap stays roughly the same whether the model is quantized or not, I assume, without even looking at the code, that something is wrong with memory preallocation (a KV cache sized for the full context window?).
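To sanity-check the KV-cache hypothesis with back-of-the-envelope arithmetic: below is a rough sizing sketch, not litgpt's actual allocation logic. The Gemma 2 9B numbers (42 layers, 8 KV heads, head dim 256, 8192-token context) are taken from the published model config; the 512-token "prompt + response" length is an arbitrary assumption for comparison.

```python
# Rough KV-cache size: 2 tensors (K and V) per layer, each of shape
# (batch, n_kv_heads, seq_len, head_dim), at dtype_bytes per element.
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2, batch: int = 1) -> int:
    return 2 * batch * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

GIB = 1024 ** 3

# Gemma 2 9B with a bf16 cache (2 bytes/element), sized for the
# full 8192-token context window:
full = kv_cache_bytes(n_layers=42, n_kv_heads=8, head_dim=256, seq_len=8192)

# Versus sizing it only for prompt + max_new_tokens (assumed ~512 tokens):
short = kv_cache_bytes(n_layers=42, n_kv_heads=8, head_dim=256, seq_len=512)

print(f"full context: {full / GIB:.2f} GiB")   # 2.62 GiB
print(f"512 tokens:   {short / GIB:.2f} GiB")  # 0.16 GiB
```

A ~2.6 GiB cache alone doesn't account for the whole 5.50 GB Gemma 2 delta (other buffers may contribute), but it illustrates the key symptom: a cache preallocated for the full context adds a per-model constant that is independent of weight quantization, matching the near-identical Δ columns in the table above.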

What operating system are you using?

Linux

LitGPT Version

Version: 0.4.3.dev0

Andrei-Aksionov · Jul 07 '24 12:07