`Chat` consumes more VRAM than `Generate`
Bug description
Hi there 👋
While working on the integration of Gemma 2 (the 9B variant), I noticed that the regular generate script barely fits into an L4 with 24 GB, while the chat script throws an OOM error. It doesn't look like a Gemma 2 9B specific problem.
I tried a couple of other models, in regular and quantized form, with a single prompt:
What is the distance between Earth and the Moon?
Machine: Lightning Studio with 1xL4.
| Model | Chat | Generate | $Δ$ | Chat (bnb.nf4) | Generate (bnb.nf4) | $Δ$ |
|---|---|---|---|---|---|---|
| Phi-3 | 9.29 | 7.78 | 1.51 | 4.33 | 2.83 | 1.5 |
| TinyLlama (chat) | 2.60 | 2.30 | 0.3 | 1.38 | 1.07 | 0.31 |
| Gemma 1 7b-it | 20.58 | 18.83 | 1.75 | 11.45 | 9.69 | 1.76 |
| Gemma 2 9b-it | OOM | 20.58 | - | 16.48 | 10.98 | 5.5 |
* Memory is in GB.
Chat is essentially the generate script running in a loop, so it should not consume more memory, at least when a single prompt is provided. Since the gap between chat and generate is roughly the same for the regular and the quantized variant of each model, I assume, without even looking at the code, that something is off with memory preallocation (the KV cache?).
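For a rough sanity check (this is a generic transformer KV-cache estimate, not litgpt's exact cache layout, and the config numbers below are illustrative rather than taken from any model in the table): the preallocated cache grows linearly with the sequence length it is sized for, and its dtype is unaffected by weight-only quantization such as bnb.nf4, which would explain why the gap shows up identically in both columns.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Rough size of a preallocated KV cache: K and V tensors for every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative config values, not any specific model from the table above.
full_context = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
short_prompt = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=256)
print(f"cache sized for the full context:    {full_context / 1e9:.2f} GB")
print(f"cache sized for prompt + new tokens: {short_prompt / 1e9:.2f} GB")
```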
What operating system are you using?
Linux
LitGPT Version
Version: 0.4.3.dev0
Thanks for reporting. Yes, this is weird. My first thought would also be that it's something with the KV cache. I assume the maximum number of new tokens is the same, right?
It looks like in the generate script we specify how many tokens to return, and that value is calculated as the length of the prompt + `max_new_tokens`:
https://github.com/Lightning-AI/litgpt/blob/3a4526ef9f1f107e78b8ab9dc537a144beb2d680/litgpt/generate/base.py#L261
In the chat script we don't do that and simply use the model's max sequence length: https://github.com/Lightning-AI/litgpt/blob/3a4526ef9f1f107e78b8ab9dc537a144beb2d680/litgpt/chat/base.py#L126-L128
Still, that alone shouldn't affect VRAM consumption, since the output has the same length in both cases, meaning that generation is stopped by the EOS token rather than by this limit.
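For reference, here is a paraphrase of the difference between the two linked call sites (my reading of them, not verbatim repo code), with dummy numbers to make it concrete:

```python
# Dummy values, only to illustrate the two code paths described above.
prompt_length = 16      # tokens in the encoded prompt
max_new_tokens = 50     # generation limit passed to the script
block_size = 8192       # the model's configured max sequence length

generate_max_returned_tokens = prompt_length + max_new_tokens  # generate/base.py
chat_max_returned_tokens = block_size                          # chat/base.py

print(generate_max_returned_tokens, "vs", chat_max_returned_tokens)  # 66 vs 8192
```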
I'll investigate further after I finish with the Gemma 2 integration.
Good observation, this could be related. We can debug this by first making these two consistent.
Could be that in `generate/base.py` we set

```python
# set the max_seq_length to limit the memory usage to what we need
model.max_seq_length = max_returned_tokens
```

and we don't do this in Chat!
Yes, @aniketmaurya, that is exactly the reason.
As I stated above, in the generate script the code overrides `model.max_seq_length` to be `len(prompt) + max_new_tokens`, and this value is then used when preallocating space for the KV cache.
In the chat script this is not done, so the KV cache always has the size of `max_seq_length`, which is equal to `block_size` from the config.
I guess it was done this way because in chat mode you don't know the length of all the prompts beforehand. But it's easily fixable: we can preallocate the KV cache for the first turn in the same fashion as in the generate script and then, if the current turn's prompt is longer than all the previous ones, recreate the KV cache.
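A minimal sketch of that idea (a hypothetical helper; it assumes litgpt's `GPT.set_kv_cache(batch_size=...)` method together with the `model.max_seq_length` override quoted above):

```python
def maybe_grow_kv_cache(model, prompt_length: int, max_new_tokens: int,
                        preallocated: int) -> int:
    """Recreate the KV cache only when this turn needs more room than any
    previous turn did. Hypothetical helper, called once per chat turn before
    generation; returns the (possibly updated) preallocated size."""
    max_returned_tokens = prompt_length + max_new_tokens
    if max_returned_tokens > preallocated:
        model.max_seq_length = max_returned_tokens
        model.set_kv_cache(batch_size=1)  # reallocate at the new size
        return max_returned_tokens
    return preallocated
```

With `preallocated` starting at 0, the first turn sizes the cache exactly like the generate script does.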
I'll create a PR on Monday with the fix.
> But it's easily fixable: we can preallocate the KV cache for the first turn in the same fashion as in the generate script and then, if the current turn's prompt is longer than all the previous ones, recreate the KV cache.
Note that this will incur recompilations if compilation is enabled (which `chat.py` supports). You might want to enable that optimization only when compilation is not enabled.
Good point. I guess an additional message saying that this can lead to higher memory consumption due to a large KV cache would be helpful.
This reminds me: at some point we discussed an option like `optimize="compute" | "memory"` for less advanced users, and this could be a good trade-off for such a setting.
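Purely as a sketch of that idea (`optimize` is not an existing litgpt argument, and the function name here is hypothetical):

```python
def setup_chat_kv_cache(model, prompt_length: int, max_new_tokens: int,
                        optimize: str = "memory", compiled: bool = False) -> None:
    """Hypothetical switch between the two behaviours discussed above."""
    if optimize == "memory" and not compiled:
        # size the cache for this turn only; it may have to be recreated later
        model.max_seq_length = prompt_length + max_new_tokens
    # optimize == "compute" (or a compiled model): keep the full-context cache,
    # so it never has to be rebuilt and no recompilation is triggered
    model.set_kv_cache(batch_size=1)
```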