litgpt
`Chat` consumes more VRAM than `Generate`
Bug description
Hi there 👋
While working on the integration of Gemma 2 (the 9B variant), I noticed that the regular generate script barely fits into an L4 with 24 GB, while the chat script throws an OOM error. This does not appear to be specific to Gemma 2 9B.
I tried a couple of other models, in both regular and quantized form, with a single prompt:
What is the distance between Earth and the Moon?
Machine: Lightning Studio with 1xL4.
| Model | Chat (GB) | Generate (GB) | Δ | Chat, bnb.nf4 (GB) | Generate, bnb.nf4 (GB) | Δ |
|---|---|---|---|---|---|---|
| Phi-3 | 9.29 | 7.78 | 1.51 | 4.33 | 2.83 | 1.50 |
| TinyLlama (chat) | 2.60 | 2.30 | 0.30 | 1.38 | 1.07 | 0.31 |
| Gemma 1 7b-it | 20.58 | 18.83 | 1.75 | 11.45 | 9.69 | 1.76 |
| Gemma 2 9b-it | OOM | 20.58 | - | 16.48 | 10.98 | 5.50 |
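The Δ column can be sanity-checked from the numbers above: the chat/generate gap is nearly identical with and without quantization. A minimal script over the table's values (regular precision vs. bnb.nf4):

```python
# Memory figures (GB) from the table above, as (chat, generate) pairs.
measurements = {
    "Phi-3":            {"regular": (9.29, 7.78),   "nf4": (4.33, 2.83)},
    "TinyLlama (chat)": {"regular": (2.60, 2.30),   "nf4": (1.38, 1.07)},
    "Gemma 1 7b-it":    {"regular": (20.58, 18.83), "nf4": (11.45, 9.69)},
}

for model, runs in measurements.items():
    deltas = {k: round(chat - gen, 2) for k, (chat, gen) in runs.items()}
    print(model, deltas)
    # The gap barely changes when the weights are quantized, which points at
    # an allocation whose size does not depend on the weight dtype.
    assert abs(deltas["regular"] - deltas["nf4"]) <= 0.02
```

The fact that quantizing the weights leaves the gap untouched is what suggests the extra memory is activations or cache, not weights.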
Chat is essentially the generate script running in a loop, so it should not consume more memory, at least when only a single prompt is provided. Since the difference in memory consumption between the regular and the quantized model stays roughly the same, I assume, without even looking at the code, that something is wrong with memory preallocation (the KV cache?).
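If the extra memory really is a preallocated KV cache, its size is easy to estimate from the model config. A back-of-the-envelope sketch (the Gemma 2 9B figures used here — 42 layers, 8 KV heads, head size 256, bf16 weights — come from the published model config and are my assumptions, as is the idea that chat reserves the full context while generate sizes the cache to prompt + `max_new_tokens`; this is not necessarily what litgpt actually allocates):

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 max_seq_length: int, batch_size: int = 1,
                 bytes_per_elem: int = 2) -> float:  # 2 bytes for bf16
    """Rough size of a preallocated KV cache: keys + values for every layer."""
    elems = 2 * n_layers * n_kv_heads * head_dim * max_seq_length * batch_size
    return elems * bytes_per_elem / 1024**3

# Assumed Gemma 2 9B attention config: 42 layers, 8 KV heads, head dim 256.
short = kv_cache_gib(42, 8, 256, 1024)   # cache sized to a short generation
full = kv_cache_gib(42, 8, 256, 8192)    # cache sized to the full context
print(f"seq 1024: {short:.2f} GiB, seq 8192: {full:.2f} GiB")
```

A cache sized to the full 8K context is a couple of GiB larger than one sized to a short generation, which is in the right ballpark for the Δ values in the table.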
What operating system are you using?
Linux
LitGPT Version
Version: 0.4.3.dev0