litgpt
Out of memory error using Python API but not with CLI
I am loading google/gemma-7b-it on a V100 GPU using both the CLI and the Python API. With the CLI it loads fine, but the Python API returns a CUDA out-of-memory error.
The same quantization and precision settings are used in both cases.
CLI command -
litgpt generate google/gemma-7b-it --quantize bnb.nf4 --precision bf16-true --max_new_tokens 256
Python code -
from litgpt import LLM
llm = LLM.load("google/gemma-7b-it", quantize="bnb.nf4", precision="bf16-true")
Hello @shubhamworks,
Thanks for the report.
The issue is the same as for the chat script: #1558. During the load method, the KV cache is created with a size equal to the model's maximum context length. It should be changed as in this PR: #1583
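For context, the extra memory from allocating the KV cache for the full context window can be estimated with a quick back-of-the-envelope sketch. The shape numbers below are illustrative assumptions for a gemma-7b-class model, not values read from the litgpt config:

```python
# Rough KV-cache size when the cache is allocated for the model's full
# context window at load time. All shape values are assumptions for
# illustration, not taken from the litgpt gemma-7b config.
n_layers = 28          # assumed number of transformer blocks
n_kv_heads = 16        # assumed number of key/value heads
head_dim = 256         # assumed per-head dimension
max_seq_length = 8192  # assumed maximum context length
bytes_per_elem = 2     # bf16 element size

# Factor of 2 covers both the key cache and the value cache per layer.
kv_cache_bytes = 2 * n_layers * n_kv_heads * head_dim * max_seq_length * bytes_per_elem
print(f"{kv_cache_bytes / 2**30:.1f} GiB")  # prints "3.5 GiB"
```

Under these assumptions, the cache alone adds on the order of 3.5 GiB on top of the quantized weights, which is enough to push a 16 GB card over the edge; sizing it to the actual generation length instead (as in #1583) avoids that.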