litgpt
Out of memory error using Python API but not with CLI
I am loading google/gemma-7b-it on a V100 GPU using both the CLI and the Python API. With the CLI it loads fine, but the Python API returns a CUDA out-of-memory error.
The same quantization and precision settings are used in both cases.
CLI command -
litgpt generate google/gemma-7b-it --quantize bnb.nf4 --precision bf16-true --max_new_tokens 256
Python code -
from litgpt import LLM
llm = LLM.load("google/gemma-7b-it", quantize="bnb.nf4", precision="bf16-true")
Hello @shubhamworks,
Thanks for the report.
The issue is the same as for the chat script: #1558. During the load method, the KV cache is created with a size equal to the model's maximum context length. It should be changed as in this PR: #1583
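For context, the extra memory from allocating the KV cache for the full context window can be estimated with a quick back-of-the-envelope sketch. The shape numbers below are illustrative assumptions for a gemma-7b-class model, not values read from the litgpt config:

```python
# Rough KV-cache size when the cache is allocated for the model's full
# context window at load time. All shape values are assumptions for
# illustration, not taken from the litgpt gemma-7b config.
n_layers = 28          # assumed number of transformer blocks
n_kv_heads = 16        # assumed number of key/value heads
head_dim = 256         # assumed per-head dimension
max_seq_length = 8192  # assumed maximum context length
bytes_per_elem = 2     # bf16 element size

# Factor of 2 covers both the key cache and the value cache per layer.
kv_cache_bytes = 2 * n_layers * n_kv_heads * head_dim * max_seq_length * bytes_per_elem
print(f"{kv_cache_bytes / 2**30:.1f} GiB")  # prints "3.5 GiB"
```

Under these assumptions, the cache alone adds on the order of 3.5 GiB on top of the quantized weights, which is enough to push a 16 GB card over the edge; sizing it to the actual generation length instead (as in #1583) avoids that.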