How to unload models from server?
I need to be able to control the lifecycle of models in VRAM to work on smaller devices.
My first approach was to use your local:the_model syntax.
After some pain, I discovered this breaks if I ever import spacy, and I'm sure many other libraries as well, like FastAPI, which I'll need:
import spacy
run_prompt('hi there')  # run_prompt is my @lmql.query helper using the local: model

RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
This is a PyTorch issue: https://github.com/pytorch/pytorch/issues/40403
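For reference, the commonly suggested workaround in that PyTorch thread is to force the 'spawn' start method before anything initializes CUDA. I'm not sure it helps here, since lmql's local: backend manages its own worker process, but a minimal sketch of the idea looks like this:

import multiprocessing as mp

if __name__ == "__main__":
    # Must run before any CUDA context is created in the parent process;
    # child processes are then spawned fresh instead of forked.
    mp.set_start_method("spawn", force=True)

    import spacy            # CUDA-touching imports after the start method is set
    run_prompt('hi there')  # same @lmql.query helper as above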
So my next thought was to read through the documentation and find a way to load/unload models. It doesn't exist.
My next idea, a total hack, was to "unload" the large model by loading in a smaller one:
@lmql.query
def load_big(prompt):
    '''
    sample(temperature=1.2)
    """
    {prompt}[X]
    """
    from
        lmql.model("/home/me/models/llama", cuda=True, trust_remote_code=True, dtype='float16')
    where
        len(TOKENS(X)) > 10 and len(TOKENS(X)) < 30
    '''

@lmql.query
def load_small(prompt):
    '''
    sample(temperature=1.2)
    """
    {prompt}[X]
    """
    from
        lmql.model("distilgpt2", cuda=True, trust_remote_code=True, dtype='float16')
    where
        len(TOKENS(X)) > 10 and len(TOKENS(X)) < 30
    '''
print(load_small('Hi there'))
print(load_big('Hi there'))
print(load_small('Hi there'))
(Aside: the flags here do not get properly passed in to the model, e.g. in lmql.model("/home/me/models/llama", cuda=True, trust_remote_code=True, dtype='float16'). So I had to boot up the server with these flags applied to the small model, where they're irrelevant, so that they would eventually also get passed to the larger model when it was summoned.)
When I call load_small, it does indeed load the smaller model, but it does not unload the bigger model.
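For what it's worth, a quick way to confirm both sets of weights stay resident is to watch nvidia-smi between calls, or to log allocated GPU memory from inside the serving process:

import torch

# Log before and after load_small() to see whether the big model's
# weights were actually released.
print(f"allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1e9:.2f} GB")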
Options:
- It'd be nice if there were a server library; then I could wrap it in an API and control the lifecycle of models directly myself, including getting the right flags to the right models. This would also help me with other issues I was facing previously, like getting GPTQ'd models to work.
- I get that a user might want 2 models in VRAM so they can switch between them quickly. So an option to opt out of that while keeping the server running, i.e. something like del model; gc.collect() (see the sketch after this list), would be nice.
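To illustrate what I mean by both options, here is a minimal sketch of the lifecycle control I'd like to wrap in my own API. It's written against plain Transformers/PyTorch, not against any actual lmql server API, and the endpoint names and single-model policy are made up for illustration:

import gc
import torch
from fastapi import FastAPI
from transformers import AutoModelForCausalLM, AutoTokenizer

app = FastAPI()
_model, _tokenizer = None, None  # at most one model resident in VRAM at a time

@app.post("/load")
def load(path: str):
    """Load a model, evicting whatever is currently resident first."""
    global _model, _tokenizer
    unload()
    _tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
    _model = AutoModelForCausalLM.from_pretrained(
        path, trust_remote_code=True, torch_dtype=torch.float16, device_map="auto"
    )
    return {"loaded": path}

@app.post("/unload")
def unload():
    """Drop the current model and actually hand its VRAM back to the driver."""
    global _model, _tokenizer
    _model, _tokenizer = None, None
    gc.collect()               # drop Python-side references
    torch.cuda.empty_cache()   # release cached CUDA blocks
    return {"unloaded": True}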
Regarding my aside above: I opened an issue that addresses the failure of args to flow through to model instantiation: https://github.com/eth-sri/lmql/issues/230
Hm, I left a process running all day after changing to n=1 in that PR, and the logs are full of the model being loaded and unloaded on each call (twice, in fact!).
So, to be clear:
- I run lmql serve-model "small-model"
- When I need a big model, I let an lmql script load it in: lmql.model("/home/me/models/phi-1_5", cuda=True, trust_remote_code=True, dtype='float16')
- And I see in the logs that it's super slow, loading and unloading for each execution of a generation:
[Unloading /home/me/models/phi-1_5]
[Loading /home/me/models/phi-1_5 with AutoModelForCausalLM.from_pretrained("/home/me/models/phi-1_5", trust_remote_code=True, torch_dtype=torch.bfloat16, device_map=auto)]
[Unloading /home/me/models/phi-1_5]
[/home/me/models/phi-1_5 ready on device cuda:0]
[Loading /home/me/models/phi-1_5 with AutoModelForCausalLM.from_pretrained("/home/me/models/phi-1_5", trust_remote_code=True, torch_dtype=torch.bfloat16, device_map=auto)]
[/home/me/models/phi-1_5 ready on device cuda:0]
It does all of that per execution, so it looks like it unloads and loads the model twice. I will note that for only the first call that swaps models, it merely does this:
[Loading /home/me/models/phi-1_5 with AutoModelForCausalLM.from_pretrained("/home/me/models/phi-1_5", trust_remote_code=True, torch_dtype=torch.bfloat16, device_map=auto)]
[/home/me/models/phi-1_5 ready on device cuda:0]
But for every call after that, it loads and unloads twice.
Thanks for the PR, conceptually the changes look good to me.
So the logs we are seeing suggest that the inference server somehow switches between both models continuously, even though there is only an active user for the big model. Is that correct?
I would expect it to switch to the big model until the small model is requested again.
I would also expect it to stay on the last used model.
And to make matters worse, the logs suggest it's loading and unloading twice per generation.