`lmql` script not passing args to model invocation
Model args are not processed correctly when the model is invoked like, e.g.:
lmql.model("/home/me/models/phi-1_5", cuda=True, trust_remote_code=True, dtype='float16')
Reproduction:
Fire up an lmql server:
$ lmql serve-model "distilgpt2" --cuda
Then try to swap in a model that needs --trust_remote_code:
import lmql
@lmql.query
def load_big(prompt):
    '''
    sample(temperature=1.2)
    """
    {prompt}[X]
    """
    from
        lmql.model("/home/me/models/phi-1_5", cuda=True, trust_remote_code=True, dtype='float16') # <<<<<<<<<< These args
    where
        len(TOKENS(X)) > 10 and len(TOKENS(X)) < 30
    '''
print(load_big('Hi there'))
The server throws an error:
  File "/home/me/lib/transformers-phi/src/transformers/models/auto/auto_factory.py", line 531, in from_pretrained
    config, kwargs = AutoConfig.from_pretrained(
  File "/home/me/lib/transformers-phi/src/transformers/models/auto/configuration_auto.py", line 1043, in from_pretrained
    trust_remote_code = resolve_trust_remote_code(
  File "/home/me/lib/transformers-phi/src/transformers/dynamic_module_utils.py", line 608, in resolve_trust_remote_code
    raise ValueError(
ValueError: The repository for /home/me/models/phi-1_5 contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co//home/me/models/phi-1_5.
Please pass the argument `trust_remote_code=True` to allow custom code to be run.
This is somewhat by design, but we should probably add better client-side error messages when users try to configure this from the client.
The thinking here is that concrete inference parameters like quantization and the devices the model is loaded on are server specific. So the client can "select" a model (by identifier), but the configuration under which it is loaded is server-side configuration. If we assume two parties in your example, this allows the inference server provider to lock down parameters like trust_remote_code, to make sure only trusted code runs on the server, while still giving clients the option to choose a specific model.
I am open to discussion here though; most people probably use lmql serve-model in a single-client setting, where they own both ends.
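To illustrate that split, here is a rough sketch (not an official recipe: I am assuming serve-model forwards extra flags such as --trust_remote_code and --dtype to from_pretrained the same way it forwards --cuda, so the exact flag spelling may differ). The loader configuration lives with the serving process, and the client only names the model:
# server side: loader flags are set where the model actually runs
$ lmql serve-model "/home/me/models/phi-1_5" --cuda --trust_remote_code --dtype float16
# client side: the query only selects the model by identifier
lmql.model("/home/me/models/phi-1_5")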
I would personally prefer more control in my hands and fewer opinions baked in (e.g. per my suggestion to have a server library, where I just feed in a function with a specific signature and the server code stands it up on some port, or I feed in a custom module.py).
So for instance, if I want to swap models in and out, maybe I need a 4-bit quantization of a large model but float16 for a smaller one, or some models only play nicely in bfloat16, not float16 (falcon comes to mind, although Mistral has made that irrelevant now). Then there's trust_remote_code, and flash_attention_2, which not all models support yet. And I'm sure other models will have their own custom flags.
The alternative is, I guess, to launch different model config sets on different ports, but then you absolutely need the ability to unload models from the client. That also means the unload logic must be handled by the client, but that may not be so bad.
I am working on enabling this soon; it requires some more changes with respect to stopping generation early, though, so it will not be immediately available.
One thing that may be interesting here is lmql.serve, which is a way to configure the model in the same process it will actually run in. See this snippet for an example. When launching an inference endpoint via lmql.serve, the parameters you provide are passed through all the way to the from_pretrained (HF) or Llama(...) (llama.cpp) calls.
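A minimal sketch of that, reusing the parameters from the reproduction above (the linked snippet is authoritative for the exact signature; I am only assuming the keyword arguments are the same ones from_pretrained accepts):
import lmql

# launch the inference endpoint in this process; the keyword arguments are
# handed through to transformers' from_pretrained(...) as described above
lmql.serve("/home/me/models/phi-1_5", cuda=True, trust_remote_code=True, dtype='float16')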
Still, I do want to enable this more generally. To host multiple model configurations in parallel, I would indeed advise using multiple lmql serve-model / lmql.serve processes that serve differently configured models on different ports/endpoints.
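For example (a sketch, assuming serve-model accepts a --port flag and that lmql.model takes an endpoint= argument to pick a server; names and ports here are just placeholders):
# terminal 1: small model with default loader config
$ lmql serve-model "distilgpt2" --cuda --port 8080
# terminal 2: phi-1_5 with its own loader flags
$ lmql serve-model "/home/me/models/phi-1_5" --cuda --trust_remote_code --dtype float16 --port 8081
# client: pick the endpoint per query
small = lmql.model("distilgpt2", endpoint="localhost:8080")
big = lmql.model("/home/me/models/phi-1_5", endpoint="localhost:8081")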