
`lmql` script not passing args to model invocation

Open freckletonj opened this issue 1 year ago • 3 comments

Model args are not processed correctly when invoking, e.g.:

lmql.model("/home/me/models/phi-1_5", cuda=True, trust_remote_code=True, dtype='float16')

Reproduction:

Fire up an lmql server:

$ lmql serve-model "distilgpt2" --cuda

Then try to swap in a model that needs --trust_remote_code:

import lmql

@lmql.query
def load_big(prompt):
    '''
sample(temperature=1.2)
"""
    {prompt}[X]
"""
from
    lmql.model("/home/me/models/phi-1_5", cuda=True, trust_remote_code=True, dtype='float16')  # <<<<<<<<<< These args
where
    len(TOKENS(X)) > 10 and len(TOKENS(X)) < 30
'''

print(load_big('Hi there'))

The server throws an error:

  File "/home/me/lib/transformers-phi/src/transformers/models/auto/auto_factory.py", line 531, in from_pretrained
    config, kwargs = AutoConfig.from_pretrained(
  File "/home/me/lib/transformers-phi/src/transformers/models/auto/configuration_auto.py", line 1043, in from_pretrained
    trust_remote_code = resolve_trust_remote_code(
  File "/home/me/lib/transformers-phi/src/transformers/dynamic_module_utils.py", line 608, in resolve_trust_remote_code
    raise ValueError(
ValueError: The repository for /home/me/models/phi-1_5 contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co//home/me/models/phi-1_5.
Please pass the argument `trust_remote_code=True` to allow custom code to be run.

freckletonj avatar Oct 05 '23 20:10 freckletonj

This is somewhat by design, but we should probably add better client-side error messages when users try to configure this from the client.

The thinking here is that concrete inference parameters, like quantization and the devices the model is loaded on, are server-specific. The client can "select" a model (by identifier), but the configuration under which it is loaded is server-side configuration. If we assume two parties in your example, this allows the inference server provider to lock down parameters like trust_remote_code and make sure only trusted code runs on the server, while still giving clients the option to choose a specific model.
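Roughly, the split would look like this (the exact serve-model flags are illustrative; the assumption is that extra options are forwarded to the model loader on the server):

# server side: the operator fixes the loading configuration once
$ lmql serve-model "/home/me/models/phi-1_5" --cuda --trust_remote_code --dtype float16

# client side: only selects the model by identifier, no loading args
lmql.model("/home/me/models/phi-1_5")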

I am open to discussion here though; most people probably use lmql serve-model in a single-client setting, where they own both ends.

lbeurerkellner avatar Oct 06 '23 15:10 lbeurerkellner

I would personally prefer more control in my hands and fewer opinions baked in (e.g., per my suggestion of a server library where I just feed in a function with a specific signature and the server code stands it up on some port, or I feed in a custom module.py).
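For concreteness, a hypothetical sketch of what I mean (none of this exists in lmql today; serve_user_model is made up):

import lmql  # serve_user_model below is hypothetical, not a real lmql API
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "/home/me/models/phi-1_5"

def load_model():
    # the user owns every loading decision: dtype, trust_remote_code, device placement, ...
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_PATH,
        torch_dtype=torch.float16,
        trust_remote_code=True,
    ).cuda()
    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
    return model, tokenizer

# hypothetical: lmql wraps the user-supplied loader in an inference endpoint
lmql.serve_user_model(load_model, port=8080)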

So for instance, if I want to swap models in and out, maybe I need a 4-bit quantization of a large model but float16 for a smaller one, or some models only play nicely in bfloat16, not float16 (Falcon comes to mind, although Mistral has made it irrelevant now). Then there are trust_remote_code and flash_attention_2, which not all models support yet. And I'm sure other models will have their own custom flags.

The alternative, I guess, is to launch different model configurations on different ports, but then you absolutely need the ability to unload models from the client. This also means the unload logic must now be handled by the client, but that may not be so bad.
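Concretely, the multi-port setup could look like this (I'm assuming serve-model takes a --port flag and lmql.model an endpoint argument):

# two differently configured servers on two ports
$ lmql serve-model "distilgpt2" --cuda --port 8080
$ lmql serve-model "/home/me/models/phi-1_5" --cuda --trust_remote_code --dtype float16 --port 8081

# clients pick whichever endpoint matches the configuration they need
small = lmql.model("distilgpt2", endpoint="localhost:8080")
big = lmql.model("/home/me/models/phi-1_5", endpoint="localhost:8081")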

freckletonj avatar Oct 06 '23 17:10 freckletonj

I am working on enabling this soon; it requires some more changes with respect to stopping generation early, though, so it will not be immediately available.

One thing that may be interesting here is lmql.serve, which is a way to configure the model in the same process it will actually run in. See this snippet for an example. When launching an inference endpoint via lmql.serve, the parameters you provide are passed through all the way to the from_pretrained (HF) or Llama(...) (llama.cpp) calls.
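A minimal sketch of that in-process setup (keyword names mirror the client-side lmql.model call above; treat the exact signature as an assumption rather than documentation):

import lmql

# runs the inference endpoint in this process; the keyword arguments are
# forwarded to AutoModelForCausalLM.from_pretrained for HF models
lmql.serve(
    "/home/me/models/phi-1_5",
    cuda=True,
    trust_remote_code=True,
    dtype="float16",
)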

Still, I do want to enable this more generally. To host multiple model configurations in parallel, I would indeed advise using multiple lmql serve-model/lmql.serve processes that serve differently configured models on different ports/endpoints.

lbeurerkellner avatar Oct 14 '23 21:10 lbeurerkellner