
Allow payload request to support extra inference method kwargs

nanbo-liu opened this issue on Aug 23 '23 · 6 comments

from transformers import LlamaForCausalLM, AutoTokenizer, TextGenerationPipeline

# Load the model in 8-bit and build a text-generation pipeline
model = LlamaForCausalLM.from_pretrained("daryl149/llama-2-7b-hf", load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained("daryl149/llama-2-7b-hf")
pipeline = TextGenerationPipeline(model, tokenizer)

# Extra generation kwargs passed straight to the pipeline call
pipeline("Once upon a time,", max_new_tokens=100, return_full_text=False)

max_new_tokens and return_full_text are extra arguments we can pass when calling the pipeline. max_new_tokens is the maximum number of tokens to generate, ignoring the number of tokens in the prompt.

It would be nice if they could be passed in via the payload request, something like:

{
    "inputs": [
        {
            "name": "text_inputs",
            "shape": [2],
            "datatype": "BYTES",
            "data": ["My kitten's name is JoJo,", "Tell me a story:"]
        }
    ],
    "inference_kwargs": {
        "max_new_tokens": 200
    }
}

nanbo-liu avatar Aug 23 '23 19:08 nanbo-liu

@adriangonz, I have a PR for this issue too: https://github.com/SeldonIO/MLServer/pull/1418

nanbo-liu avatar Sep 29 '23 14:09 nanbo-liu

Hey @nanbo-liu ,

As discussed in #1418, we don't have much control over the shape of InferenceRequest, which is kept quite agnostic from specific use cases.

However, the good news is that InferenceRequest objects already contain a parameters field that can be used to specify arbitrary parameters. Would this not be enough for your use case?

Following your example, you could have something like:

{
    "inputs": [
        {
            "name": "text_inputs",
            "shape": [2],
            "datatype": "BYTES",
            "data": ["My kitten's name is JoJo,", "Tell me a story:"]
        }
    ],
    "parameters": {
        "max_new_tokens": 200
    }
}
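
For reference, a minimal sketch of sending that payload over HTTP with Python's requests library; the local endpoint and the model name "transformer" are assumptions and should be adjusted to match your deployment:

import requests

# Payload following the V2 inference protocol, with the extra generation
# argument carried in the top-level "parameters" field.
payload = {
    "inputs": [
        {
            "name": "text_inputs",
            "shape": [2],
            "datatype": "BYTES",
            "data": ["My kitten's name is JoJo,", "Tell me a story:"],
        }
    ],
    "parameters": {"max_new_tokens": 200},
}

# Endpoint and model name are placeholders; adjust to your deployment.
response = requests.post(
    "http://localhost:8080/v2/models/transformer/infer",
    json=payload,
)
response.raise_for_status()
print(response.json())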

adriangonz avatar Oct 03 '23 13:10 adriangonz

Hi @adriangonz, we are still running into an issue with this. The Python code works fine for passing max_new_tokens via kwarg, or in a config like you listed above, when calling the runtime directly. However, when we do a POST via Python requests to the "http://localhost:8080/v2/models/transformer/infer" endpoint, the parameters seem to be dropped at the decode step of the predict function in mlserver_huggingface/runtime.py on line 39:

async def predict(self, payload: InferenceRequest) -> InferenceResponse:
    # TODO: convert and validate?
    kwargs = HuggingfaceRequestCodec.decode_request(payload)
    args = kwargs.pop("args", [])

    array_inputs = kwargs.pop("array_inputs", [])
    if array_inputs:
        args = [list(array_inputs)] + args

    prediction = self._model(*args, **kwargs)

    return self.encode_response(
        payload=prediction, default_codec=HuggingfaceRequestCodec
    )

We could potentially just extract any parameter kwargs from the payload request and merge them into the kwargs dict? A rough sketch of that idea is below.
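
This is only an illustrative sketch, not the actual implementation: it assumes the request-level parameters object (a pydantic model in MLServer) can be dumped to a dict, and that protocol-level fields such as content_type and headers are stripped before merging.

from mlserver.types import InferenceRequest, InferenceResponse
from mlserver_huggingface.codecs import HuggingfaceRequestCodec


# Sketch of an alternative predict() method on the HuggingFace runtime class.
async def predict(self, payload: InferenceRequest) -> InferenceResponse:
    kwargs = HuggingfaceRequestCodec.decode_request(payload)
    args = kwargs.pop("args", [])

    array_inputs = kwargs.pop("array_inputs", [])
    if array_inputs:
        args = [list(array_inputs)] + args

    # Merge any extra request-level parameters into the pipeline kwargs,
    # dropping fields that belong to the V2 protocol itself.
    if payload.parameters is not None:
        extra = payload.parameters.dict(exclude_none=True)
        extra.pop("content_type", None)
        extra.pop("headers", None)
        kwargs.update(extra)

    prediction = self._model(*args, **kwargs)

    return self.encode_response(
        payload=prediction, default_codec=HuggingfaceRequestCodec
    )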

a-palacios avatar Oct 03 '23 15:10 a-palacios

We have the same issue and we don't know how to solve it. How can we enable such parameters?

rivamarco avatar Oct 09 '23 11:10 rivamarco

Ah I see... that would need some changes to the HF runtime to take into account what's passed via the parameters field - along the lines of what @a-palacios described.

To avoid picking up other fields of the parameters object by mistake, it should probably whitelist well-known argument names, though. For example:
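
A small illustration of that kind of whitelist; the allowed names and the helper function here are hypothetical examples, not part of the runtime:

# Keep only well-known generation arguments; ignore anything else that
# might be present on the parameters object.
ALLOWED_GENERATION_KWARGS = {"max_new_tokens", "return_full_text", "temperature", "top_p"}

def filter_inference_kwargs(parameters: dict) -> dict:
    return {k: v for k, v in parameters.items() if k in ALLOWED_GENERATION_KWARGS}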

adriangonz avatar Oct 09 '23 12:10 adriangonz

@adriangonz , I opened up another PR for this: https://github.com/SeldonIO/MLServer/pull/1505

nanbo-liu avatar Dec 05 '23 23:12 nanbo-liu