
Allow payload request to support extra inference method kwargs

nanbo-liu opened this issue on Aug 23 '23 · 6 comments

from transformers import LlamaForCausalLM, AutoTokenizer, TextGenerationPipeline

# Load the model in 8-bit and build a text-generation pipeline
model = LlamaForCausalLM.from_pretrained("daryl149/llama-2-7b-hf", load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained("daryl149/llama-2-7b-hf")
pipeline = TextGenerationPipeline(model, tokenizer)

# Extra generation kwargs passed straight to the pipeline call
pipeline("Once upon a time,", max_new_tokens=100, return_full_text=False)

max_new_tokens and return_full_text are extra arguments we can pass when calling the pipeline. max_new_tokens is the maximum number of tokens to generate, ignoring the number of tokens in the prompt.

It would be nice if they could be passed in via the payload request, something like:

{
    "inputs": [
        {
            "name": "text_inputs",
            "shape": [2],
            "datatype": "BYTES",
            "data": ["My kitten's name is JoJo,", "Tell me a story:"]
        }
    ],
    "inference_kwargs": {
        "max_new_tokens": 200
    }
}

nanbo-liu avatar Aug 23 '23 19:08 nanbo-liu

@adriangonz, I have a PR for this issue too: https://github.com/SeldonIO/MLServer/pull/1418

nanbo-liu avatar Sep 29 '23 14:09 nanbo-liu

Hey @nanbo-liu ,

As discussed in #1418, we don't have much control over the shape of InferenceRequest, which is kept quite agnostic from specific use cases.

However, the good news is that InferenceRequest objects already contain a parameters field that can be used to specify arbitrary parameters. Would this not be enough for your use case?

Following your example, you could have something like:

{
    "inputs": [
        {
            "name": "text_inputs",
            "shape": [2],
            "datatype": "BYTES",
            "data": ["My kitten's name is JoJo,", "Tell me a story:"]
        }
    ],
    "parameters": {
        "max_new_tokens": 200
    }
}
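
For reference, a minimal sketch of sending that payload over HTTP with Python's requests library; the local endpoint and the model name "transformer" are assumptions and should be adjusted to match your deployment:

import requests

# Payload following the V2 inference protocol, with the extra generation
# argument carried in the top-level "parameters" field.
payload = {
    "inputs": [
        {
            "name": "text_inputs",
            "shape": [2],
            "datatype": "BYTES",
            "data": ["My kitten's name is JoJo,", "Tell me a story:"],
        }
    ],
    "parameters": {"max_new_tokens": 200},
}

# Endpoint and model name are placeholders; adjust to your deployment.
response = requests.post(
    "http://localhost:8080/v2/models/transformer/infer",
    json=payload,
)
response.raise_for_status()
print(response.json())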

adriangonz avatar Oct 03 '23 13:10 adriangonz

Hi @adriangonz, we are still running into an issue with this. The Python code works fine for passing max_new_tokens via kwarg, or in a config like you listed above, when calling the runtime directly. However, when we do a POST via Python requests to the "http://localhost:8080/v2/models/transformer/infer" endpoint, the parameters seem to be dropped at the decode step of the predict function in mlserver_huggingface/runtime.py on line 39:

async def predict(self, payload: InferenceRequest) -> InferenceResponse:
    # TODO: convert and validate?
    kwargs = HuggingfaceRequestCodec.decode_request(payload)
    args = kwargs.pop("args", [])

    array_inputs = kwargs.pop("array_inputs", [])
    if array_inputs:
        args = [list(array_inputs)] + args

    prediction = self._model(*args, **kwargs)

    return self.encode_response(
        payload=prediction, default_codec=HuggingfaceRequestCodec
    )

We could potentially just extract any parameter kwargs from the payload request and merge them into the kwargs dict? A rough sketch of that idea is below.
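
This is only an illustrative sketch, not the actual implementation: it assumes the request-level parameters object (a pydantic model in MLServer) can be dumped to a dict, and that protocol-level fields such as content_type and headers are stripped before merging.

from mlserver.types import InferenceRequest, InferenceResponse
from mlserver_huggingface.codecs import HuggingfaceRequestCodec


# Sketch of an alternative predict() method on the HuggingFace runtime class.
async def predict(self, payload: InferenceRequest) -> InferenceResponse:
    kwargs = HuggingfaceRequestCodec.decode_request(payload)
    args = kwargs.pop("args", [])

    array_inputs = kwargs.pop("array_inputs", [])
    if array_inputs:
        args = [list(array_inputs)] + args

    # Merge any extra request-level parameters into the pipeline kwargs,
    # dropping fields that belong to the V2 protocol itself.
    if payload.parameters is not None:
        extra = payload.parameters.dict(exclude_none=True)
        extra.pop("content_type", None)
        extra.pop("headers", None)
        kwargs.update(extra)

    prediction = self._model(*args, **kwargs)

    return self.encode_response(
        payload=prediction, default_codec=HuggingfaceRequestCodec
    )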

a-palacios avatar Oct 03 '23 15:10 a-palacios

We have the same issue and we don't know how to solve it. How can we enable such parameters?

rivamarco avatar Oct 09 '23 11:10 rivamarco

Ah I see... that would need some changes to the HF runtime to take into account what's passed via the parameters field - along the lines of what @a-palacios described.

To avoid picking up other fields of the parameters object by mistake, it should probably whitelist well-known argument names, though. For example:
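
A small illustration of that kind of whitelist; the allowed names and the helper function here are hypothetical examples, not part of the runtime:

# Keep only well-known generation arguments; ignore anything else that
# might be present on the parameters object.
ALLOWED_GENERATION_KWARGS = {"max_new_tokens", "return_full_text", "temperature", "top_p"}

def filter_inference_kwargs(parameters: dict) -> dict:
    return {k: v for k, v in parameters.items() if k in ALLOWED_GENERATION_KWARGS}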

adriangonz avatar Oct 09 '23 12:10 adriangonz

@adriangonz , I opened up another PR for this: https://github.com/SeldonIO/MLServer/pull/1505

nanbo-liu avatar Dec 05 '23 23:12 nanbo-liu