MLServer
Allow payload request to support extra inference method kwargs
from transformers import LlamaForCausalLM, AutoTokenizer, TextGenerationPipeline

# Load the model in 8-bit and build a text-generation pipeline
model = LlamaForCausalLM.from_pretrained("daryl149/llama-2-7b-hf", load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained("daryl149/llama-2-7b-hf")
pipeline = TextGenerationPipeline(model, tokenizer)

# Extra generation kwargs are passed directly to the pipeline call
pipeline("Once upon a time,", max_new_tokens=100, return_full_text=False)
max_new_tokens and return_full_text are extra arguments we can pass to the pipeline call. max_new_tokens is the maximum number of tokens to generate, ignoring the number of tokens in the prompt, and return_full_text controls whether the prompt is included in the returned text.
I wish these could also be passed in the payload request, something like:
{
  "inputs": [
    {
      "name": "text_inputs",
      "shape": [1],
      "datatype": "BYTES",
      "data": ["My kitten's name is JoJo,", "Tell me a story:"]
    }
  ],
  "inference_kwargs": {
    "max_new_tokens": 200
  }
}
@adriangonz, I have a PR for this issue too: https://github.com/SeldonIO/MLServer/pull/1418
Hey @nanbo-liu,
As discussed in #1418, we don't have much control over the shape of InferenceRequest, which is kept quite agnostic from specific use cases.
However, the good news is that InferenceRequest objects already contain a parameters field that can be used to specify arbitrary parameters. Would this not be enough for your use case?
Following your example, you could have something like:
{
  "inputs": [
    {
      "name": "text_inputs",
      "shape": [1],
      "datatype": "BYTES",
      "data": ["My kitten's name is JoJo,", "Tell me a story:"]
    }
  ],
  "parameters": {
    "max_new_tokens": 200
  }
}
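On the client side, sending that payload with Python's requests library would look roughly like the sketch below (the model name "transformer" and the default HTTP port 8080 are assumptions; adjust them to match your deployment):

import requests

# Assumed deployment details: model named "transformer", default MLServer HTTP port 8080
inference_request = {
    "inputs": [
        {
            "name": "text_inputs",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["My kitten's name is JoJo,", "Tell me a story:"],
        }
    ],
    "parameters": {"max_new_tokens": 200},
}

response = requests.post(
    "http://localhost:8080/v2/models/transformer/infer",
    json=inference_request,
)
print(response.json())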
Hi @adriangonz, we are still running into an issue with this. The Python code works fine for passing in new tokens via kwargs, or in a config like the one you listed above, when calling the MLServer code directly. However, when we do a POST via Python requests to the "http://localhost:8080/v2/models/transformer/infer" endpoint, the parameters seem to be dropped during the decode step of the predict function in mlserver_huggingface/runtime.py on line 39:
async def predict(self, payload: InferenceRequest) -> InferenceResponse:
    # TODO: convert and validate?
    kwargs = HuggingfaceRequestCodec.decode_request(payload)
    args = kwargs.pop("args", [])
    array_inputs = kwargs.pop("array_inputs", [])
    if array_inputs:
        args = [list(array_inputs)] + args
    prediction = self._model(*args, **kwargs)
    return self.encode_response(
        payload=prediction, default_codec=HuggingfaceRequestCodec
    )
We could potentially just extract any parameter kwargs from the payload request and merge them into the kwargs dict? A rough sketch of the idea is below.
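This is only an illustration of the idea, not the actual change: merge any extra fields from payload.parameters into the kwargs before calling the pipeline (the exact way of filtering out protocol-level fields here is an assumption):

async def predict(self, payload: InferenceRequest) -> InferenceResponse:
    kwargs = HuggingfaceRequestCodec.decode_request(payload)
    args = kwargs.pop("args", [])
    array_inputs = kwargs.pop("array_inputs", [])
    if array_inputs:
        args = [list(array_inputs)] + args

    # Sketch: merge extra request-level parameters (e.g. max_new_tokens) into
    # the pipeline kwargs, skipping fields that belong to the V2 protocol itself
    if payload.parameters is not None:
        extra = payload.parameters.dict(exclude_unset=True)
        extra.pop("content_type", None)
        extra.pop("headers", None)
        kwargs.update(extra)

    prediction = self._model(*args, **kwargs)
    return self.encode_response(
        payload=prediction, default_codec=HuggingfaceRequestCodec
    )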
We have the same issue and we don't know how to solve it. How can we enable such parameters?
Ah I see... that would need some changes to the HF runtime to take into account what's passed via the parameters field - along the lines of what @a-palacios described.
To avoid picking up other fields of the parameters object by mistake, it should probably whitelist well-known argument names though.
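Something along these lines, for example (the helper and the set of allowed names below are purely illustrative, not part of the codebase):

from mlserver.types import InferenceRequest

# Illustrative whitelist of generation kwargs the runtime would accept;
# the exact set of names would need to be agreed on
ALLOWED_PIPELINE_KWARGS = {"max_new_tokens", "return_full_text", "temperature", "top_k", "top_p"}

def extract_pipeline_kwargs(payload: InferenceRequest) -> dict:
    if payload.parameters is None:
        return {}
    extra = payload.parameters.dict(exclude_unset=True)
    return {k: v for k, v in extra.items() if k in ALLOWED_PIPELINE_KWARGS}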
@adriangonz, I opened up another PR for this: https://github.com/SeldonIO/MLServer/pull/1505