MLServer
Simplify interface to write custom runtimes
When writing custom runtimes, MLServer requires the user to know a bit about the V2 request and response structure, as well as codecs. This can add some friction when it comes to writing custom code.
When working in Tempo, one of the main aims was to simplify how “custom models” were implemented. Therefore, it would be interesting to revisit some of these ideas to help simplify MLServer runtimes.
With this goal in mind, it would be interesting to let the user write runtimes like the one below:
```python
import numpy as np

from mlserver import MLModel


class MyCustomRuntime(MLModel):
    # Use the signature to declare the expected request and response content types
    async def predict(self, payload: np.ndarray) -> np.ndarray:
        # Do something with the payload
        pred = payload.sum(keepdims=True)

        # Return the result as-is
        return pred
```
Under the hood, MLServer could look at the Python type hints to determine the right content types that should be used, and add them to the model metadata. This would allow MLServer to find, on the fly, the right set of codecs to encode / decode the request and response.
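As a rough sketch of what that inspection step could look like (the `TYPE_TO_CODEC` mapping and `infer_codecs` helper below are hypothetical, shown only to illustrate the idea; `NumpyCodec` and `PandasCodec` are MLServer's existing codecs):

```python
# Hypothetical sketch: map Python type hints to codecs by inspecting the
# signature of `predict`. Not part of MLServer's actual API.
import inspect

import numpy as np
import pandas as pd

from mlserver.codecs import NumpyCodec, PandasCodec

TYPE_TO_CODEC = {
    np.ndarray: NumpyCodec,
    pd.DataFrame: PandasCodec,
}


def infer_codecs(predict):
    signature = inspect.signature(predict)

    # One codec per input "head", keyed by argument name
    input_codecs = {
        name: TYPE_TO_CODEC[param.annotation]
        for name, param in signature.parameters.items()
        if name != "self"
    }

    # Codec used to encode the response back
    output_codec = TYPE_TO_CODEC[signature.return_annotation]
    return input_codecs, output_codec
```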
Multiple Inputs / Outputs
The V2 protocol lets the user send and return multiple input and output “heads”. To account for this, the new simplified interface could use the names of the function arguments to match them with the incoming request:
```python
class MyCustomRuntime(MLModel):
    # Input `foo` would get passed as kwarg `foo`, and input `bar` as kwarg `bar`
    async def predict(self, foo: np.ndarray, bar: np.ndarray) -> np.ndarray:
        ...
```
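The matching itself could be as simple as keying the decoded inputs by name. A minimal sketch, assuming a hypothetical `build_kwargs` helper and MLServer's `NumpyCodec` (whose exact method names may vary across MLServer versions):

```python
# Hypothetical sketch of matching request inputs to kwargs by name;
# `build_kwargs` is illustrative, not part of MLServer's actual API.
from mlserver.codecs import NumpyCodec
from mlserver.types import InferenceRequest


def build_kwargs(request: InferenceRequest) -> dict:
    # Each input "head" becomes a kwarg keyed by its name (e.g. `foo`, `bar`)
    return {inp.name: NumpyCodec.decode_input(inp) for inp in request.inputs}
```

The dispatch could then look like `await runtime.predict(**build_kwargs(request))`.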
When encoding back the response though, the signature itself doesn’t show any info about the expected names. Therefore, MLServer will need to:

- Look at the `outputs` field of the model metadata, and match the returned values based on their order.
- In the absence of the above, make up some default names (e.g. `output-0`, `output-1`, etc.).
Note that this doesn’t cover function signatures with a single “multi-input / -output” return type, like a Pandas DataFrame. In those cases, MLServer will look at the column names to infer the right output names.
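Putting the above together, a rough sketch of that naming precedence could look like this (the `name_outputs` helper is hypothetical, shown only to illustrate the order in which the rules apply):

```python
# Hypothetical sketch of the output-naming precedence described above;
# `name_outputs` is illustrative, not part of MLServer's actual API.
from typing import List, Optional, Sequence, Union

import numpy as np
import pandas as pd


def name_outputs(
    returned: Union[pd.DataFrame, Sequence[np.ndarray]],
    metadata_outputs: Optional[List[str]] = None,
) -> List[str]:
    if isinstance(returned, pd.DataFrame):
        # "Multi-output" return types carry their own names (the column names)
        return [str(col) for col in returned.columns]

    if metadata_outputs:
        # Match the returned values against the metadata, based on their order
        return metadata_outputs[: len(returned)]

    # In the absence of metadata, make up some default names
    return [f"output-{i}" for i in range(len(returned))]
```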
Variable Content Types & Advanced Use Cases
In some cases though, the approach suggested above may show some limitations. For example, think of runtimes which accept a variable set of content types, like the MLflow or SKLearn runtimes (which accept either dataframes or NumPy arrays), or of other advanced use cases.
However, these use cases should still be covered by the current approach, which would remain valid. That is, when using the “low-level” `InferenceRequest` and `InferenceResponse` types (which correspond to the respective V2 payloads), encoding / decoding will remain the user’s responsibility (i.e. they will need to call something like `self.decode`).
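For reference, a minimal sketch of that low-level flow, using the existing `mlserver.types` payloads (the exact decoding helpers are assumed from current MLServer and may differ slightly between versions):

```python
# Sketch of the existing "low-level" interface, where the user decodes and
# encodes the V2 payloads themselves.
from mlserver import MLModel
from mlserver.codecs import NumpyCodec
from mlserver.types import InferenceRequest, InferenceResponse, ResponseOutput


class MyCustomRuntime(MLModel):
    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Decoding the request is the user's responsibility here
        decoded = self.decode(payload.inputs[0], default_codec=NumpyCodec)
        pred = decoded.sum(keepdims=True)

        # ...and so is encoding the result back into a V2 response
        return InferenceResponse(
            model_name=self.name,
            outputs=[
                ResponseOutput(
                    name="output-0",
                    shape=list(pred.shape),
                    datatype="FP64",
                    data=pred.flatten().tolist(),
                )
            ],
        )
```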