MLServer
Adaptive Batching not working as expected
I've been trying to deploy an ML model locally with MLServer. My model is saved to an MLflow model registry, so I start the server as follows:

```shell
export MLSERVER_PARALLEL_WORKERS=5
export MLSERVER_MODEL_MAX_BATCH_SIZE=40
export MLSERVER_MODEL_MAX_BATCH_TIME=0.4
mlflow models serve --model-uri "models:/$modelName/$modelVersion" -h 0.0.0.0 -p $modelPort --env-manager=local --enable-mlserver
```
When I hit the gRPC endpoint with 3 requests in parallel, I expected three response objects, each with its own prediction score. Instead, I receive one response object containing all 3 batched predictions, and two other response objects with no prediction value:
```
outputs {
  name: "output-1"
  datatype: "FP32"
  shape: 1
  shape: 1
  parameters {
    key: "content_type"
    value {
      string_param: "np"
    }
  }
  contents {
    fp32_contents: <v>
    fp32_contents: <v>
    fp32_contents: <v>
  }
}
```
I also noticed the same behavior when starting the server with `mlserver start .`
Hi @sowmyay -- could you please describe how you are sending the 3 requests (e.g. the format of each, and whether you send them asynchronously or otherwise)?
@ramonpzg I am running a load test with locust. It spawns multiple workers to send the requests, and all the requests have the same input values.
Hey @sowmyay ,
For adaptive batching to work, the model needs to return shapes compatible with the batch size. In general, MLServer assumes that the first dimension is the batch dimension (i.e. `[N, ...]`), so the output should share the same value for that batch dimension (i.e. it should also be `[N, ...]`). If there's a mismatch there, MLServer won't know how to split the batched response back into individual responses, and I suspect that may be what's going on here.
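As a rough numpy sketch of that splitting logic (simplified for illustration, not MLServer's actual implementation):

```python
import numpy as np

# Three [1, 4] requests batched into a single [3, 4] request
# (the 4 features here are just an example).
batched_input = np.ones((3, 4), dtype=np.float32)

# Compatible output: the model keeps the batch dimension -> shape [3, 1].
good_output = np.zeros((3, 1), dtype=np.float32)
# Splitting along axis 0 recovers one [1, 1] response per request.
responses = np.split(good_output, good_output.shape[0], axis=0)
assert len(responses) == 3 and responses[0].shape == (1, 1)

# Incompatible output: a [1, 1] shape, even if it somehow carries 3 values,
# leaves no way to attribute predictions back to the 3 original requests;
# np.split(np.zeros((1, 1)), 3, axis=0) would raise a ValueError.
```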
From your issue, I can see that the output your model returns has shape [1, 1] - is this the case?
Yes, that's right. It is an xgboost model.
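(One common source of this with xgboost: `predict` typically returns a 1-D array of length N, and if that gets encoded without an explicit batch dimension, the `[N, 1]` shape is lost. A minimal numpy sketch of the reshape that keeps the batch dimension explicit, simulated without a real `Booster`:)

```python
import numpy as np

# Simulated xgboost predictions for a batch of 3: a 1-D array of length N.
raw_preds = np.array([0.1, 0.7, 0.3], dtype=np.float32)  # shape (3,)

# Reshape so the first axis is explicitly the batch dimension.
preds = raw_preds.reshape(-1, 1)  # shape (3, 1)
assert preds.shape == (3, 1)
```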
@adriangonz From the docs, I expected adaptive batching to work as follows for my use case:

1. MLServer receives requests of dimension `[1, ...]`
2. Within the acceptable `MLSERVER_MODEL_MAX_BATCH_TIME`, it combines incoming requests into a batched request of dimension `[N, ...]`
3. It computes the prediction on the batched input and produces an output of shape `[N, 1]`
4. It splits the output into N `[1, 1]` response objects and returns them to the callers.
Steps 1-3 are being executed as expected, but step 4 isn't.
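The split I expected in step 4 could be sketched like this (a simplified numpy version, not MLServer's actual code):

```python
import numpy as np

def unbatch(output: np.ndarray) -> list:
    """Split a batched [N, ...] output into N individual [1, ...] responses."""
    return np.split(output, output.shape[0], axis=0)

# Step 3 produces an [N, 1] output for a batch of 3 requests.
batched = np.arange(3, dtype=np.float32).reshape(3, 1)
per_request = unbatch(batched)
assert [r.shape for r in per_request] == [(1, 1)] * 3
```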
Can you clarify if my understanding is wrong here?
Hey @sowmyay ,
From what we can see in your output, the problem here is that the model doesn't return `[N, 1]` but `[1, 1]`. Because of that, MLServer's adaptive batching doesn't have enough information to split the batched response into its individual responses.
In fact, what's weird here is that your contents field returns three values but that's not reflected under the shape field.
Could you share a minimal example that helps us replicate the issue on our side? In particular, it would be good to understand your MLflow model signature (and/or an example "raw" output of your MLflow model, before it's encoded to the Open Inference protocol).
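For reference, a tensor-based signature with an explicit (variable) batch dimension looks roughly like this in the model's `MLmodel` file (the input shape of `[-1, 4]` here is hypothetical; `-1` denotes the variable batch dimension):

```yaml
signature:
  inputs: '[{"type": "tensor", "tensor-spec": {"dtype": "float32", "shape": [-1, 4]}}]'
  outputs: '[{"type": "tensor", "tensor-spec": {"dtype": "float32", "shape": [-1, 1]}}]'
```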