MLServer
Adaptive Batching not working as expected
I've been trying to deploy an ML model locally with MLServer. My model is saved to an MLflow model registry, so I start the server as follows:

```shell
export MLSERVER_PARALLEL_WORKERS=5
export MLSERVER_MODEL_MAX_BATCH_SIZE=40
export MLSERVER_MODEL_MAX_BATCH_TIME=0.4
mlflow models serve --model-uri "models:/$modelName/$modelVersion" -h 0.0.0.0 -p $modelPort --env-manager=local --enable-mlserver
```
When I hit the gRPC endpoint with 3 requests in parallel, I expected three response objects, each with its own prediction score. Instead, I receive one response object containing all 3 batched predictions, and two other response objects with no prediction value:
```
outputs {
  name: "output-1"
  datatype: "FP32"
  shape: 1
  shape: 1
  parameters {
    key: "content_type"
    value {
      string_param: "np"
    }
  }
  contents {
    fp32_contents: <v>
    fp32_contents: <v>
    fp32_contents: <v>
  }
}
```
I also noticed the same behavior when starting the server with `mlserver start .`
Hi @sowmyay -- could you please describe how you are sending the 3 requests (e.g. the format of each, and whether you send them asynchronously or otherwise)?
@ramonpzg I am running a load test with locust. It spawns multiple workers to send the requests, and all the requests have the same input values.
Hey @sowmyay ,
For adaptive batching to work, the model needs to return shapes compatible with the batch size. In general, MLServer assumes that the first dimension is the batch dimension (i.e. `[N, ...]`), so the output should share the same value for that batch dimension (i.e. it should also be `[N, ...]`). If there's a mismatch there, MLServer won't know how to split the batched response back into individual responses, and I suspect that may be what's going on here.
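As a rough numpy sketch of that splitting logic (simplified for illustration, not MLServer's actual implementation):

```python
import numpy as np

# Three [1, 4] requests batched into a single [3, 4] request
# (the 4 features here are just an example).
batched_input = np.ones((3, 4), dtype=np.float32)

# Compatible output: the model keeps the batch dimension -> shape [3, 1].
good_output = np.zeros((3, 1), dtype=np.float32)
# Splitting along axis 0 recovers one [1, 1] response per request.
responses = np.split(good_output, good_output.shape[0], axis=0)
assert len(responses) == 3 and responses[0].shape == (1, 1)

# Incompatible output: a [1, 1] shape, even if it somehow carries 3 values,
# leaves no way to attribute predictions back to the 3 original requests;
# np.split(np.zeros((1, 1)), 3, axis=0) would raise a ValueError.
```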
From your issue, I can see that the output your model returns has shape [1, 1] - is this the case?
Yes, that's right. It is an xgboost model.
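(One common source of this with xgboost: `predict` typically returns a 1-D array of length N, and if that gets encoded without an explicit batch dimension, the `[N, 1]` shape is lost. A minimal numpy sketch of the reshape that keeps the batch dimension explicit, simulated without a real `Booster`:)

```python
import numpy as np

# Simulated xgboost predictions for a batch of 3: a 1-D array of length N.
raw_preds = np.array([0.1, 0.7, 0.3], dtype=np.float32)  # shape (3,)

# Reshape so the first axis is explicitly the batch dimension.
preds = raw_preds.reshape(-1, 1)  # shape (3, 1)
assert preds.shape == (3, 1)
```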
@adriangonz From the docs, I expected adaptive batching to work as follows for my use case:

1. MLServer receives requests of dimension `[1, ...]`
2. Within the acceptable `MLSERVER_MODEL_MAX_BATCH_TIME`, it combines incoming requests into a batched request of dimension `[N, ...]`
3. It computes the prediction on the batched input and produces an output of shape `[N, 1]`
4. It splits the output into N `[1, 1]` response objects and returns them to the callers.
Steps 1-3 are being executed as expected, but step 4 isn't.
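The split I expected in step 4 could be sketched like this (a simplified numpy version, not MLServer's actual code):

```python
import numpy as np

def unbatch(output: np.ndarray) -> list:
    """Split a batched [N, ...] output into N individual [1, ...] responses."""
    return np.split(output, output.shape[0], axis=0)

# Step 3 produces an [N, 1] output for a batch of 3 requests.
batched = np.arange(3, dtype=np.float32).reshape(3, 1)
per_request = unbatch(batched)
assert [r.shape for r in per_request] == [(1, 1)] * 3
```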
Can you clarify if my understanding is wrong here?
Hey @sowmyay ,
From what we can see in your output, the problem here is that the model doesn't return `[N, 1]` but `[1, 1]`. Because of that, MLServer's adaptive batching doesn't have enough information to split the batched response into its individual responses.
In fact, what's weird here is that your contents field returns three values but that's not reflected under the shape field.
Could you share a minimal example that helps us replicate the issue on our side? In particular, it would be good to understand your MLflow model signature (and/or an example "raw" output of your MLflow model, before it's encoded to the Open Inference protocol).
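For reference, a tensor-based signature with an explicit (variable) batch dimension looks roughly like this in the model's `MLmodel` file (the input shape of `[-1, 4]` here is hypothetical; `-1` denotes the variable batch dimension):

```yaml
signature:
  inputs: '[{"type": "tensor", "tensor-spec": {"dtype": "float32", "shape": [-1, 4]}}]'
  outputs: '[{"type": "tensor", "tensor-spec": {"dtype": "float32", "shape": [-1, 1]}}]'
```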