MLServer
gRPC fails with inferred f16 numpy array
I think I discovered a bug in the current gRPC code in MLServer. I have a model that returns float16 arrays, and I tried to get predictions via gRPC. I was able to narrow the issue down to the following example, which reproduces it without any client-server complexity.
Reproducing the error
import numpy as np

from mlserver.codecs.decorator import SignatureCodec
import mlserver.grpc.converters as converters


def a() -> np.ndarray:
    # stand-in for a model output: a float16 array
    return np.array([[1.123, 4], [1, 3], [1, 2]], dtype=np.float16)


codec = SignatureCodec(a)
r = codec.encode_response(payload=a(), model_name="x")
converters.ModelInferResponseConverter.from_types(r)
The last line raises:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.12/site-packages/mlserver/grpc/converters.py", line 380, in from_types
    InferOutputTensorConverter.from_types(output)
  File "/usr/local/lib/python3.12/site-packages/mlserver/grpc/converters.py", line 425, in from_types
    contents=InferTensorContentsConverter.from_types(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/mlserver/grpc/converters.py", line 335, in from_types
    return pb.InferTensorContents(**contents)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: expected bytes, float found
Root cause
I think the root cause is the gRPC type-to-field mapping in mlserver/grpc/converters.py:
_FIELDS = {
    ...
    "FP16": "bytes_contents",
    "FP32": "fp32_contents",
    "FP64": "fp64_contents",
    "BYTES": "bytes_contents",
}
The mapping routes FP16 tensors to bytes_contents, but the float values are passed through unchanged, so protobuf receives Python floats where it expects bytes. The dataplane doesn't even offer an fp16_contents field that could be used for this purpose. (Is that because protobuf doesn't support fp16 by default?)
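To make the failure concrete, here is a minimal sketch of what the converter effectively does for FP16, assuming the generated dataplane stubs are importable as mlserver.grpc.dataplane_pb2 (the pb module from the traceback):

import numpy as np
import mlserver.grpc.dataplane_pb2 as pb

# bytes_contents is a repeated bytes field, but the FP16 values arrive
# as plain Python floats, which protobuf rejects
values = np.array([1.123, 4.0], dtype=np.float16).tolist()
pb.InferTensorContents(bytes_contents=values)  # TypeError: expected bytes, float found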
Potential fix
I think the FP16 entry should point to fp32_contents in the gRPC type-to-field mapping, although this wastes half of the bandwidth (each 2-byte fp16 value would be shipped as 4 bytes).
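A sketch of what that workaround would amount to, using plain numpy and no MLServer internals (the contents dict mirrors the keyword arguments passed to InferTensorContents above):

import numpy as np

fp16 = np.array([[1.123, 4], [1, 3], [1, 2]], dtype=np.float16)
fp32 = fp16.flatten().astype(np.float32)  # upcast before encoding

# fp32_contents is a repeated float field, so Python floats are accepted
contents = {"fp32_contents": fp32.tolist()}

# the cost: every element takes 4 bytes on the wire instead of 2
assert fp32.nbytes == 2 * fp16.nbytes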
I just discovered the following in the Open Inference Protocol:
message ModelInferResponse
{
  // ...

  // The output tensors holding inference results.
  repeated InferOutputTensor outputs = 5;

  // The data contained in an output tensor can be represented in
  // "raw" bytes form or in the repeated type that matches the
  // tensor's data type. To use the raw representation 'raw_output_contents'
  // must be initialized with data for each tensor in the same order as
  // 'outputs'. For each tensor, the size of this content must match
  // what is expected by the tensor's shape and data type. The raw
  // data must be the flattened, one-dimensional, row-major order of
  // the tensor elements without any stride or padding between the
  // elements. Note that the FP16 and BF16 data types must be represented as
  // raw content as there is no specific data type for a 16-bit float type.
  //
  // If this field is specified then InferOutputTensor::contents must
  // not be specified for any output tensor.
  repeated bytes raw_output_contents = 6;
}
So, 16-bit floats should actually go into raw_output_contents. I'm not sure why that didn't happen in my case.
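For completeness, here is what that raw representation looks like for the tensor from the example above, using plain numpy (the shape and datatype would travel in the accompanying InferOutputTensor):

import numpy as np

tensor = np.array([[1.123, 4], [1, 3], [1, 2]], dtype=np.float16)

# flattened, one-dimensional, row-major, no stride or padding between
# elements, exactly as the protocol comment requires
raw = np.ascontiguousarray(tensor).tobytes()
assert len(raw) == tensor.size * tensor.itemsize  # 6 elements * 2 bytes each

# the receiving side reconstructs the tensor from the declared shape/dtype
decoded = np.frombuffer(raw, dtype=np.float16).reshape(tensor.shape)
assert np.array_equal(decoded, tensor)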