MLServer
MLServer copied to clipboard
PandasCodec unexpected behavior
Hi,
I am having some problems with PandasCodec and grpc requests.
On the client side, I have the following code to send a pandas df:
data = pd.DataFrame(data={'col1': ["test_string", "test_string"], 'col2': ["test_string", "test_string"]})
i_request = PandasCodec.encode_request(data)
i_request.parameters = types.Parameters(content_type="pd")
inference_request_g = converters.ModelInferRequestConverter.from_types(
i_request, model_name=model_name, model_version=None
)
grpc_channel = grpc.insecure_channel("localhost:8081")
grpc_stub = dataplane.GRPCInferenceServiceStub(grpc_channel)
response = grpc_stub.ModelInfer(inference_request_g)
And I have this code fragment to receive it on the server side:
async def predict(self, payload: types.InferenceRequest) -> types.InferenceResponse:
if payload.parameters.content_type == "pd":
inference_data = PandasCodec.decode_request(payload)
print(inference_data)
I get this output:
col1 col2
0 b'test_string' b'test_string'
1 b'test_string' b'test_string'
The same dataframe, but with binary strings. I was expecting to get as a result the same dataframe that I have in the client, since I am using the same codec for encoding and decoding.
Is this an expected behavior? Maybe I am doing something wrong?
Thanks for your help.
Hey @pablobgar ,
Thanks for providing that high level of detail! :+1:
The PandasCodec will look at the content_types from each individual column (i.e. from each entry in the inputs list) to do any further conversion if required. For inputs with datatype BYTES, it will leave them as they are, unless they are flagged with content_type: "str", in which case it will convert them as strings.
Therefore, my guess is that either the PandasCodec.encode_request() method is not adding is not adding that str label to the string columns, or that the ModelInferRequestConverter.from_types() method is not adding the content_type label into the gRPC payload. To clarify this point, could you share the content of the i_request and inference_request_g payloads?
PS: within an MLModel instance, you can use the self.decode_request(payload) helper method, which will look at the content_type flags for you (i.e. without the need for an explicit if content_type == "...", then ....).
Hi @adriangonz
Thanks for your quick response!!
This is the content of i_request:
{'id': None, 'parameters': {'content_type': 'pd', 'headers': None}, 'inputs': [{'name': 'col1', 'shape': [2], 'datatype': 'BYTES', 'parameters': None, 'data': [b'test_string', b'test_string']}, {'name': 'col2', 'shape': [2], 'datatype': 'BYTES', 'parameters': None, 'data': [b'test_string', b'test_string']}], 'outputs': None}
And this is inference_request_g:
model_name: "test"
parameters {
key: "content_type"
value {
string_param: "pd"
}
}
inputs {
name: "col1"
datatype: "BYTES"
shape: 2
contents {
bytes_contents: "test_string"
bytes_contents: "test_string"
}
}
inputs {
name: "col2"
datatype: "BYTES"
shape: 2
contents {
bytes_contents: "test_string"
bytes_contents: "test_string"
}
}
Hey @pablobgar ,
Right, that seems aligned with what was described on my comment above. The PandasCodec.encode_request method is not setting the right content type at the individual input level.
Could you try explicitly setting str as the content type for the inputs of i_request? As in,
from mlserver.codecs import StringCodec
for input in i_request.inputs:
input.parameters = Parameters(content_type=StringCodec.ContentType)
This is the resulting request:
model_name: "arte"
parameters {
key: "content_type"
value {
string_param: "pd"
}
}
inputs {
name: "col1"
datatype: "BYTES"
shape: 2
parameters {
key: "content_type"
value {
string_param: "str"
}
}
contents {
bytes_contents: "test_string"
bytes_contents: "test_string"
}
}
inputs {
name: "col2"
datatype: "BYTES"
shape: 2
parameters {
key: "content_type"
value {
string_param: "str"
}
}
contents {
bytes_contents: "test_string"
bytes_contents: "test_string"
}
}
But i have the same result on the server:
col1 col2
0 b'test_string' b'test_string'
1 b'test_string' b'test_string'
Hey @pablobgar,
Thanks for trying that out.
Instead of calling the PandasCodec.decode_request(payload) method directly, could you call the self.decode_request(payload, default_codec=PandasCodec) method instead? This method should ensure that each input gets decoded appropriately first .
Hi @adriangonz
I have the same result with self.decode_request.
It seems that decoding is not performed in any case when the content type is bytes and PandasCodec is used. https://github.com/SeldonIO/MLServer/blob/1.1.0/mlserver/codecs/pandas.py#L17-L19
The decoding of the individual columns should occur within the self.decode_request method in the MLModel base class though... That one calls this decode_inference_request helper, which makes a first pass through the entries of the inputs array to see if there is any decoding needed at the input level (e.g. like decoding bytes into strings):
https://github.com/SeldonIO/MLServer/blob/6bb983612720eabff53bbc772591ca7f677ad2e2/mlserver/codecs/utils.py#L129-L130
So it's strange why that still doesn't work for you...
Just to be on the safe side, I've added a small test case on the PR linked below, which should handle the same use case. Do you see any differences with your custom code?
https://github.com/SeldonIO/MLServer/pull/752/files#diff-3b8a58a4d021803b3171b886bb9162fd659e671131f3f61036f9210cb5d0bc7cR186-R204
Hi @adriangonz
Now it is working correctly if I add Parameters(content_type=StringCodec.ContentType). Maybe I did something wrong when I tested it a few days ago, or I didn't test the combination of adding this and use the self.decode_request method, sorry.
This is my code:
data = pd.DataFrame(
data={
"col1": ["test_string", "test_string"],
"col2": ["test_string", "test_string"],
}
)
i_request = PandasCodec.encode_request(data)
i_request.parameters = types.Parameters(content_type="pd")
for input in i_request.inputs:
input.parameters = types.Parameters(content_type=StringCodec.ContentType)
But then where I see a problem is the PandasCodec.encode_request(). Shouldn't it add this parameter to columns that are strings automatically?
Hey @pablobgar ,
Thanks for the update!
On the last point, I totally agree. And TBH, the PandasCodec should also be able to decode strings. To put it another way, since it's the only which knows when a series should be encoded as a string, it should also be able to decode a bytes array into a string series.
We'll keep this issue open to tackle this last point.