MLServer icon indicating copy to clipboard operation
MLServer copied to clipboard

PandasCodec unexpected behavior

Open pablobgar opened this issue 3 years ago • 9 comments
trafficstars

Hi,

I am having some problems with PandasCodec and grpc requests.

On the client side, I have the following code to send a pandas df:


data = pd.DataFrame(data={'col1': ["test_string", "test_string"], 'col2': ["test_string", "test_string"]})

i_request = PandasCodec.encode_request(data)
i_request.parameters = types.Parameters(content_type="pd")

inference_request_g = converters.ModelInferRequestConverter.from_types(
    i_request, model_name=model_name, model_version=None
)

grpc_channel = grpc.insecure_channel("localhost:8081")
grpc_stub = dataplane.GRPCInferenceServiceStub(grpc_channel)
response = grpc_stub.ModelInfer(inference_request_g)

And I have this code fragment to receive it on the server side:


async def predict(self, payload: types.InferenceRequest) -> types.InferenceResponse:

        if payload.parameters.content_type == "pd":
            inference_data = PandasCodec.decode_request(payload)

            print(inference_data)

I get this output:

             col1            col2
0  b'test_string'  b'test_string'
1  b'test_string'  b'test_string'

The same dataframe, but with binary strings. I was expecting to get as a result the same dataframe that I have in the client, since I am using the same codec for encoding and decoding.

Is this an expected behavior? Maybe I am doing something wrong?

Thanks for your help.

pablobgar avatar Sep 22 '22 15:09 pablobgar

Hey @pablobgar ,

Thanks for providing that high level of detail! :+1:

The PandasCodec will look at the content_types from each individual column (i.e. from each entry in the inputs list) to do any further conversion if required. For inputs with datatype BYTES, it will leave them as they are, unless they are flagged with content_type: "str", in which case it will convert them as strings.

Therefore, my guess is that either the PandasCodec.encode_request() method is not adding is not adding that str label to the string columns, or that the ModelInferRequestConverter.from_types() method is not adding the content_type label into the gRPC payload. To clarify this point, could you share the content of the i_request and inference_request_g payloads?

PS: within an MLModel instance, you can use the self.decode_request(payload) helper method, which will look at the content_type flags for you (i.e. without the need for an explicit if content_type == "...", then ....).

adriangonz avatar Sep 22 '22 16:09 adriangonz

Hi @adriangonz

Thanks for your quick response!!

This is the content of i_request:

{'id': None, 'parameters': {'content_type': 'pd', 'headers': None}, 'inputs': [{'name': 'col1', 'shape': [2], 'datatype': 'BYTES', 'parameters': None, 'data': [b'test_string', b'test_string']}, {'name': 'col2', 'shape': [2], 'datatype': 'BYTES', 'parameters': None, 'data': [b'test_string', b'test_string']}], 'outputs': None}

And this is inference_request_g:


model_name: "test"
parameters {
  key: "content_type"
  value {
    string_param: "pd"
  }
}
inputs {
  name: "col1"
  datatype: "BYTES"
  shape: 2
  contents {
    bytes_contents: "test_string"
    bytes_contents: "test_string"
  }
}
inputs {
  name: "col2"
  datatype: "BYTES"
  shape: 2
  contents {
    bytes_contents: "test_string"
    bytes_contents: "test_string"
  }
}

pablobgar avatar Sep 23 '22 11:09 pablobgar

Hey @pablobgar ,

Right, that seems aligned with what was described on my comment above. The PandasCodec.encode_request method is not setting the right content type at the individual input level.

Could you try explicitly setting str as the content type for the inputs of i_request? As in,

from mlserver.codecs import StringCodec

for input in i_request.inputs:
	input.parameters = Parameters(content_type=StringCodec.ContentType)

adriangonz avatar Sep 23 '22 11:09 adriangonz

This is the resulting request:


model_name: "arte"
parameters {
  key: "content_type"
  value {
    string_param: "pd"
  }
}
inputs {
  name: "col1"
  datatype: "BYTES"
  shape: 2
  parameters {
    key: "content_type"
    value {
      string_param: "str"
    }
  }
  contents {
    bytes_contents: "test_string"
    bytes_contents: "test_string"
  }
}
inputs {
  name: "col2"
  datatype: "BYTES"
  shape: 2
  parameters {
    key: "content_type"
    value {
      string_param: "str"
    }
  }
  contents {
    bytes_contents: "test_string"
    bytes_contents: "test_string"
  }
}

But i have the same result on the server:

             col1            col2
0  b'test_string'  b'test_string'
1  b'test_string'  b'test_string'

pablobgar avatar Sep 23 '22 12:09 pablobgar

Hey @pablobgar,

Thanks for trying that out.

Instead of calling the PandasCodec.decode_request(payload) method directly, could you call the self.decode_request(payload, default_codec=PandasCodec) method instead? This method should ensure that each input gets decoded appropriately first .

adriangonz avatar Sep 26 '22 16:09 adriangonz

Hi @adriangonz

I have the same result with self.decode_request.

It seems that decoding is not performed in any case when the content type is bytes and PandasCodec is used. https://github.com/SeldonIO/MLServer/blob/1.1.0/mlserver/codecs/pandas.py#L17-L19

pablobgar avatar Sep 27 '22 11:09 pablobgar

The decoding of the individual columns should occur within the self.decode_request method in the MLModel base class though... That one calls this decode_inference_request helper, which makes a first pass through the entries of the inputs array to see if there is any decoding needed at the input level (e.g. like decoding bytes into strings):

https://github.com/SeldonIO/MLServer/blob/6bb983612720eabff53bbc772591ca7f677ad2e2/mlserver/codecs/utils.py#L129-L130

So it's strange why that still doesn't work for you...

Just to be on the safe side, I've added a small test case on the PR linked below, which should handle the same use case. Do you see any differences with your custom code?

https://github.com/SeldonIO/MLServer/pull/752/files#diff-3b8a58a4d021803b3171b886bb9162fd659e671131f3f61036f9210cb5d0bc7cR186-R204

adriangonz avatar Sep 28 '22 15:09 adriangonz

Hi @adriangonz

Now it is working correctly if I add Parameters(content_type=StringCodec.ContentType). Maybe I did something wrong when I tested it a few days ago, or I didn't test the combination of adding this and use the self.decode_request method, sorry.

This is my code:

data = pd.DataFrame(
    data={
        "col1": ["test_string", "test_string"],
        "col2": ["test_string", "test_string"],
    }
)

i_request = PandasCodec.encode_request(data)
i_request.parameters = types.Parameters(content_type="pd")

for input in i_request.inputs:
    input.parameters = types.Parameters(content_type=StringCodec.ContentType)

But then where I see a problem is the PandasCodec.encode_request(). Shouldn't it add this parameter to columns that are strings automatically?

pablobgar avatar Sep 28 '22 17:09 pablobgar

Hey @pablobgar ,

Thanks for the update!

On the last point, I totally agree. And TBH, the PandasCodec should also be able to decode strings. To put it another way, since it's the only which knows when a series should be encoded as a string, it should also be able to decode a bytes array into a string series.

We'll keep this issue open to tackle this last point.

adriangonz avatar Sep 29 '22 16:09 adriangonz