Support pandas categorical datatype

Open johnpaulett opened this issue 3 years ago • 6 comments

Looking for feedback to possibly create a PR.

Problem:

LightGBM expects inference requests to set columns to category if the model was trained with categorical column(s). As is, mlserver returns an error for a categorical LightGBM model (ValueError: train and valid dataset categorical_feature do not match), because mlserver.codecs.pandas.PandasCodec.decode_request simply treats the values as an object/float64/int64 dtype.

I've opened an issue with LightGBM (https://github.com/microsoft/LightGBM/issues/5244) as I believe that LightGBM has enough information in its saved model file to do this conversion itself, but so far the discussion seems to push this towards being the responsibility of the caller of LightGBM.

Opportunity

However, I think mlserver's Content Type support possibly could allow me to work around this by explicitly setting the content_type / datatype to "category" the same way I can set it to str or FP32. E.g. if the mlserver.codecs.pandas.PandasCodec.to_series would use pd.Series(payload, dtype="category") it would pass categorical data into LightGBM.
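
A minimal sketch of the dtype difference described above, using only pandas. The `payload` values are illustrative and not taken from a real request; this is not MLServer's actual to_series implementation.

```python
import pandas as pd

# Illustrative payload, as the "data" of a single input might arrive.
payload = ["New York", "London", "New York"]

# Default construction: the strings are not marked as categorical.
plain = pd.Series(payload)

# Explicit categorical dtype, as proposed for to_series.
categorical = pd.Series(payload, dtype="category")

print(plain.dtype)
print(categorical.dtype)                 # category
print(list(categorical.cat.categories))  # ['London', 'New York']
```

Only the second series carries the category dtype that LightGBM checks for.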

Options

Looking for feedback. I see several options:

  1. Adjust to_series / to_dtype to handle CATEGORY datatype (though would this violate the v2 spec?)
  2. Set as a content_type more similar to the input-level StringCodec? So the request would have a "content_type": "pd" with one or more inputs as "content_type": "category".
  3. Adjust the lightgbm runtime to read the saved model file and use DataFrame.astype() to convert any trained categorical column to category in the request dataframe. Personally, this feels like a hacky workaround for what LightGBM should be doing internally, and I doubt such a solution should end up in mlserver. A proof-of-concept in kserve's lgbserver: https://github.com/kserve/kserve/pull/2208
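
The DataFrame.astype() step in option 3 could look roughly like the sketch below. It assumes the names of the trained categorical columns are already known; how they would be read from the saved model file is not shown, and `cast_categorical` and the example columns are hypothetical names for illustration only.

```python
from typing import Iterable

import pandas as pd


def cast_categorical(df: pd.DataFrame, categorical_columns: Iterable[str]) -> pd.DataFrame:
    """Return a copy of df with the named columns cast to the category dtype."""
    # Skip names that are absent from the request to avoid a KeyError in astype.
    present = [name for name in categorical_columns if name in df.columns]
    return df.astype({name: "category" for name in present})


# Hypothetical request dataframe; "city" stands in for a trained categorical column.
request_df = pd.DataFrame({"city": ["New York", "London"], "age": [31, 45]})
converted = cast_categorical(request_df, ["city"])
print(converted.dtypes)
```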

I believe that I could package options 1/2 in a custom image, but it feels like it may be valuable to add to mlserver?

Any suggestions or ideas are welcome. I'd be open to coding a PR for route 2. I'm probably going to try this approach in a custom image as a short-term workaround anyway, but open to contributing upstream.

johnpaulett avatar Jun 14 '22 09:06 johnpaulett

Here is code for option 3: https://github.com/johnpaulett/MLServer/commit/f9e4f72a552bd43a613386a214e929cceb7e0a13. I suspect this is undesirable to merge into MLServer's lightgbm runtime, since it requires far too much internal knowledge about the LightGBM .bst file.

I started exploring option 2, but am concerned about what to do with a category of str (a Category InputCodec could need to wrap both StringCodec and the PandasCodec's _to_series):

data = {
  "parameters": {
    "content_type": "pd"
  },
  "inputs": [
    {
      "name": "procedureId",
      "data": ["New York", "London", "New York"],
      "datatype": "BYTES",
      "parameters": {
        "content_type": "category"
      },
      "shape": [1],
    },
   ...
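
A rough sketch of how a request like the one above could be turned into a DataFrame, honouring a per-input "content_type": "category". It works on plain dicts rather than MLServer's request types, `inputs_to_dataframe` is a hypothetical helper (not part of MLServer), the "age" input is added for illustration, and it assumes every BYTES input carries UTF-8 text.

```python
import pandas as pd


def inputs_to_dataframe(inputs: list) -> pd.DataFrame:
    """Build a DataFrame with one column per input (hypothetical helper)."""
    columns = {}
    for request_input in inputs:
        values = request_input["data"]
        if request_input["datatype"] == "BYTES":
            # Assumption: BYTES values are UTF-8 text, not arbitrary binary data.
            values = [v.decode("utf-8") if isinstance(v, bytes) else v for v in values]
        content_type = request_input.get("parameters", {}).get("content_type")
        # Only inputs explicitly tagged as "category" get the categorical dtype.
        dtype = "category" if content_type == "category" else None
        columns[request_input["name"]] = pd.Series(values, dtype=dtype)
    return pd.DataFrame(columns)


example_inputs = [
    {
        "name": "procedureId",
        "data": ["New York", "London", "New York"],
        "datatype": "BYTES",
        "parameters": {"content_type": "category"},
    },
    # Illustrative second input, not part of the original request.
    {"name": "age", "data": [31, 45, 27], "datatype": "INT64"},
]

frame = inputs_to_dataframe(example_inputs)
print(frame.dtypes)
```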

johnpaulett avatar Jun 14 '22 12:06 johnpaulett

Hey @johnpaulett ,

Now that #630 has been merged, should we close this one?

adriangonz avatar Jun 15 '22 13:06 adriangonz

@adriangonz This is a separate problem from #630.

#630 was an issue with None -> np.nan.

This is an issue with specifying columns as Categorical in pandas. I have a hacky workaround (https://github.com/johnpaulett/MLServer/commit/f9e4f72a552bd43a613386a214e929cceb7e0a13), but could use some thoughts on whether there is some way to use the Content Types to indicate an input should be converted using pd.Series(..., dtype="category") -- perhaps in the PandasCodec or some new CategoryCodec.

johnpaulett avatar Jun 15 '22 16:06 johnpaulett

Hey @johnpaulett ,

Sorry for the delay getting back to you.

I see what you mean now, and I totally agree that option 2 looks like the clear winner here. The Pandas Codec could then pick up that info from the input's content_type.

Thinking out loud, is there any extra info that Pandas needs for categorical features? E.g. like the full set of valid categories?

adriangonz avatar Jun 27 '22 09:06 adriangonz

is there any extra info that Pandas needs for categorical features? E.g. like the full set of valid categories?

Not for my LightGBM use case -- that library will "realign" the categories to match the trained categories, but it just needs that input DataFrame to have the same columns marked as dtype category. In my limited experience with the pandas Categorical, I think it mostly refers to the categories in the given dataframe, so I don't think having a wider list of categories would be needed, but I'm not 100% sure.

The one item that I am a little concerned about with this approach is that I will probably assume that any "datatype": "BYTES" is a UTF8 str (e.g. use the StringCodec) when content_type: 'pd'. Practically, I suspect it is unlikely that many BYTES use cases (e.g. image data) would be sent with a categorical content_type.

johnpaulett avatar Jun 27 '22 12:06 johnpaulett

Thinking out loud, is there any extra info that Pandas needs for categorical features? E.g. like the full set of valid categories?

This is a good point. I can imagine a scenario where some categories are missing from one request to the next, and the series encoding, without any additional information, might then represent the data in incompatible ways from request to request (need to check...).
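
A quick pandas-only check of this concern, as a sketch. The request values and the fixed category list are made up for illustration. When categories are inferred per request, the same value can receive a different integer code; an explicit CategoricalDtype keeps the codes stable.

```python
import pandas as pd

# Two illustrative requests that see different subsets of the categories.
request_a = ["London", "New York"]
request_b = ["New York", "Paris"]

# Categories inferred separately from each request's data.
inferred_a = pd.Series(request_a, dtype="category")
inferred_b = pd.Series(request_b, dtype="category")
print(inferred_a.cat.codes.tolist())  # [0, 1] -> "New York" is code 1
print(inferred_b.cat.codes.tolist())  # [0, 1] -> "New York" is code 0

# Assumption: the full category list is known ahead of time.
fixed = pd.CategoricalDtype(categories=["London", "New York", "Paris"])
fixed_a = pd.Series(request_a, dtype=fixed)
fixed_b = pd.Series(request_b, dtype=fixed)
print(fixed_a.cat.codes.tolist())     # [0, 1] -> "New York" is code 1
print(fixed_b.cat.codes.tolist())     # [1, 2] -> "New York" is code 1
```

Per the earlier comment, LightGBM realigns categories to the trained ones, so the drift in inferred codes would mainly matter for consumers that read the codes directly.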

jklaise avatar Jun 05 '23 10:06 jklaise