
[FR] Add Pandas category dtype to mlflow.types.schema

Open henriqueluzz opened this issue 4 years ago • 14 comments

Willingness to contribute

The MLflow Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature (either as an MLflow Plugin or an enhancement to the MLflow code base)?

  • [ ] Yes. I can contribute this feature independently.
  • [X] Yes. I would be willing to contribute this feature with guidance from the MLflow community.
  • [ ] No. I cannot contribute this feature at this time.

Proposal Summary

Add the Pandas category dtype to mlflow.types.schema.

I currently have an ML model based on LightGBM, running in a production environment, that contains categorical variables. Instead of label-encoding or one-hot encoding (OHE) the categorical variables, I'm using a Pandas DataFrame with the categorical columns set to the category dtype:

df['some_categorical_column'] = df['some_categorical_column'].astype('category')

I'm trying to deploy this model on Databricks Serve Model, which uses MLflow as its backend, but the category dtype is currently not supported for model signatures according to the dtypes listed at https://www.mlflow.org/docs/latest/_modules/mlflow/types/schema.html, so I'm unable to cast the prediction DataFrame the same way as in the training phase.

This makes me think that models trained with LightGBM on Pandas category dtypes cannot be deployed correctly, since this dtype is not available: a casting error would occur during prediction, leading to incorrect scores/predictions.

I could of course OHE the data and sidestep the problem, but I'm trying to deploy the current model to MLflow. I'm not sure whether this is an FR or something else.


What component(s), interfaces, languages, and integrations does this feature affect?

Components

  • [ ] area/artifacts: Artifact stores and artifact logging
  • [ ] area/build: Build and test infrastructure for MLflow
  • [ ] area/docs: MLflow documentation pages
  • [ ] area/examples: Example code
  • [X] area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • [X] area/models: MLmodel format, model serialization/deserialization, flavors
  • [ ] area/projects: MLproject format, project running backends
  • [ ] area/scoring: Local serving, model deployment tools, spark UDFs
  • [ ] area/server-infra: MLflow server, JavaScript dev server
  • [ ] area/tracking: Tracking Service, tracking client APIs, autologging

henriqueluzz avatar Dec 15 '20 13:12 henriqueluzz

@tomasatdatabricks @dbczumar is this something that should be supported?

dmatrix avatar Dec 16 '20 18:12 dmatrix

Possibly yes, but it needs a bit of work.

The tricky part about categoricals is that the dictionary can differ between training and inference data. So you need to store the mapping seen during training and make sure you map the data the same way at inference time. You also need to handle categories unseen during training. We would also need to figure out where to store these additional dictionaries, probably as additional artifacts. It can be done, but it is not clear how often people would use this instead of something like a scikit-learn pipeline.
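A minimal sketch of the mapping problem described above, using only pandas. The column name, the sample values, the `category_maps` dictionary, and the helper `apply_training_categories` are all hypothetical and chosen for illustration; this is not how MLflow stores or applies anything.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Training data: the category dictionary is derived from the values seen here.
train = pd.DataFrame({"city": ["Lisbon", "Porto", "Lisbon", "Faro"]})
train["city"] = train["city"].astype("category")

# Record the dictionary seen during training (hypothetical storage: a plain
# dict that could be written out, e.g. as JSON, alongside the model).
category_maps = {"city": list(train["city"].cat.categories)}


def apply_training_categories(df, category_maps):
    """Cast columns to the categorical dtypes recorded at training time."""
    out = df.copy()
    for column, categories in category_maps.items():
        dtype = CategoricalDtype(categories=categories)
        # Values outside `categories` become NaN instead of new categories.
        out[column] = out[column].astype(dtype)
    return out


inference = pd.DataFrame({"city": ["Porto", "Braga"]})

# Naive cast: the dictionary is rebuilt from the inference data alone,
# so the integer codes no longer match the ones used in training.
naive_codes = list(inference["city"].astype("category").cat.codes)

# Cast with the stored dictionary: codes match training, unseen value is NaN.
scored = apply_training_categories(inference, category_maps)
stored_codes = list(scored["city"].cat.codes)
```

Here "Porto" gets a different integer code under the naive cast than it had in training, while the stored dictionary keeps the training code and turns the unseen "Braga" into a missing value. How unseen categories should actually be handled is a design decision this sketch does not settle.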

tomasatdatabricks avatar Dec 17 '20 01:12 tomasatdatabricks

Any progress on this? I think it would be useful, as some algorithms support null values directly (e.g. XGBoost, which now also supports categorical variables in a straightforward way).

olbapjose avatar Sep 07 '22 10:09 olbapjose

I would also like to have this feature, to be able to easily deploy LGBM models that use features declared as categorical in order to use optimal bin sizes (https://lightgbm.readthedocs.io/en/latest/Advanced-Topics.html#categorical-feature-support).

frsnjung avatar Sep 22 '22 10:09 frsnjung

I know this is an old issue, but I would love to see support for this as well. auto-sklearn also makes use of the category dtype, to add to the list of libraries that directly leverage this functionality.

As an intermediate solution, it could be possible to accept the category type by casting to a generic object dtype and ignoring the dictionaries of category values. This obviously isn't ideal, but it is the local workaround that I am currently using to successfully pass my inputs to infer_signature.
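One way that workaround might look, as a sketch using only pandas. The helper name `categories_to_object` and the sample columns are hypothetical; the resulting DataFrame is what would then be handed to infer_signature.

```python
import pandas as pd


def categories_to_object(df):
    """Return a copy of df with every category column cast to object dtype."""
    out = df.copy()
    for column in out.select_dtypes(include="category").columns:
        # The cast keeps the values but drops the dictionary of categories.
        out[column] = out[column].astype(object)
    return out


df = pd.DataFrame({"color": ["red", "blue", "red"], "size": [1.0, 2.5, 3.0]})
df["color"] = df["color"].astype("category")

plain = categories_to_object(df)
```

Since the category dictionary is discarded, the model or a preprocessing step still has to restore the categorical dtype before prediction.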

eliwoods avatar Nov 30 '22 22:11 eliwoods

@tomasatdatabricks What are your latest thoughts here?

dbczumar avatar Nov 30 '22 22:11 dbczumar

@dbczumar seems reasonable to support this based on the feedback.

tomasatdatabricks avatar Dec 01 '22 00:12 tomasatdatabricks

Any updates for this request?

tmg-ling avatar Jan 18 '23 16:01 tmg-ling

Any updates on this feature?

atolnix avatar Oct 13 '23 09:10 atolnix

Also interested in an update on this :)

lene-hansen avatar Nov 29 '23 08:11 lene-hansen

Would love to have an update on this as well. I spent quite a bit of time debugging an issue with serving a LightGBM model with categoricals via MLflow, only to find out that this is apparently not possible.

RashidBakirov avatar Nov 29 '23 10:11 RashidBakirov

Would this be related to an issue that I have with running evaluate on an XGBoost model? Getting the error:

File "C:\Python310\lib\site-packages\mlflow\models\utils.py", line 577, in _enforce_named_col_schema
    new_pf_input[x] = _enforce_mlflow_datatype(x, pf_input[x], input_types[x])
File "C:\Python310\lib\site-packages\mlflow\models\utils.py", line 554, in _enforce_mlflow_datatype
    raise MlflowException(
mlflow.exceptions.MlflowException: Incompatible input types for column Feature_RecruiterState. Can not safely convert category to <U0.

steffyd avatar Mar 25 '24 23:03 steffyd

As an update to this, using "category" type works with MLFlow.

Would this be related to an issue that I have with running evaluate on an XGBoost model? Getting the error:

File "C:\Python310\lib\site-packages\mlflow\models\utils.py", line 577, in _enforce_named_col_schema
    new_pf_input[x] = _enforce_mlflow_datatype(x, pf_input[x], input_types[x])
File "C:\Python310\lib\site-packages\mlflow\models\utils.py", line 554, in _enforce_mlflow_datatype
    raise MlflowException(
mlflow.exceptions.MlflowException: Incompatible input types for column Feature_RecruiterState. Can not safely convert category to <U0.

I had the same error with LightGBM. In the end I managed to fix this by specifying the categorical features explicitly, using the "categorical_feature" argument.
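The comment above does not give details, so the following is only one possible reading of that fix: encode the categorical columns as integer codes from a fixed dictionary, then name those columns through the categorical_feature argument of lightgbm.Dataset. The helper names, the column names, the sample data, and the training parameters are all hypothetical.

```python
import pandas as pd


def encode_categoricals(df, category_maps):
    """Replace the listed columns by integer codes from a fixed dictionary."""
    out = df.copy()
    for column, categories in category_maps.items():
        as_category = pd.Categorical(out[column], categories=categories)
        # Values not in `categories` (or missing) get the code -1.
        out[column] = as_category.codes.astype("int32")
    return out


def train_with_explicit_categoricals(X, y, categorical_columns):
    """Train LightGBM while naming the categorical columns explicitly."""
    import lightgbm as lgb  # imported here so the encoding step runs without it

    train_set = lgb.Dataset(X, label=y, categorical_feature=categorical_columns)
    params = {"objective": "binary", "verbose": -1}  # illustrative parameters
    return lgb.train(params, train_set, num_boost_round=10)


category_maps = {"state": ["CA", "NY", "TX"]}
raw = pd.DataFrame({"state": ["NY", "TX", "WA"], "age": [31, 45, 27]})
X = encode_categoricals(raw, category_maps)
```

With this encoding the input columns are plain integers rather than the category dtype, so a call such as `train_with_explicit_categoricals(X, y, ["state"])` tells LightGBM which columns are categorical without relying on the pandas dtype.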

RashidBakirov avatar Jul 12 '24 18:07 RashidBakirov