[FR] Add Pandas category dtype to mlflow.types.schema
Willingness to contribute
The MLflow Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature (either as an MLflow Plugin or an enhancement to the MLflow code base)?
- [ ] Yes. I can contribute this feature independently.
- [X] Yes. I would be willing to contribute this feature with guidance from the MLflow community.
- [ ] No. I cannot contribute this feature at this time.
Proposal Summary
Add the Pandas category dtype to mlflow.types.schema.
I currently have an ML model based on LightGBM running in a production environment that contains categorical variables. Instead of label-encoding or one-hot encoding the categorical variables, I'm using a Pandas DataFrame with the categorical columns set to the category dtype:
```python
df['some_categorical_column'] = df['some_categorical_column'].astype('category')
```
I'm trying to deploy this model with Databricks Model Serving, which uses MLflow as a backend, but the category dtype is currently not supported in model signatures according to the available dtypes at https://www.mlflow.org/docs/latest/_modules/mlflow/types/schema.html, so I'm unable to cast the prediction DataFrame to match the training phase.
This makes me think that models trained with LightGBM using the Pandas category dtype cannot be deployed correctly: since the dtype is not available, a casting error would occur during prediction, leading to incorrect scores/predictions.
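To illustrate the risk, here is a minimal pandas-only sketch (the column values are made up for demonstration) showing that when the category dtype is inferred independently on training and inference data, the same value can map to different integer codes:

```python
import pandas as pd

# Training data sees three colors; inference data only sees two.
train = pd.Series(["blue", "green", "red"]).astype("category")
infer = pd.Series(["green", "red"]).astype("category")

# The integer codes for the *same* value differ between the two series,
# so a model consuming the codes would silently receive wrong inputs:
print(list(train.cat.codes))  # [0, 1, 2] -> "green" is code 1
print(list(infer.cat.codes))  # [0, 1]    -> "green" is now code 0
```

This is why a signature type for categoricals would also need to carry the dictionary of category values, not just the dtype name.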
I could of course one-hot encode the data and sidestep the problem, but I'm trying to deploy the current model to MLflow as-is. I'm not sure if this is a FR or something else.
What component(s), interfaces, languages, and integrations does this feature affect?
Components
- [ ] `area/artifacts`: Artifact stores and artifact logging
- [ ] `area/build`: Build and test infrastructure for MLflow
- [ ] `area/docs`: MLflow documentation pages
- [ ] `area/examples`: Example code
- [X] `area/model-registry`: Model Registry service, APIs, and the fluent client calls for Model Registry
- [X] `area/models`: MLmodel format, model serialization/deserialization, flavors
- [ ] `area/projects`: MLproject format, project running backends
- [ ] `area/scoring`: Local serving, model deployment tools, spark UDFs
- [ ] `area/server-infra`: MLflow server, JavaScript dev server
- [ ] `area/tracking`: Tracking Service, tracking client APIs, autologging
@tomasatdatabricks @dbczumar is this something that should be supported?
Possibly yes, but it needs a bit of work.
The tricky part about categoricals is that the dictionary can differ between training and inference data. So you need to store the mapping seen during training and make sure you map the data the same way at inference time. You also need to handle categories unseen during training. We would also need to figure out where to store these additional dictionaries, probably as additional artifacts. It can be done, but it is not clear how often people would use this instead of something like a scikit-learn pipeline.
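The mapping step described above can be sketched in plain pandas (the column name and values are hypothetical; where `saved_categories` actually gets stored as an artifact is the open design question):

```python
import pandas as pd

# Training time: fit the categorical and remember its dictionary.
train = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
train["color"] = train["color"].astype("category")
saved_categories = list(train["color"].cat.categories)  # would need to be logged as an artifact

# Inference time: re-apply the *training* dictionary so the codes line up.
infer = pd.DataFrame({"color": ["green", "purple"]})  # "purple" was never seen in training
infer["color"] = pd.Categorical(infer["color"], categories=saved_categories)

# Unseen categories become NaN (code -1) and must be handled explicitly.
print(infer["color"].cat.codes.tolist())
```

Handling the NaN produced for unseen categories (drop, impute, or error out) is a policy decision that the signature alone cannot make.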
Any progress on this? I think it would be useful, as some algorithms support null values directly (e.g. XGBoost, which now also supports categorical variables in a straightforward way).
I would also like to have this feature, to be able to easily deploy LightGBM models that declare features as categorical to use optimal bin sizes (see https://lightgbm.readthedocs.io/en/latest/Advanced-Topics.html#categorical-feature-support).
I know this is an old issue, but I would love to see support for this as well. auto-sklearn also makes use of the category dtype, adding to the list of libraries that directly leverage this functionality.
As an intermediate solution, it could be possible to accept the category type by casting it to a generic object dtype and ignoring the dictionaries of category values. This obviously isn't ideal, but it is the local workaround I am currently using to successfully pass my inputs to infer_signature.
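For reference, the cast described above can be sketched like this (pandas-only; the column names are hypothetical, and the MLflow call is left as a comment since the point is the dtype conversion):

```python
import pandas as pd

df = pd.DataFrame({
    "color": pd.Categorical(["red", "green"]),
    "size": [1.0, 2.0],
})

# Downcast every categorical column to a plain object dtype; the category
# dictionary is thrown away, which is exactly the trade-off described above.
cat_cols = df.select_dtypes(include="category").columns
df[cat_cols] = df[cat_cols].astype(object)

# The frame can now be passed to infer_signature, which will see the
# column as a generic string/object column:
# signature = mlflow.models.infer_signature(df)
print(df.dtypes["color"])
```

The obvious cost is that the signature no longer records which values are valid for the column, so schema enforcement cannot catch unseen categories.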
@tomasatdatabricks What are your latest thoughts here?
@dbczumar seems reasonable to support this based on the feedback.
Any updates for this request?
Any updates on this feature?
Also interested in an update on this :)
Would love an update on this as well. I spent quite a bit of time trying to debug an issue with serving a LightGBM model with categoricals via MLflow, only to find out that this apparently is not possible.
Would this be related to an issue that I have with running evaluate on an XGBoost model? I'm getting the error:

```
File "C:\Python310\lib\site-packages\mlflow\models\utils.py", line 577, in _enforce_named_col_schema
    new_pf_input[x] = _enforce_mlflow_datatype(x, pf_input[x], input_types[x])
File "C:\Python310\lib\site-packages\mlflow\models\utils.py", line 554, in _enforce_mlflow_datatype
    raise MlflowException(
mlflow.exceptions.MlflowException: Incompatible input types for column Feature_RecruiterState. Can not safely convert category to <U0.
```
As an update to this, using the "category" type works with MLflow.
> Would this be related to an issue that I have with running evaluate on an XGBoost model? Getting the error: `mlflow.exceptions.MlflowException: Incompatible input types for column Feature_RecruiterState. Can not safely convert category to <U0.`
I had the same error with LightGBM. In the end I managed to fix it by specifying the categorical features explicitly, using the `categorical_feature` argument.