yocto-gl icon indicating copy to clipboard operation
yocto-gl copied to clipboard

[FR] Add argument (e.g. `save_format`) to specify model save format to `mlflow.xgboost.log_model`

Open harupy opened this issue 2 years ago • 7 comments

Willingness to contribute

No. I cannot contribute this feature at this time.

Proposal Summary

The XGBoost model file name is currently hardcoded as model.xgb:

https://github.com/mlflow/mlflow/blob/cf96d39d81932423d70d267f8a4ab6640e14ec7e/mlflow/xgboost/init.py#L154

This makes it impossible to save an XGBoost model in JSON format. A new argument to specify saving format should be added and the line above should be fixed as follows:

- model_data_subpath = "model.xgb" 
+ model_data_subpath = f"model.{save_format}" 

References:

  • https://xgboost.readthedocs.io/en/stable/tutorials/saving_model.html
  • https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBClassifier.save_model

Motivation

What is the use case for this feature?

To provide a way to save an XGBoost model in JSON format.

Why is this use case valuable to support for MLflow users in general?

Why is this use case valuable to support for your project(s) or organization?

Why is it currently difficult to achieve this use case?

Details

No response

What component(s) does this bug affect?

  • [ ] area/artifacts: Artifact stores and artifact logging
  • [ ] area/build: Build and test infrastructure for MLflow
  • [ ] area/docs: MLflow documentation pages
  • [ ] area/examples: Example code
  • [ ] area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • [X] area/models: MLmodel format, model serialization/deserialization, flavors
  • [ ] area/pipelines: Pipelines, Pipeline APIs, Pipeline configs, Pipeline Templates
  • [ ] area/projects: MLproject format, project running backends
  • [ ] area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • [ ] area/server-infra: MLflow Tracking server backend
  • [ ] area/tracking: Tracking Service, tracking client APIs, autologging

What interface(s) does this bug affect?

  • [ ] area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • [ ] area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • [ ] area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • [ ] area/windows: Windows support

What language(s) does this bug affect?

  • [ ] language/r: R APIs and clients
  • [ ] language/java: Java APIs and clients
  • [ ] language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • [ ] integrations/azure: Azure and Azure ML integrations
  • [ ] integrations/sagemaker: SageMaker integrations
  • [ ] integrations/databricks: Databricks integrations

harupy avatar Aug 29 '22 06:08 harupy

What about add a param to log_model like "model_format" ? Available values can be "json" or "binary". And we can consider add this param to autolog.

WeichenXu123 avatar Aug 29 '22 09:08 WeichenXu123

@WeichenXu123 That sounds good. I'll fix the PR title and description.

harupy avatar Aug 29 '22 09:08 harupy

@BenWilson2 @dbczumar @harupy @WeichenXu123 Please assign a maintainer and start triaging this issue.

mlflow-automation avatar Sep 06 '22 00:09 mlflow-automation

Just a comment: maybe this would also fix the error thrown when trying to xgboost.log_model with an XGBoost model with categorical variables using the (experimental) support for DataFrame category data type. The error says:

xgboost.core.XGBoostError: [16:56:47] ../src/tree/tree_model.cc:871: Check failed: !HasCategoricalSplit(): 
Please use JSON/UBJSON for saving models with categorical splits.

olbapjose avatar Sep 08 '22 17:09 olbapjose

@olbapjose Awesome! Would you be interested in contributing support for this feature?

dbczumar avatar Sep 08 '22 17:09 dbczumar

@dbczumar Thanks a lot but unfortunately I do not have time these weeks... despite I have been struggling very hard to log an XGBoost model and then retrieve it. No way with Spark XGBoost (it was possible to log it but not to retrieve it when it is a multi-class model), and no way with python xgboost as it contains category columns...

olbapjose avatar Sep 08 '22 17:09 olbapjose

@olbapjose Got it! Understood! Thank you for your feedback; we're excited about getting this issue resolved.

dbczumar avatar Sep 08 '22 17:09 dbczumar

Hey @dbczumar @harupy can I work on this issue?

AvikantSrivastava avatar Oct 13 '22 14:10 AvikantSrivastava

@AvikantSrivastava Yes, go ahead!

harupy avatar Oct 13 '22 14:10 harupy

@harupy @dbczuma can you take a look at the PR https://github.com/mlflow/mlflow/pull/7068 ?

couple of quick questions

  • Do we have any example to test out xgboost model? I want to do end-to-end testing
  • Looks like this might require docs update, where to update those?
  • I have used a different name for param model_format instead of save_format is that okay?

AvikantSrivastava avatar Oct 16 '22 22:10 AvikantSrivastava

Do we have any example to test out xgboost model? I want to do end-to-end testing

https://github.com/mlflow/mlflow/blob/master/examples/xgboost/xgboost_native/train.py

Looks like this might require docs update, where to update those?

We can add a parameter docstring.

def f(...):
    """
    ...
    :param model_format: ...
    """

I have used a different name for param model_format instead of save_format is that okay?

Sounds good :)

harupy avatar Oct 17 '22 01:10 harupy