
[BUG] infer_signature fails when categorical Pandas series contains a null

Open • le1tz3y opened this issue 3 years ago • 2 comments

Thank you for submitting an issue. Please refer to our issue policy for additional information about bug reports. For help with debugging your code, please refer to Stack Overflow.

Please fill in this bug report template to ensure a timely and thorough response.

Willingness to contribute

The MLflow Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the MLflow code base?

  • [ ] Yes. I can contribute a fix for this bug independently.
  • [ ] Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.
  • [X] No. I cannot contribute a bug fix at this time.

System information

  • Have I written custom code (as opposed to using a stock example script provided in MLflow):
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
  • MLflow installed from (source or binary):
  • MLflow version (run mlflow --version):
  • Python version:
  • npm version, if running the dev UI:
  • Exact command to reproduce:

Describe the problem

Describe the problem clearly here. Include descriptions of the expected behavior and the actual behavior.

infer_signature fails when a pandas Series has dtype 'category' and a null value is present. LightGBM and some other packages require categorical columns to be of this dtype.

Code to reproduce issue

Provide a reproducible test case that is the bare minimum necessary to generate the problem.

Code Ex 1:

import numpy as np
import pandas as pd
from mlflow.models.signature import infer_signature

data = pd.DataFrame({"a": ["foo", np.nan]})
data["a"] = data["a"].fillna('').astype('category')
infer_signature(data)

Code Ex 2:

import numpy as np
import pandas as pd
from mlflow.models.signature import infer_signature

data = pd.DataFrame({"a": ["foo", ""]})
data["a"] = data["a"].astype('category')
infer_signature(data)

Other info / logs

Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

---------------------------------------------------------------------------
MlflowException                           Traceback (most recent call last)
<command-905257552727287> in <module>
      7 data["a"] = data["a"].fillna('').astype('category')
      8 print(data["a"].dtype)
----> 9 infer_signature(data)

/databricks/python/lib/python3.8/site-packages/mlflow/models/signature.py in infer_signature(model_input, model_output)
    127     :return: ModelSignature
    128     """
--> 129     inputs = _infer_schema(model_input)
    130     outputs = _infer_schema(model_output) if model_output is not None else None
    131     return ModelSignature(inputs, outputs)

/databricks/python/lib/python3.8/site-packages/mlflow/types/utils.py in _infer_schema(data)
    117     elif isinstance(data, pd.DataFrame):
    118         schema = Schema(
--> 119             [ColSpec(type=_infer_pandas_column(data[col]), name=col) for col in data.columns]
    120         )
    121     elif isinstance(data, np.ndarray):

/databricks/python/lib/python3.8/site-packages/mlflow/types/utils.py in <listcomp>(.0)
    117     elif isinstance(data, pd.DataFrame):
    118         schema = Schema(
--> 119             [ColSpec(type=_infer_pandas_column(data[col]), name=col) for col in data.columns]
    120         )
    121     elif isinstance(data, np.ndarray):

/databricks/python/lib/python3.8/site-packages/mlflow/types/utils.py in _infer_pandas_column(col)
    229             return DataType.string
    230         else:
--> 231             raise MlflowException(
    232                 "Unable to map 'np.object' type to MLflow DataType. np.object can"
    233                 "be mapped iff all values have identical data type which is one "

MlflowException: Unable to map 'np.object' type to MLflow DataType. np.object canbe mapped iff all values have identical data type which is one of (string, (bytes or byterray),  int, float).
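
The error message refers to 'np.object', which suggests the categorical column was treated as an object-typed column. The pandas-only sketch below is an illustration, not a confirmed diagnosis of the MLflow code path: it collects the dtype facts such a check would plausibly look at. The helper name describe_column is hypothetical. pandas reports kind "O" for the categorical dtype; the result of pd.api.types.is_string_dtype on a categorical column has differed between pandas versions, so it is printed rather than assumed.

```python
import pandas as pd


def describe_column(col: pd.Series) -> dict:
    """Collect dtype facts about a column (hypothetical helper for illustration)."""
    return {
        "dtype": str(col.dtype),
        "kind": col.dtype.kind,
        "is_categorical": isinstance(col.dtype, pd.CategoricalDtype),
        # Version-dependent for categorical columns, so only reported here.
        "is_string_dtype": pd.api.types.is_string_dtype(col),
    }


plain = pd.Series(["foo", "bar"])
categorical = plain.astype("category")

print(describe_column(plain))
print(describe_column(categorical))
```

If is_string_dtype comes back False for the categorical column in your environment, that would be consistent with the else branch seen at line 230 of the traceback.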

What component(s), interfaces, languages, and integrations does this bug affect?

Components

  • [ ] area/artifacts: Artifact stores and artifact logging
  • [ ] area/build: Build and test infrastructure for MLflow
  • [ ] area/docs: MLflow documentation pages
  • [ ] area/examples: Example code
  • [ ] area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • [ ] area/models: MLmodel format, model serialization/deserialization, flavors
  • [ ] area/projects: MLproject format, project running backends
  • [ ] area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • [ ] area/server-infra: MLflow Tracking server backend
  • [X] area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • [ ] area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • [ ] area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • [ ] area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • [ ] area/windows: Windows support

Language

  • [ ] language/r: R APIs and clients
  • [ ] language/java: Java APIs and clients
  • [ ] language/new: Proposals for new client languages

Integrations

  • [ ] integrations/azure: Azure and Azure ML integrations
  • [ ] integrations/sagemaker: SageMaker integrations
  • [ ] integrations/databricks: Databricks integrations

le1tz3y, Dec 08 '21 02:12

I would say this is caused not by the null value but by the column being categorical, whether or not it contains nulls. Pandas categorical columns are not yet supported by infer_signature. Try this and you will get the same exception:

import numpy as np
import pandas as pd
from mlflow.models.signature import infer_signature

data = pd.DataFrame({"a": ["foo", "bar"]})
data["a"] = data["a"].astype('category')
infer_signature(data)
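
One possible workaround, sketched here under assumptions and not verified against any particular MLflow version, is to infer the signature from a copy of the data in which categorical columns are cast back to plain Python objects. The helper name decategorize is hypothetical, and the sketch targets string categories like the ones in this issue. The infer_signature call is left as a comment so the block runs without MLflow installed.

```python
import pandas as pd


def decategorize(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with categorical columns cast to object dtype."""
    out = df.copy()
    for name in out.columns:
        if isinstance(out[name].dtype, pd.CategoricalDtype):
            # Values keep their original Python type (str here); only the dtype changes.
            out[name] = out[name].astype(object)
    return out


data = pd.DataFrame({"a": ["foo", "bar"]})
data["a"] = data["a"].astype("category")

converted = decategorize(data)
print(converted.dtypes)

# Assumed usage, not verified here:
# from mlflow.models.signature import infer_signature
# signature = infer_signature(converted)
```

The original frame keeps its category dtype, so it can still be passed to LightGBM for training. Whether a signature inferred this way accepts categorical input at inference time would need to be checked separately.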

olbapjose, Sep 07 '22 10:09

Is it resolved in newer versions?

HaveFunWithCode, Apr 21 '24 11:04