
[AIR] XGBoost/LightGBM predictors lose dtype information

Open Yard1 opened this issue 3 years ago • 0 comments

What happened + What you expected to happen

XGBoost and LightGBM have special support for categorical features. If a pandas dataframe with Categorical columns is passed during training, then at prediction time a dataframe with exactly those column dtypes is expected.
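
For illustration (not part of the original report), this is roughly how LightGBM enforces that requirement; the exact error may differ by version:

import lightgbm as lgbm
import pandas as pd

train = pd.DataFrame({"A": [1, 3, 5], "B": [2, 4, 6]})
train["A"] = train["A"].astype("category")  # integer-valued categorical column
booster = lgbm.LGBMClassifier(n_estimators=2).fit(train, [0, 1, 0]).booster_

# Predicting with the same dtypes as training works.
booster.predict(train)

# Predicting with "A" as a plain int column is rejected, because the
# categorical features no longer match those seen during training.
booster.predict(pd.DataFrame({"A": [1, 3, 5], "B": [2, 4, 6]}))  # raises ValueError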

Inside the XGBoost/LightGBM predictors, we always convert the batch to numpy first and then back to pandas. This is undesirable, as it causes dtype information to be lost. We attempt to infer the dtypes again later, but categorical columns composed of integers get classified as int, which then leads to an exception because the dtype requirement above is no longer satisfied. Instead, we should only convert to numpy when necessary.
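
A minimal sketch of that direction (a hypothetical _convert_to_dataframe helper, not the actual predictor code):

import numpy as np
import pandas as pd

def _convert_to_dataframe(batch, feature_columns=None):
    # Hypothetical sketch: reuse pandas input as-is so Categorical (and other
    # extension) dtypes survive, and only build a DataFrame for array input.
    if isinstance(batch, pd.DataFrame):
        df = batch
    else:
        df = pd.DataFrame(np.asarray(batch))
    if feature_columns is not None:
        # Plain column selection preserves the original dtypes.
        df = df[feature_columns]
    return df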

We also don't currently test either predictor with categorical features, including this case, where the categories may be ints.
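
For reference, a standalone illustration (not from the original report) of how an integer-valued categorical column loses its dtype through a numpy round-trip and gets re-inferred as a plain int:

import pandas as pd

df = pd.DataFrame({"A": [1, 3, 5]}).astype({"A": "category"})
print(df.dtypes)  # A is category

# Roughly what the predictors do today: convert to numpy, rebuild a
# DataFrame, and re-infer the dtypes.
round_tripped = pd.DataFrame(df.to_numpy(), columns=df.columns).infer_objects()
print(round_tripped.dtypes)  # A comes back as int64, not category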

Quick patch - https://github.com/anyscale/hackathon-2022-automl/blob/main/ray_automl/components.py#L239

Versions / Dependencies

master

Reproduction script

import lightgbm as lgbm
import numpy as np
import pandas as pd
from ray.data.preprocessor import Preprocessor
from ray.train.lightgbm import LightGBMPredictor

dummy_data = np.array([[1, 2], [3, 4], [5, 6]])
dummy_target = np.array([0, 1, 0])

class DummyPreprocessor(Preprocessor):
    def transform_batch(self, df):
        self._batch_transformed = True
        return df

pandas_data = pd.DataFrame(dummy_data, columns=["A", "B"])
pandas_data["A"] = pandas_data["A"].astype("category")
pandas_target = pd.Series(dummy_target)
pandas_model = (
    lgbm.LGBMClassifier(n_estimators=10).fit(pandas_data, pandas_target).booster_
)
preprocessor = DummyPreprocessor()
predictor = LightGBMPredictor(model=pandas_model, preprocessor=preprocessor)
data_batch = pd.DataFrame(
    np.array([[1, 2, 2], [3, 4, 8], [5, 6, 9]]), columns=["A", "B", "C"]
)
data_batch["A"] = data_batch["A"].astype("category")
predictions = predictor.predict(data_batch, feature_columns=["A", "B"])

assert len(predictions) == 3
assert hasattr(predictor.get_preprocessor(), "_batch_transformed")

Issue Severity

Medium: It is a significant difficulty but I can work around it.

Yard1 · Sep 19, 2022