mml spark Incorrect lightgbm predictions

Open alzio2607 opened this issue 3 years ago • 2 comments

Pretext

I trained the model using a dataset created from a pandas df, viz.

train_dataset = lgb.Dataset(
    df,
    label=df["label"],
    weight=df["weight"].values,
    free_raw_data=True,
)

and then trained it with

model = lgb.train(
    config["hyperparams"],
    train_dataset,
    verbose_eval=0,
    valid_sets=[test_dataset],
)

I write the model string to a file on hdfs, named model.lgb

Objective

To load this file from hdfs with mmlSpark and make predictions.

Expectation

The predictions on the pandas df should match with the predictions on spark df in mmlSpark.
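To make "should match" concrete, the two sets of predictions can be compared row by row with a tolerance. The following is a minimal sketch, not code from this issue: `compare_predictions` and the sample numbers are hypothetical, and it assumes both lists hold the positive-class probability for the same rows in the same order (Spark does not guarantee row order without an explicit sort, so rows would need to be aligned on a key first).

```python
import math


def compare_predictions(pandas_probs, spark_probs, abs_tol=1e-6):
    """Return (row_index, pandas_value, spark_value) for every row that disagrees.

    Hypothetical helper: both inputs are plain lists of positive-class
    probabilities, already aligned so that index i refers to the same row.
    """
    if len(pandas_probs) != len(spark_probs):
        raise ValueError("prediction lists differ in length")
    mismatches = []
    for i, (p, s) in enumerate(zip(pandas_probs, spark_probs)):
        # Absolute tolerance only, since probabilities live in [0, 1]
        if not math.isclose(p, s, rel_tol=0.0, abs_tol=abs_tol):
            mismatches.append((i, p, s))
    return mismatches


# Made-up numbers, for illustration only
pandas_probs = [0.12, 0.80, 0.55]
spark_probs = [0.12, 0.31, 0.55]
mismatches = compare_predictions(pandas_probs, spark_probs)
print(mismatches)
```

An empty result means the two paths agree within the tolerance; a non-empty one points at the specific rows worth inspecting.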

What's actually happening is that the predictions don't match.

The code that I am running on pyspark is:

from mmlspark.lightgbm import LightGBMClassifier, LightGBMClassificationModel
from pyspark.ml.feature import StringIndexer, VectorAssembler
model_path = "hdfs://nameservice1/user/admin/model.lgb"
model = LightGBMClassificationModel.loadNativeModelFromFile(model_path)

data = spark.sql("select * from rpm_misc.mml_test")


# ft is the list of feature column names
features=[]

for f in ft:
    string_indexer = StringIndexer(inputCol=f, outputCol=f + "_index")
    model_si = string_indexer.fit(data)
    data = model_si.transform(data)
    features.append(f + "_index")


vector_assembler = VectorAssembler(inputCols=features, outputCol="features")
data = vector_assembler.transform(data)
preds = model.transform(data)

pred_probs = preds.select("probability").rdd.flatMap(lambda x: x).collect()
print(pred_probs)
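One thing worth checking, which is not confirmed anywhere in this thread: `StringIndexer` by default orders labels by descending frequency, so the most frequent value in the data it is fitted on gets index 0. If the features were encoded differently when the model was trained (for example alphabetically, as pandas category codes are for string columns), the same raw value maps to a different number and the model sees different inputs. The sketch below only mimics the two orderings in plain Python; `frequency_desc_index`, `alphabetical_index`, and the sample column are hypothetical and do not call Spark or pandas.

```python
from collections import Counter


def frequency_desc_index(values):
    """Most frequent label gets 0.0, mimicking StringIndexer's default ordering."""
    counts = Counter(values)
    # Sort by descending count; the alphabetical tie-break is an assumption
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {label: float(i) for i, label in enumerate(ordered)}


def alphabetical_index(values):
    """Labels numbered in sorted order, like pandas category codes for strings."""
    return {label: i for i, label in enumerate(sorted(set(values)))}


# Made-up column, for illustration only
column = ["red", "red", "red", "blue", "green", "green"]
freq = frequency_desc_index(column)
alpha = alphabetical_index(column)
print(freq)
print(alpha)
```

Here "red" is encoded as 0 under the frequency ordering but as 2 under the alphabetical one. The frequency ordering also depends on the data the indexer is fitted on, so fitting it on the prediction set rather than the training set can change the mapping again.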

What am I doing wrong?

--repositories https://mmlspark.azureedge.net/maven --packages com.microsoft.ml.spark:mmlspark_2.11:1.0.0-rc3

alzio2607, May 18 '22 12:05

@alzio2607 could you please send a notebook that reproduces the issue?

"The predictions on the pandas df should match with the predictions on spark df in mmlSpark." I agree that these should match.

I see spark code above in your example following the line "The code that I am running on pyspark is:", but I'm not sure how you are calling the trained lightgbm spark model on a pandas dataframe. Could you send a full end-to-end repro on a dummy dataset for us to debug? Thank you!

imatiach-msft, May 18 '22 14:05

@imatiach-msft
Here you go. This has the dataset and the notebook to load and predict both ways. https://github.com/alzio2607/mmlspark-dummy

For some reason I can't add the model file to the repo right now.

Here is the model file: https://drive.google.com/file/d/1S784kSoYe6lDXm1ZgvRQlCq94w6HOvDV/view?usp=sharing

Hope this is enough to reproduce the issue.

alzio2607, May 18 '22 16:05