SynapseML
mml spark Incorrect lightgbm predictions
Pretext
I trained the model on a dataset created from a pandas DataFrame, namely:
import lightgbm as lgb

train_dataset = lgb.Dataset(
    df,
    label=df["label"],
    weight=df["weight"].values,
    free_raw_data=True,
)
and then trained it:
model = lgb.train(
    config["hyperparams"],
    train_dataset,
    verbose_eval=0,
    valid_sets=[test_dataset],
)
I then wrote the model string to a file on HDFS, named model.lgb.
Objective
To use this file on HDFS to load with mmlspark and make predictions.
Expectation
The predictions on the pandas df should match with the predictions on spark df in mmlSpark.
What's actually happening is that the predictions don't match.
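To make "don't match" concrete, a comparison along the following lines could be used. This is only a minimal sketch: `compare_predictions`, the tolerance, and the sample numbers are hypothetical and not taken from the actual run, and it assumes both prediction lists are already aligned row by row.

```python
import math


def compare_predictions(local_probs, spark_probs, abs_tol=1e-6):
    """Return the indices of rows whose probabilities differ by more than abs_tol."""
    if len(local_probs) != len(spark_probs):
        raise ValueError("prediction lists have different lengths")
    return [
        i
        for i, (a, b) in enumerate(zip(local_probs, spark_probs))
        if not math.isclose(a, b, rel_tol=0.0, abs_tol=abs_tol)
    ]


# Made-up numbers purely for illustration, not real model output.
local_probs = [0.12, 0.80, 0.55]
spark_probs = [0.12, 0.35, 0.55]
mismatches = compare_predictions(local_probs, spark_probs)
print(mismatches)
```

Spark does not guarantee row order after a collect, so in practice the two sets of predictions would need to be joined on a row identifier before a check like this means anything.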
The code that I am running on pyspark is:
from mmlspark.lightgbm import LightGBMClassifier, LightGBMClassificationModel
from pyspark.ml.feature import StringIndexer, VectorAssembler
model_path = "hdfs://nameservice1/user/admin/model.lgb"
model = LightGBMClassificationModel.loadNativeModelFromFile(model_path)
df = spark.sql("select * from rpm_misc.mml_test")
# ft is the list of feature column names
data = df
features = []
for f in ft:
    # Index each feature column and keep the name of the indexed column
    string_indexer = StringIndexer(inputCol=f, outputCol=f + "_index")
    model_si = string_indexer.fit(data)
    data = model_si.transform(data)
    features.append(f + "_index")
# Assemble the indexed columns into a single feature vector
vector_assembler = VectorAssembler(inputCols=features, outputCol="features")
data = vector_assembler.transform(data)
preds = model.transform(data)
pred_probs = preds.select("probability").rdd.flatMap(lambda x: x).collect()
print(pred_probs)
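One thing to keep in mind when reading this output: for a binary classifier, the Spark ML "probability" column holds one vector per row, [P(class 0), P(class 1)], whereas the native LightGBM predict call returns only P(class 1). The sketch below shows how the collected vectors could be reduced to a comparable list. It uses plain Python lists as stand-ins for the collected vectors, and `positive_class_probs` is a hypothetical helper, not part of mmlspark.

```python
def positive_class_probs(probability_vectors):
    """Pick P(class 1) out of each [P(class 0), P(class 1)] vector."""
    return [float(v[1]) for v in probability_vectors]


# Stand-in for the collected "probability" column; values are made up.
collected = [[0.88, 0.12], [0.20, 0.80]]
print(positive_class_probs(collected))
```

The same indexing should work on the collected Spark vectors themselves, since they support `v[1]`.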
What am I doing wrong?
--repositories https://mmlspark.azureedge.net/maven --packages com.microsoft.ml.spark:mmlspark_2.11:1.0.0-rc3
@alzio2607 could you please send a notebook that reproduces the issue?
"The predictions on the pandas df should match with the predictions on spark df in mmlSpark." I agree that these should match.
I see spark code above in your example following the line "The code that I am running on pyspark is:", but I'm not sure how you are calling the trained lightgbm spark model on a pandas dataframe. Could you send a full end-to-end repro on a dummy dataset for us to debug? Thank you!
@imatiach-msft
Here you go. This has the dataset and the notebook to load and predict both ways.
https://github.com/alzio2607/mmlspark-dummy
For some reason I can't add the model file to the repo right now.
Here is the model file: https://drive.google.com/file/d/1S784kSoYe6lDXm1ZgvRQlCq94w6HOvDV/view?usp=sharing
I hope this is enough to reproduce the issue.