Serialized then deserialized model does not produce the same results (all zeros)
It's not clear to me why this is happening. I was first using Spark 2.4.3 but then switched back to 2.3.3 (as I believe support for 2.4 was still being worked on and may have only just been added). I have attached the data file (public domain). I also checked that GBTClassifier is listed as a supported Spark ML transformer.
Python version: 3.7.3, mleap version: 0.13.0
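For completeness: the script below calls SparkContext.getOrCreate() without showing how the context was configured. A setup along these lines, with the mleap-spark artifact on the classpath, is presumably needed; the package coordinate shown is an assumption (Spark 2.3.x with Scala 2.11), not something taken from the original report.
from pyspark.sql import SparkSession

# Illustrative setup only: attach the mleap-spark artifact so that
# serializeToBundle/deserializeFromBundle have the required JVM classes.
# The coordinate assumes Spark 2.3.x with Scala 2.11; adjust to your environment.
spark = (
    SparkSession.builder
    .appName("mleap-gbt-roundtrip-repro")
    .config("spark.jars.packages", "ml.combust.mleap:mleap-spark_2.11:0.13.0")
    .getOrCreate()
)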
import pandas as pd
import pyspark
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.ml import Pipeline
from pyspark.ml import PipelineModel
from pyspark.ml.recommendation import ALS
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler
from pyspark.ml.classification import GBTClassifier
import mleap
import mleap.pyspark
from mleap.pyspark.spark_support import SimpleSparkSerializer  # importing this dynamically binds serializeToBundle/deserializeFromBundle to the model
dataset_location = "<FIXME>/income.data.4k.txt"
models_folder = "<FIXME>"
df = pd.read_csv(dataset_location)
df.columns = [col.strip() for col in df.columns]
for col in df.columns:
    if str(df.dtypes[col]) == "object":
        df[col] = df[col].apply(lambda x: x.strip() if type(x) is str else x)
df = pd.get_dummies(df)
df = df.drop("yearly-income_<=50K", axis = 1)
target = "yearly-income"
columns = list(df.columns)
columns[columns.index("yearly-income_>50K")] = target
df.columns = columns
sqlCtx = SQLContext(SparkContext.getOrCreate())
sdf = sqlCtx.createDataFrame(df)
assembler = VectorAssembler(inputCols=[col for col in columns if col != target], outputCol="features")
gbt = GBTClassifier(maxIter=10, maxDepth=4, labelCol = target, featuresCol = "features")
pipeline = Pipeline(stages=[assembler, gbt])
model = pipeline.fit(sdf)
results = model.transform(sdf)
# FIXME: Not sure why the DataFrame is required here. It fails if I don't supply it.
# path must be absolute
r = results.toPandas()
r.describe()
r.head()
model.serializeToBundle("jar:file:{}/pyspark.gbt_pipeline.zip".format(models_folder), results)
model = PipelineModel.deserializeFromBundle("jar:file:{}/pyspark.gbt_pipeline.zip".format(models_folder))
results2 = model.transform(sdf)
r2 = results2.toPandas()
r2.describe()
r2.head()
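A quick sanity check to quantify the discrepancy, assuming r and r2 from the script above are still in scope (both toPandas() calls collect from the same in-memory DataFrame, so the row order lines up):
# Compare predictions from the original and the deserialized pipeline.
mismatches = int((r["prediction"] != r2["prediction"]).sum())
print("rows with differing predictions: {} / {}".format(mismatches, len(r)))
print("distinct predictions after round-trip:", sorted(r2["prediction"].unique()))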
The output from the first run, before serialization:
>>> r.describe()
age fnlwgt education-num capital-gain ... native-country_Vietnam native-country_Yugoslavia yearly-income prediction
count 4000.000000 4.000000e+03 4000.00000 4000.000000 ... 4000.000000 4000.000000 4000.000000 4000.000000
mean 38.213750 1.889953e+05 10.07600 1071.823500 ... 0.002500 0.000500 0.242750 0.202000
std 13.764971 1.070024e+05 2.58629 7294.042604 ... 0.049944 0.022358 0.428799 0.401542
min 17.000000 1.228500e+04 1.00000 0.000000 ... 0.000000 0.000000 0.000000 0.000000
25% 27.000000 1.169508e+05 9.00000 0.000000 ... 0.000000 0.000000 0.000000 0.000000
50% 36.000000 1.771715e+05 10.00000 0.000000 ... 0.000000 0.000000 0.000000 0.000000
75% 47.000000 2.366330e+05 13.00000 0.000000 ... 0.000000 0.000000 0.000000 0.000000
max 90.000000 1.226583e+06 16.00000 99999.000000 ... 1.000000 1.000000 1.000000 1.000000
[8 rows x 108 columns]
>>> r.head()
age fnlwgt education-num ... rawPrediction probability prediction
0 53 20438 10 ... [1.1557390749607264, -1.1557390749607264] [0.9098232035096955, 0.09017679649030452] 0.0
1 39 216552 14 ... [-0.34625038604211694, 0.34625038604211694] [0.3334769951336344, 0.6665230048663656] 1.0
2 54 143865 9 ... [1.1837202808156024, -1.1837202808156024] [0.9143105493591535, 0.08568945064084654] 0.0
3 38 172855 13 ... [-1.4172112252841806, 1.4172112252841806] [0.055492148863025934, 0.9445078511369741] 1.0
4 17 198146 7 ... [1.3097197675583097, -1.3097197675583097] [0.932102244506494, 0.06789775549350596] 0.0
The output after deserialization (scroll to the right to see that rawPrediction is [-0.0, 0.0], probability is [0.5, 0.5], and all predictions are 0):
>>> r2.describe()
age fnlwgt education-num capital-gain ... native-country_Vietnam native-country_Yugoslavia yearly-income prediction
count 4000.000000 4.000000e+03 4000.00000 4000.000000 ... 4000.000000 4000.000000 4000.000000 4000.0
mean 38.213750 1.889953e+05 10.07600 1071.823500 ... 0.002500 0.000500 0.242750 0.0
std 13.764971 1.070024e+05 2.58629 7294.042604 ... 0.049944 0.022358 0.428799 0.0
min 17.000000 1.228500e+04 1.00000 0.000000 ... 0.000000 0.000000 0.000000 0.0
25% 27.000000 1.169508e+05 9.00000 0.000000 ... 0.000000 0.000000 0.000000 0.0
50% 36.000000 1.771715e+05 10.00000 0.000000 ... 0.000000 0.000000 0.000000 0.0
75% 47.000000 2.366330e+05 13.00000 0.000000 ... 0.000000 0.000000 0.000000 0.0
max 90.000000 1.226583e+06 16.00000 99999.000000 ... 1.000000 1.000000 1.000000 0.0
[8 rows x 108 columns]
>>> r2.head()
age fnlwgt education-num capital-gain capital-loss ... yearly-income features rawPrediction probability prediction
0 53 20438 10 0 0 ... 0 (53.0, 20438.0, 10.0, 0.0, 0.0, 15.0, 0.0, 0.0... [-0.0, 0.0] [0.5, 0.5] 0.0
1 39 216552 14 0 0 ... 0 (39.0, 216552.0, 14.0, 0.0, 0.0, 40.0, 0.0, 0.... [-0.0, 0.0] [0.5, 0.5] 0.0
2 54 143865 9 0 0 ... 0 (54.0, 143865.0, 9.0, 0.0, 0.0, 35.0, 0.0, 0.0... [-0.0, 0.0] [0.5, 0.5] 0.0
3 38 172855 13 0 1887 ... 1 (38.0, 172855.0, 13.0, 0.0, 1887.0, 40.0, 0.0,... [-0.0, 0.0] [0.5, 0.5] 0.0
4 17 198146 7 0 0 ... 0 (17.0, 198146.0, 7.0, 0.0, 0.0, 16.0, 0.0, 0.0... [-0.0, 0.0] [0.5, 0.5] 0.0
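One way to narrow this down would be to inspect the GBT stage of the deserialized pipeline directly; if the ensemble comes back empty or with zeroed weights, the problem is in deserialization rather than in the transform. A minimal diagnostic sketch, assuming model is the PipelineModel returned by deserializeFromBundle and that the usual Spark ML accessors behave normally on it:
# Diagnostic sketch: inspect the GBT stage of the deserialized pipeline.
gbt_model = model.stages[-1]
print(type(gbt_model))
print("number of trees:", gbt_model.getNumTrees)
print("tree weights:", gbt_model.treeWeights)
print("feature importances:", gbt_model.featureImportances)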
Can anyone answer this question?
@aaditya3 @psc0606 we've made some fixes for deserializing models back into Spark, and we're doing the release of mleap 0.14.0 in the next few days, which I believe should solve this problem. I will keep you posted once the new version is available!
@aaditya3 @psc0606 we released 0.14.0 recently; could you please check whether this is still an issue?
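To try the fix, one could upgrade the Python package and confirm what is installed, for example as below (a rough sketch; pkg_resources is just one way to read the installed version, and the mleap-spark jar on the Spark classpath needs to be bumped to the matching version as well):
# After upgrading with `pip install mleap==0.14.0`, confirm the installed version.
import pkg_resources
print(pkg_resources.get_distribution("mleap").version)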
I would guess this relates to my issue here: https://github.com/combust/mleap/issues/618