mleap icon indicating copy to clipboard operation
mleap copied to clipboard

Serialized then deserialized model does not have the same results (all zero's)

Open aaditya3 opened this issue 5 years ago • 4 comments

It's not clear to me why this is happening. I was first using spark 2.4.3 but then changed back to 2.3.3 (as I believe support for 2.4 was being considered and may have just been added). I have attached the data file (public domain). I also checked that the GBTClassifier is supported from Spark ML.

Python version 3.7.3 mleap version 0.13.0

import pandas as pd
import pyspark
from pyspark.ml import Pipeline
from pyspark.ml import PipelineModel
from pyspark.ml.recommendation import ALS
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler
from pyspark.ml.classification import GBTClassifier
import mleap
import mleap.pyspark
from mleap.pyspark.spark_support import SimpleSparkSerializer # this causes the dynamic binding to model

dataset_location = "<FIXME>/income.data.4k.txt"
models_folder = "<FIXME>"

df = pd.read_csv(dataset_location)
df.columns = [col.strip() for col in df.columns]
for col in df.columns:
	if str(df.dtypes[col]) == "object":
		df[col] = df[col].apply(lambda x: x.strip() if type(x) is str else x)

df = pd.get_dummies(df)

df = df.drop("yearly-income_<=50K", axis = 1)
target = "yearly-income"
columns = list(df.columns)
columns[columns.index("yearly-income_>50K")] = target
df.columns = columns
sqlCtx = SQLContext(SparkContext.getOrCreate())
sdf = sqlCtx.createDataFrame(df)

assembler = VectorAssembler(inputCols=[col for col in columns if col != target], outputCol="features")
gbt = GBTClassifier(maxIter=10, maxDepth=4, labelCol = target, featuresCol = "features")
pipeline = Pipeline(stages=[assembler, gbt])
model = pipeline.fit(sdf)
results = model.transform(sdf)
# FIXME: Not sure why you need the df after. It fails if I dont supply it.
# path must be absolute
r = results.toPandas()
r.describe()
r.head()
model.serializeToBundle("jar:file:{}/pyspark.gbt_pipeline.zip".format(models_folder), results)
model = PipelineModel.deserializeFromBundle("jar:file:{}/pyspark.gbt_pipeline.zip".format(models_folder))

results2 = model.transform(sdf)
r2 = results2.toPandas()
r2.describe()
r2.head()

The output from the first time:

>>> r.describe()
               age        fnlwgt  education-num  capital-gain  ...  native-country_Vietnam  native-country_Yugoslavia  yearly-income   prediction
count  4000.000000  4.000000e+03     4000.00000   4000.000000  ...             4000.000000                4000.000000    4000.000000  4000.000000
mean     38.213750  1.889953e+05       10.07600   1071.823500  ...                0.002500                   0.000500       0.242750     0.202000
std      13.764971  1.070024e+05        2.58629   7294.042604  ...                0.049944                   0.022358       0.428799     0.401542
min      17.000000  1.228500e+04        1.00000      0.000000  ...                0.000000                   0.000000       0.000000     0.000000
25%      27.000000  1.169508e+05        9.00000      0.000000  ...                0.000000                   0.000000       0.000000     0.000000
50%      36.000000  1.771715e+05       10.00000      0.000000  ...                0.000000                   0.000000       0.000000     0.000000
75%      47.000000  2.366330e+05       13.00000      0.000000  ...                0.000000                   0.000000       0.000000     0.000000
max      90.000000  1.226583e+06       16.00000  99999.000000  ...                1.000000                   1.000000       1.000000     1.000000

[8 rows x 108 columns]
>>> r.head()
   age  fnlwgt  education-num  ...                                rawPrediction                                 probability  prediction
0   53   20438             10  ...    [1.1557390749607264, -1.1557390749607264]   [0.9098232035096955, 0.09017679649030452]         0.0
1   39  216552             14  ...  [-0.34625038604211694, 0.34625038604211694]    [0.3334769951336344, 0.6665230048663656]         1.0
2   54  143865              9  ...    [1.1837202808156024, -1.1837202808156024]   [0.9143105493591535, 0.08568945064084654]         0.0
3   38  172855             13  ...    [-1.4172112252841806, 1.4172112252841806]  [0.055492148863025934, 0.9445078511369741]         1.0
4   17  198146              7  ...    [1.3097197675583097, -1.3097197675583097]    [0.932102244506494, 0.06789775549350596]         0.0

The second time (scroll to the right to see that all the predictions are 0):

>>> r2.describe()
               age        fnlwgt  education-num  capital-gain  ...  native-country_Vietnam  native-country_Yugoslavia  yearly-income  prediction
count  4000.000000  4.000000e+03     4000.00000   4000.000000  ...             4000.000000                4000.000000    4000.000000      4000.0
mean     38.213750  1.889953e+05       10.07600   1071.823500  ...                0.002500                   0.000500       0.242750         0.0
std      13.764971  1.070024e+05        2.58629   7294.042604  ...                0.049944                   0.022358       0.428799         0.0
min      17.000000  1.228500e+04        1.00000      0.000000  ...                0.000000                   0.000000       0.000000         0.0
25%      27.000000  1.169508e+05        9.00000      0.000000  ...                0.000000                   0.000000       0.000000         0.0
50%      36.000000  1.771715e+05       10.00000      0.000000  ...                0.000000                   0.000000       0.000000         0.0
75%      47.000000  2.366330e+05       13.00000      0.000000  ...                0.000000                   0.000000       0.000000         0.0
max      90.000000  1.226583e+06       16.00000  99999.000000  ...                1.000000                   1.000000       1.000000         0.0

[8 rows x 108 columns]
>>> r2.head()
   age  fnlwgt  education-num  capital-gain  capital-loss  ...  yearly-income                                           features  rawPrediction  probability  prediction
0   53   20438             10             0             0  ...              0  (53.0, 20438.0, 10.0, 0.0, 0.0, 15.0, 0.0, 0.0...    [-0.0, 0.0]   [0.5, 0.5]         0.0
1   39  216552             14             0             0  ...              0  (39.0, 216552.0, 14.0, 0.0, 0.0, 40.0, 0.0, 0....    [-0.0, 0.0]   [0.5, 0.5]         0.0
2   54  143865              9             0             0  ...              0  (54.0, 143865.0, 9.0, 0.0, 0.0, 35.0, 0.0, 0.0...    [-0.0, 0.0]   [0.5, 0.5]         0.0
3   38  172855             13             0          1887  ...              1  (38.0, 172855.0, 13.0, 0.0, 1887.0, 40.0, 0.0,...    [-0.0, 0.0]   [0.5, 0.5]         0.0
4   17  198146              7             0             0  ...              0  (17.0, 198146.0, 7.0, 0.0, 0.0, 16.0, 0.0, 0.0...    [-0.0, 0.0]   [0.5, 0.5]         0.0

income.data.4k.txt

aaditya3 avatar Jun 01 '19 23:06 aaditya3

Anyone can answer this question?

psc0606 avatar Jun 19 '19 07:06 psc0606

@aaditya3 @psc0606 we've done some fixes for deserializing models back into Spark, we're doing the release for mleap 0.14.0 the next few days, which I believe would solve this problem. I will keep you posted once the new version is available!

ancasarb avatar Jun 21 '19 20:06 ancasarb

@aaditya3 @psc0606 we've released 0.14.0 recently, could you please check if it's still an issue?

ancasarb avatar Jul 11 '19 09:07 ancasarb

I would guess this relates to my issue here: https://github.com/combust/mleap/issues/618

Ben-Epstein avatar Mar 16 '20 19:03 Ben-Epstein