
Impossible to deserialize with PySpark a bundle written with Scikit-Learn: "bundle.json" not found

Open NastasiaSaby opened this issue 5 years ago • 2 comments

I serialize a model with Scikit-Learn:

# Generate data
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(100, 5), columns=['a', 'b', 'c', 'd', 'e'])
df["y"] = (df['a'] > 0.5).astype(int)
df.head()

from mleap.sklearn.ensemble.forest import RandomForestClassifier

forestModel = RandomForestClassifier()
forestModel.mlinit(input_features='a',
                   feature_names='a',
                   prediction_column='e_binary')


forestModel.fit(df[['a']], df[['y']])

forestModel.serialize_to_bundle("/dbfs/FileStore/tables/mleaptestmodelforest", "model.json")

When I try to read it with Pyspark:

from pyspark.ml.classification import RandomForestClassificationModel

model = RandomForestClassificationModel.deserializeFromBundle("file:/dbfs/FileStore/tables/mleaptestmodelforest")

I have this error: java.nio.file.NoSuchFileException: /dbfs/FileStore/tables/mleaptestmodelforest/bundle.json

I have no "bundle.json".
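To confirm what the serializer actually wrote, the output directory can be listed. This is only a minimal diagnostic sketch: the helper names `find_bundle_files` and `has_bundle_json` are hypothetical (not part of MLeap), and the path is the one used in the code above.

```python
from pathlib import Path


def find_bundle_files(bundle_dir):
    """Return the sorted relative paths of every file under bundle_dir."""
    root = Path(bundle_dir)
    if not root.is_dir():
        return []
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )


def has_bundle_json(bundle_dir):
    """Check whether a top-level bundle.json exists in bundle_dir."""
    return (Path(bundle_dir) / "bundle.json").is_file()


if __name__ == "__main__":
    # Path taken from the serialize_to_bundle call above.
    path = "/dbfs/FileStore/tables/mleaptestmodelforest"
    for name in find_bundle_files(path):
        print(name)
    print("bundle.json present:", has_bundle_json(path))
```

This only reports which files exist under the directory; it does not validate that they form a bundle PySpark can load.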

Could you help me, please? Is it really possible to serialize a model with Scikit-Learn and deserialize it with PySpark?

NastasiaSaby, Jun 03 '20 12:06

Could you try using a string indexer before sending your inputs to the RF model?

preet3loq, Jul 04 '20 16:07

Thanks for your answer, but I don't understand. I have no strings in my features or in my label.

        a         b         c         d         e  y
-0.834754  0.022853 -0.409484 -0.234555  1.459009  0
 0.701790 -1.227054  0.318048 -0.427834  0.181128  1
 2.477625 -0.337587  0.509159  1.733497 -1.133314  1
-1.192845 -0.314039  0.857098 -0.772631  0.999143  0
-1.163715  0.640511  0.546631  1.823843 -0.176281  0
df.dtypes
Out[3]: a    float64
b    float64
c    float64
d    float64
e    float64
y      int64
dtype: object

NastasiaSaby, Jul 06 '20 06:07