Cannot deserialize a bundle written with Scikit-Learn in PySpark: "bundle.json" not found
I serialize a model with Scikit-Learn:
# Generate data: five random features and a binary label derived from column 'a'
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(100, 5), columns=['a', 'b', 'c', 'd', 'e'])
df["y"] = (df['a'] > 0.5).astype(int)
df.head()
# Train an MLeap-enabled random forest on the single feature 'a'
from mleap.sklearn.ensemble.forest import RandomForestClassifier
forestModel = RandomForestClassifier()
forestModel.mlinit(input_features='a',
                   feature_names='a',
                   prediction_column='e_binary')
forestModel.fit(df[['a']], df[['y']])
# Serialize the fitted model to the DBFS directory
forestModel.serialize_to_bundle("/dbfs/FileStore/tables/mleaptestmodelforest", "model.json")
When I try to read it with PySpark:
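# These MLeap imports add deserializeFromBundle to Spark transformers
import mleap.pyspark
from mleap.pyspark.spark_support import SimpleSparkSerializer
from pyspark.ml.classification import RandomForestClassificationModel
model = RandomForestClassificationModel.deserializeFromBundle("file:/dbfs/FileStore/tables/mleaptestmodelforest")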
I get this error:
java.nio.file.NoSuchFileException: /dbfs/FileStore/tables/mleaptestmodelforest/bundle.json
I have no "bundle.json".
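One way to see what the Scikit-Learn serializer actually wrote is to walk the output directory and list every file in it. The following is only a minimal sketch using the standard library: the helper names `list_bundle_files` and `has_bundle_json` are hypothetical, and the directory is assumed to be the one passed to `serialize_to_bundle` above.

```python
import os


def list_bundle_files(bundle_dir):
    """Return all file paths under bundle_dir, relative to it, sorted."""
    found = []
    for root, _dirs, files in os.walk(bundle_dir):
        for name in files:
            full_path = os.path.join(root, name)
            found.append(os.path.relpath(full_path, bundle_dir))
    return sorted(found)


def has_bundle_json(bundle_dir):
    """True if a bundle.json file sits directly inside bundle_dir."""
    return os.path.isfile(os.path.join(bundle_dir, "bundle.json"))


if __name__ == "__main__":
    # Assumption: same directory as in the serialize_to_bundle call above.
    bundle_dir = "/dbfs/FileStore/tables/mleaptestmodelforest"
    for path in list_bundle_files(bundle_dir):
        print(path)
    print("bundle.json at top level:", has_bundle_json(bundle_dir))
```

The listing shows whether a "bundle.json" exists anywhere below the directory, for example inside a subfolder, rather than at the top level where the reader looks for it.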
Could you help me please? Is it really possible to serialize a model with Scikit-Learn and deserialize it with PySpark?
Could you try using a string indexer before sending your inputs to the RF model?
Thanks for your answer, but I don't understand. I have no strings in my features or in my label.
| a | b | c | d | e | y |
|---|---|---|---|---|---|
| -0.834754 | 0.022853 | -0.409484 | -0.234555 | 1.459009 | 0 |
| 0.701790 | -1.227054 | 0.318048 | -0.427834 | 0.181128 | 1 |
| 2.477625 | -0.337587 | 0.509159 | 1.733497 | -1.133314 | 1 |
| -1.192845 | -0.314039 | 0.857098 | -0.772631 | 0.999143 | 0 |
| -1.163715 | 0.640511 | 0.546631 | 1.823843 | -0.176281 | 0 |
df.dtypes
Out[3]:
a    float64
b    float64
c    float64
d    float64
e    float64
y      int64
dtype: object
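The same check can be done programmatically instead of reading the dtypes by eye. This is a minimal sketch, assuming a frame built the same way as in the original post; the helper name `non_numeric_columns` is hypothetical.

```python
import numpy as np
import pandas as pd


def non_numeric_columns(frame):
    """Return the names of columns whose dtype is not numeric.

    Object (string) and boolean columns are reported; int and float are not.
    """
    return list(frame.select_dtypes(exclude=[np.number]).columns)


# Same construction as in the original post
df = pd.DataFrame(np.random.randn(100, 5), columns=['a', 'b', 'c', 'd', 'e'])
df["y"] = (df['a'] > 0.5).astype(int)

print(non_numeric_columns(df))  # [] because every column is float or int
```

An empty list confirms there is nothing for a string indexer to encode in this data.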