SynapseML
How to install SynapseML on EMR from the Maven repository into /usr/lib/spark/jars
SynapseML version
1.0.10
System information
- Language version (e.g. python 3.8, scala 2.12): python 3.9
- Spark Version (e.g. 3.2.3): 3.5.1
- Spark Platform (e.g. Synapse, Databricks): AWS EMR Release 7.3.1
Describe the problem
I would like to install SynapseML on EMR for PySpark. Running the configuration below in a Jupyter notebook works:
%%configure -f
{
    "name": "synapseml",
    "conf": {
        "spark.jars.packages": "com.microsoft.azure:synapseml_2.12:1.0.9-spark3.5",
        "spark.jars.repositories": "https://mmlspark.azureedge.net/maven"
    }
}
In production, however, we don't use Jupyter notebooks. Instead, we downloaded the corresponding jars from the Maven repository and copied them to /usr/lib/spark/jars on EMR. This does not work and fails with the error: com.microsoft.azure.synapse.ml.isolationforest.IsolationForest does not exist in the JVM
Does anyone know the root cause of this? Thank you.
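For reference, the two settings from the notebook cell can also be passed outside a notebook through spark-submit's `--conf key=value` option. The sketch below only assembles such a command line from the same values; the script name `job.py` and the helper `build_spark_submit_args` are hypothetical, and whether this route suits a given EMR production setup is an assumption, not something verified here.

```python
import shlex

# Settings copied from the %%configure cell in this issue.
SYNAPSEML_CONF = {
    "spark.jars.packages": "com.microsoft.azure:synapseml_2.12:1.0.9-spark3.5",
    "spark.jars.repositories": "https://mmlspark.azureedge.net/maven",
}


def build_spark_submit_args(script_path, conf):
    """Return a spark-submit argument list that passes each setting via --conf."""
    args = ["spark-submit"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(script_path)
    return args


# "job.py" is a hypothetical script name used only for illustration.
command = build_spark_submit_args("job.py", SYNAPSEML_CONF)
print(shlex.join(command))
```

Unlike copying a single jar by hand, `spark.jars.packages` takes Maven coordinates, so Spark also resolves the package's transitive dependencies.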
Code to reproduce issue
from synapse.ml.isolationforest import IsolationForest

# print(type(IsolationForest))

# Hyperparameters for the isolation forest
hyper_params = {
    'n_estimators': 100,
    'max_samples': 32,
    'max_features': 1,
    'bootstrap': False,
    'contamination': 0.1,
}

# Constructing the estimator is the step that raises the Py4JError
isolation_forest_model = (
    IsolationForest()
    .setNumEstimators(hyper_params["n_estimators"])
    .setBootstrap(hyper_params["bootstrap"])
    .setMaxSamples(hyper_params["max_samples"])
    .setMaxFeatures(hyper_params["max_features"])
    .setFeaturesCol("features")
    .setPredictionCol("predictedLabel")
    .setScoreCol("outlierScore")
    .setContamination(hyper_params["contamination"])
    .setContaminationError(0.01 * hyper_params["contamination"])
)
Other info / logs
An error was encountered:
com.microsoft.azure.synapse.ml.isolationforest.IsolationForest does not exist in the JVM
Traceback (most recent call last):
  File "/mnt1/yarn/usercache/livy/appcache/application_1742368398137_0002/container_1742368398137_0002_01_000001/pyspark.zip/pyspark/__init__.py", line 139, in wrapper
    return func(self, **kwargs)
  File "/mnt1/yarn/usercache/livy/appcache/application_1742368398137_0002/container_1742368398137_0002_01_000001/com.microsoft.azure_synapseml-core_2.12-1.0.9-spark3.5.jar/synapse/ml/isolationforest/IsolationForest.py", line 78, in __init__
    self._java_obj = self._new_java_obj("com.microsoft.azure.synapse.ml.isolationforest.IsolationForest", self.uid)
  File "/mnt1/yarn/usercache/livy/appcache/application_1742368398137_0002/container_1742368398137_0002_01_000001/pyspark.zip/pyspark/ml/wrapper.py", line 84, in _new_java_obj
    java_obj = getattr(java_obj, name)
  File "/mnt1/yarn/usercache/livy/appcache/application_1742368398137_0002/container_1742368398137_0002_01_000001/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1664, in __getattr__
    raise Py4JError("{0} does not exist in the JVM".format(new_fqn))
py4j.protocol.Py4JError: com.microsoft.azure.synapse.ml.isolationforest.IsolationForest does not exist in the JVM
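The traceback shows the Python wrapper being loaded from the synapseml-core jar, while the error concerns the JVM class of the same name. One way to narrow this down is to check which jars in the target directory actually contain that class file. The following is a minimal sketch using only the standard library; the helper names `class_to_entry` and `find_jars_with_class` are made up for this example, and a match only shows that a jar in the directory contains the class, not that the running JVM has loaded it.

```python
import zipfile
from pathlib import Path


def class_to_entry(class_name):
    """Convert a fully qualified JVM class name to its path inside a jar."""
    return class_name.replace(".", "/") + ".class"


def find_jars_with_class(jar_dir, class_name):
    """Return the names of jars in jar_dir that contain the given class file."""
    entry = class_to_entry(class_name)
    matches = []
    for jar_path in sorted(Path(jar_dir).glob("*.jar")):
        try:
            with zipfile.ZipFile(jar_path) as jar:
                if entry in jar.namelist():
                    matches.append(jar_path.name)
        except zipfile.BadZipFile:
            # Skip files that are not valid archives (e.g. a truncated download).
            continue
    return matches


if __name__ == "__main__":
    target = "com.microsoft.azure.synapse.ml.isolationforest.IsolationForest"
    # Directory taken from this issue; an empty list means no jar there has the class.
    print(find_jars_with_class("/usr/lib/spark/jars", target))
```

If the list comes back empty, the jar holding the class was likely never copied. If it is not empty, the next things to check would be the classpath the driver actually uses and whether the jar's own dependencies were copied as well.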
What component(s) does this bug affect?
- [ ] area/cognitive: Cognitive project
- [ ] area/core: Core project
- [ ] area/deep-learning: DeepLearning project
- [ ] area/lightgbm: Lightgbm project
- [ ] area/opencv: Opencv project
- [ ] area/vw: VW project
- [ ] area/website: Website
- [ ] area/build: Project build system
- [ ] area/notebooks: Samples under notebooks folder
- [ ] area/docker: Docker usage
- [x] area/models: models related issue
What language(s) does this bug affect?
- [ ] language/scala: Scala source code
- [x] language/python: Pyspark APIs
- [ ] language/r: R APIs
- [ ] language/csharp: .NET APIs
- [ ] language/new: Proposals for new client languages
What integration(s) does this bug affect?
- [ ] integrations/synapse: Azure Synapse integrations
- [ ] integrations/azureml: Azure ML integrations
- [ ] integrations/databricks: Databricks integrations