How to install SynapseML on EMR from the Maven repository into /usr/lib/spark/jars

Open hueiyuan opened this issue 8 months ago • 0 comments

SynapseML version

1.0.10

System information

  • Language version (e.g. python 3.8, scala 2.12): python 3.9
  • Spark Version (e.g. 3.2.3): 3.5.1
  • Spark Platform (e.g. Synapse, Databricks): AWS EMR Release 7.3.1

Describe the problem

I would like to install SynapseML on EMR for use with PySpark. Running the following configuration in a Jupyter notebook works:

%%configure -f
{
  "name": "synapseml",
  "conf": {
      "spark.jars.packages": "com.microsoft.azure:synapseml_2.12:1.0.9-spark3.5",
      "spark.jars.repositories": "https://mmlspark.azureedge.net/maven"
  }
}

But in production we don't use Jupyter notebooks. So we first downloaded the corresponding jars from the Maven repository and copied them to /usr/lib/spark/jars on EMR. That does not work; it fails with: com.microsoft.azure.synapse.ml.isolationforest.IsolationForest does not exist in the JVM

Does anyone know the root cause of this? Thank you.
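For reference, one way the notebook settings might be carried into a non-notebook job is to pass the same two properties to `spark-submit` as `--conf` entries, so that Spark resolves the package together with its transitive dependencies. The sketch below only builds the argument list and does not launch anything. The helper name `build_spark_submit_args` and the script name `job.py` are hypothetical, and it assumes the cluster can reach the repository at submit time.

```python
from typing import Dict, List

# Same two properties as in the %%configure cell above.
SYNAPSEML_CONF: Dict[str, str] = {
    "spark.jars.packages": "com.microsoft.azure:synapseml_2.12:1.0.9-spark3.5",
    "spark.jars.repositories": "https://mmlspark.azureedge.net/maven",
}


def build_spark_submit_args(conf: Dict[str, str], script: str) -> List[str]:
    """Build a spark-submit argument list (hypothetical helper).

    Each property is passed as `--conf key=value`; the script path goes last.
    """
    args = ["spark-submit"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(script)
    return args


if __name__ == "__main__":
    # "job.py" is a placeholder for the real PySpark entry point.
    print(" ".join(build_spark_submit_args(SYNAPSEML_CONF, "job.py")))
```

The resulting list can be handed to `subprocess.run` or pasted into an EMR step definition; whether this is acceptable in a locked-down production network is a separate question.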

Code to reproduce issue

from synapse.ml.isolationforest import IsolationForest

# print(type(IsolationForest))
hyper_params = {
    'n_estimators': 100,
    'max_samples': 32,
    'max_features': 1,
    'bootstrap': False,
    'contamination': 0.1,    
}

isolation_forest_model = (
    IsolationForest()
    .setNumEstimators(hyper_params["n_estimators"])
    .setBootstrap(hyper_params["bootstrap"])
    .setMaxSamples(hyper_params["max_samples"])
    .setMaxFeatures(hyper_params["max_features"])
    .setFeaturesCol("features")
    .setPredictionCol("predictedLabel")
    .setScoreCol("outlierScore")
    .setContamination(hyper_params["contamination"])
    .setContaminationError(0.01 * hyper_params["contamination"])
)

Other info / logs

An error was encountered:
com.microsoft.azure.synapse.ml.isolationforest.IsolationForest does not exist in the JVM
Traceback (most recent call last):
  File "/mnt1/yarn/usercache/livy/appcache/application_1742368398137_0002/container_1742368398137_0002_01_000001/pyspark.zip/pyspark/__init__.py", line 139, in wrapper
    return func(self, **kwargs)
  File "/mnt1/yarn/usercache/livy/appcache/application_1742368398137_0002/container_1742368398137_0002_01_000001/com.microsoft.azure_synapseml-core_2.12-1.0.9-spark3.5.jar/synapse/ml/isolationforest/IsolationForest.py", line 78, in __init__
    self._java_obj = self._new_java_obj("com.microsoft.azure.synapse.ml.isolationforest.IsolationForest", self.uid)
  File "/mnt1/yarn/usercache/livy/appcache/application_1742368398137_0002/container_1742368398137_0002_01_000001/pyspark.zip/pyspark/ml/wrapper.py", line 84, in _new_java_obj
    java_obj = getattr(java_obj, name)
  File "/mnt1/yarn/usercache/livy/appcache/application_1742368398137_0002/container_1742368398137_0002_01_000001/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1664, in __getattr__
    raise Py4JError("{0} does not exist in the JVM".format(new_fqn))
py4j.protocol.Py4JError: com.microsoft.azure.synapse.ml.isolationforest.IsolationForest does not exist in the JVM
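The traceback shows the Python wrapper being loaded from a jar while the JVM lookup of the class fails, so one check worth running is whether any jar in the target directory actually contains the class file. The following is a minimal diagnostic sketch using only the standard library; the function name `jars_containing_class` is hypothetical. It inspects jar contents only and says nothing about whether the driver's classpath includes those jars or whether the class's own dependencies are present.

```python
import zipfile
from pathlib import Path
from typing import List


def jars_containing_class(jar_dir: str, class_name: str) -> List[str]:
    """Return the names of jars in jar_dir that contain class_name.

    class_name is a fully qualified JVM class name, for example
    "com.microsoft.azure.synapse.ml.isolationforest.IsolationForest".
    """
    # A class a.b.C is stored in a jar as the entry a/b/C.class.
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for jar in sorted(Path(jar_dir).glob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as zf:
                if entry in zf.namelist():
                    hits.append(jar.name)
        except zipfile.BadZipFile:
            # Skip truncated or corrupt downloads instead of failing.
            continue
    return hits


if __name__ == "__main__":
    found = jars_containing_class(
        "/usr/lib/spark/jars",
        "com.microsoft.azure.synapse.ml.isolationforest.IsolationForest",
    )
    print(found or "class not found in any jar")
```

An empty result would suggest the copied jars are incomplete; a non-empty result would point toward the classpath or missing transitive dependencies instead.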

What component(s) does this bug affect?

  • [ ] area/cognitive: Cognitive project
  • [ ] area/core: Core project
  • [ ] area/deep-learning: DeepLearning project
  • [ ] area/lightgbm: Lightgbm project
  • [ ] area/opencv: Opencv project
  • [ ] area/vw: VW project
  • [ ] area/website: Website
  • [ ] area/build: Project build system
  • [ ] area/notebooks: Samples under notebooks folder
  • [ ] area/docker: Docker usage
  • [x] area/models: models related issue

What language(s) does this bug affect?

  • [ ] language/scala: Scala source code
  • [x] language/python: Pyspark APIs
  • [ ] language/r: R APIs
  • [ ] language/csharp: .NET APIs
  • [ ] language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • [ ] integrations/synapse: Azure Synapse integrations
  • [ ] integrations/azureml: Azure ML integrations
  • [ ] integrations/databricks: Databricks integrations

hueiyuan, Mar 19 '25 07:03