
init_spark() does not work

Hoeze opened this issue 2 years ago • 3 comments

I tried with both raydp and raydp-nightly. I'm using pyspark=3.2.1 with ray-core=1.12.1 from conda-forge.
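For reference, the exact installed versions can be confirmed with a small standard-library snippet (nothing RayDP-specific assumed; importlib.metadata is available on Python 3.8+):

import importlib.metadata as md

# Print installed versions of the packages mentioned in this report.
# (Query "raydp-nightly" instead if the nightly wheel is installed.)
for pkg in ("pyspark", "ray", "raydp"):
    print(pkg, md.version(pkg))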

In [3]: import raydp

In [4]: spark = raydp.init_spark("custom_install_test", 1, 1, "500 M")
Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/impl/StaticLoggerBinder
        at org.apache.spark.internal.Logging$.org$apache$spark$internal$Logging$$isLog4j12(Logging.scala:222)
        at org.apache.spark.internal.Logging.initializeLogging(Logging.scala:127)
        at org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:111)
        at org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:105)
        at org.apache.spark.deploy.raydp.AppMasterEntryPoint$.initializeLogIfNecessary(AppMasterEntryPoint.scala:35)
        at org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:102)
        at org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:101)
        at org.apache.spark.deploy.raydp.AppMasterEntryPoint$.initializeLogIfNecessary(AppMasterEntryPoint.scala:35)
        at org.apache.spark.deploy.raydp.AppMasterEntryPoint$.<init>(AppMasterEntryPoint.scala:36)
        at org.apache.spark.deploy.raydp.AppMasterEntryPoint$.<clinit>(AppMasterEntryPoint.scala)
        at org.apache.spark.deploy.raydp.AppMasterEntryPoint.main(AppMasterEntryPoint.scala)
Caused by: java.lang.ClassNotFoundException: org.slf4j.impl.StaticLoggerBinder
        at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
        at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
        at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
        ... 11 more
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
Input In [4], in <module>
----> 1 spark = raydp.init_spark("custom_install_test", 1, 1, "500 M")

File <conda_env>/lib/python3.9/site-packages/raydp/context.py:126, in init_spark(app_name, num_executors, executor_cores, executor_memory, configs)
    123 try:
    124     _global_spark_context = _SparkContext(
    125         app_name, num_executors, executor_cores, executor_memory, configs)
--> 126     return _global_spark_context.get_or_create_session()
    127 except:
    128     _global_spark_context = None

File <conda_env>/lib/python3.9/site-packages/raydp/context.py:70, in _SparkContext.get_or_create_session(self)
     68     return self._spark_session
     69 self.handle = RayDPConversionHelper.options(name=RAYDP_OBJ_HOLDER_NAME).remote()
---> 70 spark_cluster = self._get_or_create_spark_cluster()
     71 self._spark_session = spark_cluster.get_spark_session(
     72     self._app_name,
     73     self._num_executors,
     74     self._executor_cores,
     75     self._executor_memory,
     76     self._configs)
     77 return self._spark_session

File <conda_env>/lib/python3.9/site-packages/raydp/context.py:63, in _SparkContext._get_or_create_spark_cluster(self)
     61 if self._spark_cluster is not None:
     62     return self._spark_cluster
---> 63 self._spark_cluster = SparkCluster(self._configs)
     64 return self._spark_cluster

File <conda_env>/lib/python3.9/site-packages/raydp/spark/ray_cluster.py:34, in SparkCluster.__init__(self, configs)
     32 self._app_master_bridge = None
     33 self._configs = configs
---> 34 self._set_up_master(None, None)
     35 self._spark_session: SparkSession = None

File <conda_env>/lib/python3.9/site-packages/raydp/spark/ray_cluster.py:40, in SparkCluster._set_up_master(self, resources, kwargs)
     37 def _set_up_master(self, resources: Dict[str, float], kwargs: Dict[Any, Any]):
     38     # TODO: specify the app master resource
     39     self._app_master_bridge = RayClusterMaster(self._configs)
---> 40     self._app_master_bridge.start_up()

File <conda_env>/lib/python3.9/site-packages/raydp/spark/ray_cluster_master.py:54, in RayClusterMaster.start_up(self, popen_kwargs)
     52     return
     53 extra_classpath = os.pathsep.join(self._prepare_jvm_classpath())
---> 54 self._gateway = self._launch_gateway(extra_classpath, popen_kwargs)
     55 self._app_master_java_bridge = self._gateway.entry_point.getAppMasterBridge()
     56 self._set_properties()

File <conda_env>/lib/python3.9/site-packages/raydp/spark/ray_cluster_master.py:118, in RayClusterMaster._launch_gateway(self, class_path, popen_kwargs)
    115     time.sleep(0.1)
    117 if not os.path.isfile(conn_info_file):
--> 118     raise Exception("Java gateway process exited before sending its port number")
    120 with open(conn_info_file, "rb") as info:
    121     length = info.read(4)

Exception: Java gateway process exited before sending its port number
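The immediate failure is that org.slf4j.impl.StaticLoggerBinder is missing from the classpath the RayDP app-master JVM is launched with. A hedged way to check whether the installed pyspark actually ships the slf4j jars (this assumes the PyPI-style layout, where pyspark bundles its jars under pyspark/jars; the conda-forge package may lay them out differently):

import glob
import os
import pyspark

# List any slf4j jars bundled with the installed pyspark.
jars_dir = os.path.join(pyspark.__path__[0], "jars")
print(sorted(os.path.basename(p)
             for p in glob.glob(os.path.join(jars_dir, "*slf4j*"))))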

Hoeze · Jul 20 '22 20:07

Update: using raydp-nightly-22.7.18.dev1 with raydp.executor.extraClassPath pointed at pyspark's bundled jars, the Java gateway now starts, but the Spark master actor dies on a different missing class:

Python 3.9.9 | packaged by conda-forge | (main, Dec 20 2021, 02:41:03) 
Type 'copyright', 'credits' or 'license' for more information
IPython 8.0.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import raydp
   ...: import pyspark
   ...: import os
   ...: 
   ...: spark = raydp.init_spark(
   ...:     app_name="raydp",
   ...:     num_executors=1,
   ...:     executor_cores=4,
   ...:     executor_memory="16GB",
   ...:     configs={"raydp.executor.extraClassPath": pyspark.__path__[0] + "/jars/*"},
   ...: )
(RayDPSparkMaster pid=1940981) Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
(RayDPSparkMaster pid=1940981) Setting default log level to "WARN".
(RayDPSparkMaster pid=1940981) To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2022-07-20 22:30:18,374 WARNING worker.py:1382 -- Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 656, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 697, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 663, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 667, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 614, in ray._raylet.execute_task.function_executor
  File "<conda_env>/lib/python3.9/site-packages/ray/_private/function_manager.py", line 701, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "<conda_env>/lib/python3.9/site-packages/ray/util/tracing/tracing_helper.py", line 462, in _resume_span
    return method(self, *_args, **_kwargs)
  File "<conda_env>/lib/python3.9/site-packages/raydp/spark/ray_cluster_master.py", line 53, in start_up
    self._set_properties()
  File "<conda_env>/lib/python3.9/site-packages/ray/util/tracing/tracing_helper.py", line 462, in _resume_span
    return method(self, *_args, **_kwargs)
  File "<conda_env>/lib/python3.9/site-packages/raydp/spark/ray_cluster_master.py", line 168, in _set_properties
    self._app_master_java_bridge.setProperties(jvm_properties)
  File "<conda_env>/lib/python3.9/site-packages/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "<conda_env>/lib/python3.9/site-packages/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o0.setProperties.
: java.lang.NoClassDefFoundError: io/ray/runtime/config/RayConfig
        at org.apache.spark.deploy.raydp.AppMasterJavaBridge.setProperties(AppMasterJavaBridge.scala:35)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.ClassNotFoundException: io.ray.runtime.config.RayConfig
        at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
        at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
        at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
        ... 12 more


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 797, in ray._raylet.task_execution_handler
  File "python/ray/_raylet.pyx", line 616, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 752, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 2015, in ray._raylet.CoreWorker.store_task_outputs
  File "<conda_env>/lib/python3.9/site-packages/ray/serialization.py", line 413, in serialize
    return self._serialize_to_msgpack(value)
  File "<conda_env>/lib/python3.9/site-packages/ray/serialization.py", line 368, in _serialize_to_msgpack
    value = value.to_bytes()
  File "<conda_env>/lib/python3.9/site-packages/ray/exceptions.py", line 24, in to_bytes
    serialized_exception=pickle.dumps(self),
  File "<conda_env>/lib/python3.9/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 73, in dumps
    cp.dump(obj)
  File "<conda_env>/lib/python3.9/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 620, in dump
    return Pickler.dump(self, obj)
TypeError: cannot pickle '_thread.RLock' object
An unexpected internal error occurred while the worker was executing a task.
2022-07-20 22:30:18,381 WARNING worker.py:1382 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff71a6b004bef656ea578e102101000000 Worker ID: 03676556aa61f6a3d97067081653778982520665c1914c4cc2468781 Node ID: 52e3b7653152c55772e0ca00204028847292bec0c6e44c80f71097dd Worker IP address: 192.168.16.24 Worker port: 38195 Worker PID: 1940981
(RayDPSparkMaster pid=1940981) 2022-07-20 22:30:18,373  ERROR worker.py:449 -- SystemExit was raised from the worker.
(RayDPSparkMaster pid=1940981) [the actor's stderr repeats the same Py4JJavaError / NoClassDefFoundError: io/ray/runtime/config/RayConfig traceback and the TypeError: cannot pickle '_thread.RLock' object shown above]
(RayDPSparkMaster pid=1940981) 
(RayDPSparkMaster pid=1940981) During handling of the above exception, another exception occurred:
(RayDPSparkMaster pid=1940981) 
(RayDPSparkMaster pid=1940981) Traceback (most recent call last):
(RayDPSparkMaster pid=1940981)   File "python/ray/_raylet.pyx", line 826, in ray._raylet.task_execution_handler
(RayDPSparkMaster pid=1940981) SystemExit
---------------------------------------------------------------------------
RayActorError                             Traceback (most recent call last)
Input In [1], in <module>
      2 import pyspark
      3 import os
----> 5 spark = raydp.init_spark(
      6     app_name="raydp",
      7     num_executors=1,
      8     executor_cores=4,
      9     executor_memory="16GB",
     10     configs={"raydp.executor.extraClassPath": pyspark.__path__[0] + "/jars/*"},
     11 )

File <conda_env>/lib/python3.9/site-packages/raydp/context.py:200, in init_spark(app_name, num_executors, executor_cores, executor_memory, enable_hive, placement_group_strategy, placement_group, placement_group_bundle_indexes, configs)
    193 try:
    194     _global_spark_context = _SparkContext(
    195         app_name, num_executors, executor_cores, executor_memory, enable_hive,
    196         placement_group_strategy,
    197         placement_group,
    198         placement_group_bundle_indexes,
    199         configs)
--> 200     return _global_spark_context.get_or_create_session()
    201 except:
    202     _global_spark_context = None

File <conda_env>/lib/python3.9/site-packages/raydp/context.py:118, in _SparkContext.get_or_create_session(self)
    116 self._prepare_placement_group()
    117 spark_cluster = self._get_or_create_spark_cluster()
--> 118 self._spark_session = spark_cluster.get_spark_session(
    119     self._app_name,
    120     self._num_executors,
    121     self._executor_cores,
    122     self._executor_memory,
    123     self._enable_hive,
    124     self._configs)
    125 return self._spark_session

File <conda_env>/lib/python3.9/site-packages/raydp/spark/ray_cluster.py:87, in SparkCluster.get_spark_session(self, app_name, num_executors, executor_cores, executor_memory, enable_hive, extra_conf)
     84 if enable_hive:
     85     spark_builder.enableHiveSupport()
     86 self._spark_session = \
---> 87     spark_builder.appName(app_name).master(self.get_cluster_url()).getOrCreate()
     88 return self._spark_session

File <conda_env>/lib/python3.9/site-packages/raydp/spark/ray_cluster.py:48, in SparkCluster.get_cluster_url(self)
     47 def get_cluster_url(self) -> str:
---> 48     return ray.get(self._spark_master_handle.get_master_url.remote())

File <conda_env>/lib/python3.9/site-packages/ray/_private/client_mode_hook.py:105, in client_mode_hook.<locals>.wrapper(*args, **kwargs)
    103     if func.__name__ != "init" or is_client_mode_enabled_by_default:
    104         return getattr(ray, func.__name__)(*args, **kwargs)
--> 105 return func(*args, **kwargs)

File <conda_env>/lib/python3.9/site-packages/ray/worker.py:1811, in get(object_refs, timeout)
   1809             raise value.as_instanceof_cause()
   1810         else:
-> 1811             raise value
   1813 if is_individual_id:
   1814     values = values[0]

RayActorError: The actor died unexpectedly before finishing this task.
        class_name: RayDPSparkMaster
        actor_id: 71a6b004bef656ea578e102101000000
        pid: 1940981
        name: RAYDP_SPARK_MASTER
        namespace: 2bf249cf-7492-4ee0-9ddb-b16f64e60c88
        ip: 192.168.16.24
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR_EXIT
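The new root cause is java.lang.NoClassDefFoundError: io/ray/runtime/config/RayConfig, i.e. Ray's own Java jars are not on the app master's classpath. A hedged check of whether the installed ray package bundles them at all (the jars/ location is an assumption based on the PyPI ray wheel layout; a conda-forge build may omit it, which would explain the missing class):

import glob
import os
import ray

# Check whether the installed ray package ships its Java jars.
# Assumption: the PyPI wheel places them under ray/jars.
ray_jars = os.path.join(os.path.dirname(ray.__file__), "jars")
print(ray_jars, "->", sorted(glob.glob(os.path.join(ray_jars, "*.jar"))))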

Hoeze · Jul 20 '22 20:07

This appears to be an issue with the conda-forge ray-core package. Installing ray via pip instead seems to work...
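For anyone hitting the same thing, a minimal sketch of the setup that works once ray comes from PyPI (the pip command and the smoke test are assumptions, not taken from the report; the argument names match raydp.init_spark as shown in the tracebacks above):

# Sketch: install ray from PyPI instead of conda-forge, e.g.
#   pip install "ray[default]==1.12.1" raydp
# then start Spark on Ray as before.
import ray
import raydp

ray.init()
spark = raydp.init_spark(
    app_name="custom_install_test",
    num_executors=1,
    executor_cores=1,
    executor_memory="500M",  # memory string written without a space here
)
print(spark.range(10).count())  # simple smoke test (hypothetical)

raydp.stop_spark()
ray.shutdown()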

Hoeze · Jul 21 '22 09:07

@Hoeze, thanks for the feedback and the update!

carsonwang · Jul 21 '22 09:07