java.lang.StackOverflowError upon raydp.init_spark()
Platform: Ray autoscaled cluster running on K8s (image is based off of rayproject/ray:1.4.1-py37)
raydp: raydp-nightly 2021.7.1.dev0
Java: openjdk version "1.8.0_252"
When trying to run the following code:

```python
import ray
import raydp

ray.init(address="auto")
spark = raydp.init_spark(
    "word_count", num_executors=2, executor_cores=2, executor_memory="1G"
)

df = spark.createDataFrame(
    [("look",), ("spark",), ("tutorial",), ("spark",), ("look",), ("python",)], ["word"]
)
df.show()

word_count = df.groupBy("word").count()
word_count.show()

raydp.stop_spark()
```
I get the following error:

```
2021-07-27 15:20:59,651 INFO worker.py:736 -- Connecting to existing Ray cluster at address: XX.XX.XX.XX:6379
2021-07-27 15:21:01,670 Thread-3 DEBUG null null initializing configuration org.apache.logging.log4j.core.config.builder.impl.BuiltConfiguration@31a8a22c
2021-07-27 15:21:01,671 Thread-3 DEBUG Installed 2 script engines
Traceback (most recent call last):
  File "/data/forecasting/fctk/hacking/ray/raydp/wordcount.py", line 7, in <module>
    "word_count", num_executors=2, executor_cores=2, executor_memory="1G"
  File "/home/ray/.local/lib/python3.7/site-packages/raydp/context.py", line 122, in init_spark
    return _global_spark_context.get_or_create_session()
  File "/home/ray/.local/lib/python3.7/site-packages/raydp/context.py", line 68, in get_or_create_session
    spark_cluster = self._get_or_create_spark_cluster()
  File "/home/ray/.local/lib/python3.7/site-packages/raydp/context.py", line 62, in _get_or_create_spark_cluster
    self._spark_cluster = SparkCluster(self._configs)
  File "/home/ray/.local/lib/python3.7/site-packages/raydp/spark/ray_cluster.py", line 34, in __init__
    self._set_up_master(None, None)
  File "/home/ray/.local/lib/python3.7/site-packages/raydp/spark/ray_cluster.py", line 40, in _set_up_master
    self._app_master_bridge.start_up()
  File "/home/ray/.local/lib/python3.7/site-packages/raydp/spark/ray_cluster_master.py", line 59, in start_up
    self._create_app_master(extra_classpath)
  File "/home/ray/.local/lib/python3.7/site-packages/raydp/spark/ray_cluster_master.py", line 169, in _create_app_master
    self._app_master_java_bridge.startUpAppMaster(extra_classpath)
  File "/home/ray/.local/lib/python3.7/site-packages/py4j/java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/home/ray/.local/lib/python3.7/site-packages/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o0.startUpAppMaster.
: java.lang.StackOverflowError
	at scala.reflect.io.ZipArchive$.scala$reflect$io$ZipArchive$$dirName(ZipArchive.scala:58)
	at scala.reflect.io.ZipArchive.ensureDir(ZipArchive.scala:114)
	at scala.reflect.io.ZipArchive.ensureDir(ZipArchive.scala:114)
[snip]
```
Any help would be appreciated, thanks.
@ssiegel95, can you please check that you have Java installed and JAVA_HOME set properly? Can you also try running pyspark to see if you encounter the same error?
@carsonwang Thanks for the reply. Below are three checks:

- confirming `JAVA_HOME` is defined and points to a valid directory
- confirming pyspark works and the default Spark context (local mode) is valid
- trying to create a Spark context using `raydp.init_spark` (which fails, though with a different error signature than the one I reported above)

Note that this is on Kubernetes with a Ray autoscaled cluster instance.
```
(base) ray@example-cluster-ray-head-w58gg:/opt/java$ echo $JAVA_HOME && ls $JAVA_HOME
/opt/java
ASSEMBLY_EXCEPTION  LICENSE  THIRD_PARTY_README  bin  lib  man  release
```
```
(base) ray@example-cluster-ray-head-w58gg:~$ pyspark
Python 3.7.7 (default, May  7 2020, 21:25:33)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
21/07/28 13:14:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.0.0
      /_/

Using Python version 3.7.7 (default, May  7 2020 21:25:33)
SparkSession available as 'spark'.
>>> sc
<SparkContext master=local[*] appName=PySparkShell>
>>> sc.stop()
>>> import ray
>>> import raydp
>>> ray.__version__
'1.4.1'
>>> raydp.__version__
'0.4.0.dev0'
>>> ray.init(address="auto")
2021-07-28 06:20:07,047 INFO worker.py:736 -- Connecting to existing Ray cluster at address: 172.30.37.18:6379
{'node_ip_address': '172.30.37.18', 'raylet_ip_address': '172.30.37.18', 'redis_address': '172.30.37.18:6379', 'object_store_address': '/tmp/ray/session_2021-07-27_06-27-23_004119_147/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2021-07-27_06-27-23_004119_147/sockets/raylet', 'webui_url': '172.30.37.18:8265', 'session_dir': '/tmp/ray/session_2021-07-27_06-27-23_004119_147', 'metrics_export_port': 61833, 'node_id': '2c5fc2e00f06eabb67db39a7df0a0a7b1845750e7d0dd2ef7f74cf03'}
>>> spark = raydp.init_spark(
...     "word_count", num_executors=2, executor_cores=2, executor_memory="1G"
... )
2021-07-28 06:20:35,394 Thread-3 DEBUG null null initializing configuration org.apache.logging.log4j.core.config.builder.impl.BuiltConfiguration@31a8a22c
2021-07-28 06:20:35,395 Thread-3 DEBUG Installed 2 script engines
Fatal Python error: Cannot recover from stack overflow.

Thread 0x00007efa5effd700 (most recent call first):
  File "/home/ray/anaconda3/lib/python3.7/threading.py", line 300 in wait
  File "/home/ray/anaconda3/lib/python3.7/threading.py", line 552 in wait
  File "/home/ray/.local/lib/python3.7/site-packages/ray/worker.py", line 444 in print_logs
  File "/home/ray/anaconda3/lib/python3.7/threading.py", line 870 in run
  File "/home/ray/anaconda3/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/home/ray/anaconda3/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x00007efa5f7fe700 (most recent call first):
  File "/home/ray/anaconda3/lib/python3.7/threading.py", line 300 in wait
  File "/home/ray/anaconda3/lib/python3.7/threading.py", line 552 in wait
  File "/home/ray/.local/lib/python3.7/site-packages/ray/worker.py", line 1105 in listen_error_messages_raylet
  File "/home/ray/anaconda3/lib/python3.7/threading.py", line 870 in run
  File "/home/ray/anaconda3/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/home/ray/anaconda3/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x00007efa5ffff700 (most recent call first):
  File "/home/ray/anaconda3/lib/python3.7/threading.py", line 300 in wait
  File "/home/ray/anaconda3/lib/python3.7/threading.py", line 552 in wait
  File "/home/ray/.local/lib/python3.7/site-packages/ray/_private/import_thread.py", line 75 in _run
  File "/home/ray/anaconda3/lib/python3.7/threading.py", line 870 in run
  File "/home/ray/anaconda3/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/home/ray/anaconda3/lib/python3.7/threading.py", line 890 in _bootstrap

Current thread 0x00007efccedf8740 (most recent call first):
  File "/home/ray/anaconda3/lib/python3.7/traceback.py", line 473 in __init__
  File "/home/ray/anaconda3/lib/python3.7/traceback.py", line 497 in __init__
  [previous line repeated many times; snipped for brevity]
  File "/home/ray/anaconda3/lib/python3.7/traceback.py", line 104 in print_exception
  File "/home/ray/anaconda3/lib/python3.7/logging/__init__.py", line 566 in formatException
  File "/home/ray/anaconda3/lib/python3.7/logging/__init__.py", line 616 in format
  File "/home/ray/anaconda3/lib/python3.7/logging/__init__.py", line 869 in format
  File "/home/ray/anaconda3/lib/python3.7/logging/__init__.py", line 1025 in emit
  File "/home/ray/anaconda3/lib/python3.7/logging/__init__.py", line 894 in handle
  File "/home/ray/anaconda3/lib/python3.7/logging/__init__.py", line 1586 in callHandlers
  File "/home/ray/anaconda3/lib/python3.7/logging/__init__.py", line 1524 in handle
  File "/home/ray/anaconda3/lib/python3.7/logging/__init__.py", line 1514 in _log
  File "/home/ray/anaconda3/lib/python3.7/logging/__init__.py", line 1407 in error
  File "/home/ray/anaconda3/lib/python3.7/logging/__init__.py", line 1959 in error
  File "/home/ray/anaconda3/lib/python3.7/logging/__init__.py", line 1967 in exception
  File "/home/ray/.local/lib/python3.7/site-packages/pyspark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1051 in send_command
  File "/home/ray/.local/lib/python3.7/site-packages/pyspark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1303 in __call__
  File "/home/ray/.local/lib/python3.7/site-packages/pyspark/sql/utils.py", line 98 in convert_exception
  File "/home/ray/.local/lib/python3.7/site-packages/pyspark/sql/utils.py", line 133 in deco
  File "/home/ray/.local/lib/python3.7/site-packages/pyspark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305 in __call__
  File "/home/ray/.local/lib/python3.7/site-packages/pyspark/sql/utils.py", line 103 in convert_exception
  File "/home/ray/.local/lib/python3.7/site-packages/pyspark/sql/utils.py", line 133 in deco
  File "/home/ray/.local/lib/python3.7/site-packages/pyspark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305 in __call__
  File "/home/ray/.local/lib/python3.7/site-packages/pyspark/sql/utils.py", line 103 in convert_exception
  File "/home/ray/.local/lib/python3.7/site-packages/pyspark/sql/utils.py", line 133 in deco
  [previous three lines repeated many times; snipped for brevity]
  ...
Aborted
```
Not sure if this is relevant, but my min_workers is set to zero, and when I tried this there were no worker pods running yet. I only mention this because I recall seeing in a past raydp issue that there was a problem when an autoscaled cluster used min_workers: 0.

Follow-up: I tried the above with min_workers set to 2 and got the same error.
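For anyone trying to reproduce the min_workers variants, the setting lives in the Ray autoscaler YAML used to launch the cluster. A minimal sketch (cluster name and max_workers value are illustrative, not from this report):

```yaml
# Fragment of a Ray autoscaler config for Kubernetes (illustrative values).
cluster_name: example-cluster
min_workers: 0    # the original report used 0; 2 was also tried with the same result
max_workers: 10
```

Since both min_workers: 0 and min_workers: 2 produced the identical StackOverflowError, the autoscaler setting does not appear to be the trigger.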
It seems this is related to a bug in Scala, but I'm not sure when it can happen. I've upgraded Scala to 2.12.14; please check whether the new nightly release fixes the issue. Otherwise, if you can share how you built your image and help us reproduce the issue, we'll look into it further.
Thanks Carson. After installing the latest nightly release I see an error identical to my original one, namely:

```
py4j.protocol.Py4JJavaError: An error occurred while calling o0.startUpAppMaster.
: java.lang.StackOverflowError
	at scala.reflect.io.ZipArchive.ensureDir(ZipArchive.scala:112)
	at scala.reflect.io.ZipArchive.ensureDir(ZipArchive.scala:114)
	at scala.reflect.io.ZipArchive.ensureDir(ZipArchive.scala:114)
	at scala.reflect.io.ZipArchive.ensureDir(ZipArchive.scala:114)
```
I'm attaching a zip archive with a Dockerfile and a trivial test script that can be used to recreate the issue:

```shell
docker build -t issue168 -f Dockerfile.issue168 .
docker run -it issue168 bash
ray start --head && python issue168.py
```
```
2021-07-29 08:23:49,876 Thread-2 DEBUG Installed 2 script engines
Traceback (most recent call last):
  File "issue168.py", line 7, in <module>
    "word_count", num_executors=2, executor_cores=2, executor_memory="1G"
  File "/home/ray/.local/lib/python3.7/site-packages/raydp/context.py", line 122, in init_spark
    return _global_spark_context.get_or_create_session()
  File "/home/ray/.local/lib/python3.7/site-packages/raydp/context.py", line 68, in get_or_create_session
    spark_cluster = self._get_or_create_spark_cluster()
  File "/home/ray/.local/lib/python3.7/site-packages/raydp/context.py", line 62, in _get_or_create_spark_cluster
    self._spark_cluster = SparkCluster(self._configs)
  File "/home/ray/.local/lib/python3.7/site-packages/raydp/spark/ray_cluster.py", line 34, in __init__
    self._set_up_master(None, None)
  File "/home/ray/.local/lib/python3.7/site-packages/raydp/spark/ray_cluster.py", line 40, in _set_up_master
    self._app_master_bridge.start_up()
  File "/home/ray/.local/lib/python3.7/site-packages/raydp/spark/ray_cluster_master.py", line 59, in start_up
    self._create_app_master(extra_classpath)
  File "/home/ray/.local/lib/python3.7/site-packages/raydp/spark/ray_cluster_master.py", line 169, in _create_app_master
    self._app_master_java_bridge.startUpAppMaster(extra_classpath)
  File "/home/ray/.local/lib/python3.7/site-packages/py4j/java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/home/ray/.local/lib/python3.7/site-packages/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o0.startUpAppMaster.
: java.lang.StackOverflowError
	at scala.reflect.io.ZipArchive.ensureDir(ZipArchive.scala:114)
	at scala.reflect.io.ZipArchive.ensureDir(ZipArchive.scala:114)
```
@ssiegel95 I forgot to mention that the new nightly release is not available yet. It is uploaded automatically every day when there are new commits. Anyway, I'll take a look at your file.
@ssiegel95, I was able to reproduce the issue. It is an issue with Ray 1.4.1. You can use Ray 1.4.0 to bypass it for now.
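Until a fixed release lands, a script could fail fast before calling `raydp.init_spark` when the affected Ray version is installed. The `is_affected_ray` helper below is hypothetical (not part of raydp), and the version cutoff rests only on the observation above that 1.4.1 triggers the crash while 1.4.0 does not:

```python
def is_affected_ray(version: str) -> bool:
    """Return True for the Ray release observed to trigger the
    StackOverflowError with raydp-nightly (1.4.1, per this issue);
    1.4.0 was reported to work."""
    parts = tuple(int(p) for p in version.split(".")[:3])
    return parts == (1, 4, 1)


if __name__ == "__main__":
    import ray

    # Guard before spinning up the Spark-on-Ray app master.
    if is_affected_ray(ray.__version__):
        raise RuntimeError(
            "Ray 1.4.1 hits a StackOverflowError in raydp.init_spark "
            "(see this issue); pin ray==1.4.0 as a workaround."
        )
```

This is only a stopgap so the failure is an actionable message rather than a JVM stack overflow; pinning `ray==1.4.0` in the image is the actual workaround.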
Closing as stale.