Support for spark.jars.packages is broken?
I believe this is the same issue as this one.
tl;dr: if a developer specifies jar dependencies through `spark.jars.packages`, these packages won't be distributed to the worker nodes (or perhaps they are distributed but not placed on the classpath). A simple way to reproduce this is to use RayDP to read a file from S3 with the following code:
```python
import ray
import raydp

ray.init(address='auto')
spark = raydp.init_spark(app_name='RayDP Example',
                         num_executors=2,
                         executor_cores=4,
                         executor_memory=8 * 1024 * 1024 * 1024,
                         configs={'spark.jars.packages': 'org.apache.hadoop:hadoop-aws:3.3.1'})

# Read from S3
df = spark.read.json(path='s3a://....')
df.show()
```
and you will get an error:

```
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
	at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2571)
	at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2665)
	... 40 more
```
If I use a script to download the jar in the setup commands and then set the path with `spark.jars`, it works, but I don't feel that's the best way to handle this.
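For reference, this is roughly what that workaround looks like (a minimal sketch; the download step and jar path are assumptions about our setup, and `spark.jars` expects the path to exist on every node):

```python
import os
import urllib.request

import ray
import raydp

# Hypothetical: fetch hadoop-aws to a fixed local path. In practice this ran
# in the cluster's setup commands, so every node already has the jar.
jar_path = os.path.expanduser('~/jars/hadoop-aws-3.3.1.jar')
os.makedirs(os.path.dirname(jar_path), exist_ok=True)
if not os.path.exists(jar_path):
    urllib.request.urlretrieve(
        'https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.1/hadoop-aws-3.3.1.jar',
        jar_path)

ray.init(address='auto')
spark = raydp.init_spark(app_name='RayDP Example',
                         num_executors=2,
                         executor_cores=4,
                         executor_memory=8 * 1024 * 1024 * 1024,
                         configs={'spark.jars': jar_path})
```

Note that `spark.jars.packages` would also resolve hadoop-aws's transitive dependency aws-java-sdk-bundle; with `spark.jars` that dependency has to be downloaded by hand as well.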
Any help will be appreciated.
Hi @pang-wu, thanks for opening this issue. We'll look into it. It seems like some options are not copied to the executors.
Awesome! Please let me know if you need anything from me. We are actively using RayDP for a lot of things and are looking forward to the tool getting better! @kira-lin
Did you ever figure out how to get S3AFileSystem recognized by the executors?
I needed something similar for GCS. What worked for me was adding the following to `setup_commands`:

```yaml
- mkdir -p ~/jars
- wget -P ~/jars https://repo1.maven.org/maven2/com/google/cloud/bigdataoss/gcs-connector/hadoop3-2.2.14/gcs-connector-hadoop3-2.2.14-shaded.jar
```
and then in the Spark conf:

```
"spark.driver.extraClassPath": "/home/ubuntu/jars/gcs-connector-hadoop3-2.2.14-shaded.jar",
"spark.executor.extraClassPath": "/home/ubuntu/jars/gcs-connector-hadoop3-2.2.14-shaded.jar",
"spark.hadoop.fs.AbstractFileSystem.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
"spark.hadoop.fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
```
It looks like hadoop-aws does not release a shaded JAR (see https://issues.apache.org/jira/browse/HADOOP-15387 and https://issues.apache.org/jira/browse/HADOOP-15924), but one path forward would be to build your own shaded JAR and then repeat similar steps.
I also noticed that when using `spark.jars.packages`, the JARs were placed in `~/.ivy2/jars`, so I added `~/.ivy2/jars/*` to the `extraClassPath`, but that did not work for some reason; perhaps someone else will have better luck than me :(.
> Did you ever figure out how to get S3AFileSystem recognized by the executors?
@abhay-agarwal We have a working solution in prod (AWS S3): we place the needed jars under the `jars` directory of our standalone Spark deployment (we are not using the pyspark installed from pip). In that case no extra classpath needs to be set (we do the same to enable Glue support). Some scripts for your reference:
```sh
wget -nc https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.1/hadoop-aws-3.3.1.jar -P "$SPARK_INSTALL_DIR/jars"
wget -nc https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.901/aws-java-sdk-bundle-1.11.901.jar -P "$SPARK_INSTALL_DIR/jars"
wget -nc https://repo1.maven.org/maven2/org/wildfly/openssl/wildfly-openssl/1.0.7.Final/wildfly-openssl-1.0.7.Final.jar -P "$SPARK_INSTALL_DIR/jars"
wget -nc https://repo1.maven.org/maven2/org/apache/spark/spark-hadoop-cloud_2.12/${SPARK_VERSION}/spark-hadoop-cloud_2.12-${SPARK_VERSION}.jar -P "$SPARK_INSTALL_DIR/jars"
```
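With the jars baked into the Spark installation like that, no jar-related conf is needed at all; reading from S3 is just (a minimal sketch, with a hypothetical bucket and credentials coming from the usual AWS provider chain):

```python
import ray
import raydp

ray.init(address='auto')
# No spark.jars / spark.jars.packages needed: hadoop-aws and the AWS SDK
# bundle are already on every JVM's classpath via $SPARK_INSTALL_DIR/jars.
spark = raydp.init_spark(app_name='RayDP S3 Example',
                         num_executors=2,
                         executor_cores=4,
                         executor_memory=8 * 1024 * 1024 * 1024)

df = spark.read.json('s3a://my-bucket/events/')  # hypothetical path
df.show()
```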
Cool! This is exactly what we ended up doing as well, haha. We use Docker on KubeRay, so we just put similar commands in the Dockerfiles.
Gentle ping @pang-wu, sorry for the late reply to this issue. I tested with the code you provided above, using 2 nodes:

- Node A is the Ray head node; all 4 jars were downloaded locally, and the client runs on this node.
- Node B is the worker node, without any jars; the raydp-java-worker runs on this node.

The result is that the JSON files in my bucket can be read correctly and everything works well. So I looked into the raydp-java-worker log to find out how jars are copied to other worker nodes. Here are the 4 steps as I concluded:
1. The client opens a port like `spark://IP:PORT`.
2. The executor connects to that port and fetches the jars into a local temporary directory.
3. The executor copies the jars from the temporary location to its working directory.
4. The jars are loaded through the class loader.
This is the executor log for `org.apache.hadoop_hadoop-aws-3.3.1.jar`:
```
2023-07-13 06:11:19,076 INFO Executor [dispatcher-Executor]: Fetching spark://10.0.2.140:44539/jars/org.apache.hadoop_hadoop-aws-3.3.1.jar with timestamp 1689228852787
2023-07-13 06:11:19,077 INFO Utils [dispatcher-Executor]: Fetching spark://10.0.2.140:44539/jars/org.apache.hadoop_hadoop-aws-3.3.1.jar to /tmp/ray/session_2023-07-13_06-12-37_355158_3120799/app-20230713061112806/1/spark-3b7a118e-a4b0-4e2a-a92b-44da3b8586ad/fetchFileTemp2388565500155871038.tmp
2023-07-13 06:11:19,089 INFO Utils [dispatcher-Executor]: Copying /tmp/ray/session_2023-07-13_06-12-37_355158_3120799/app-20230713061112806/1/spark-3b7a118e-a4b0-4e2a-a92b-44da3b8586ad/-16085021461689228852787_cache to /tmp/ray/session_2023-07-13_06-12-37_355158_3120799/app-20230713061112806/1/_tmp/org.apache.hadoop_hadoop-aws-3.3.1.jar
2023-07-13 06:11:19,094 INFO Executor [dispatcher-Executor]: Adding file:/tmp/ray/session_2023-07-13_06-12-37_355158_3120799/app-20230713061112806/1/_tmp/org.apache.hadoop_hadoop-aws-3.3.1.jar to class loader
```
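To double-check the same thing from Python (a sketch, not part of the original test; the `_jvm` access goes through PySpark's internal py4j gateway, so treat it as an assumption about internals):

```python
# Driver side: confirm the S3A filesystem class is visible to the driver JVM.
spark.sparkContext._jvm.java.lang.Class.forName(
    'org.apache.hadoop.fs.s3a.S3AFileSystem')

# Executor side: any action that touches S3 forces each executor to load the
# class from the jars it fetched in steps 2-4 above.
spark.read.json('s3a://my-bucket/events/').count()  # hypothetical path
```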
IIUC, this issue no longer exists, right?
I tried with Spark 3.3.2 and Spark 3.2.1; both passed the test. I got errors (the same as in this issue) on all Spark 3.1.x versions, since the Hadoop jars do not match.
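For anyone hitting the 3.1.x failures: the hadoop-aws version has to match the Hadoop version the Spark build bundles. A sketch of checking that before choosing a package (the `VersionInfo` call also goes through the internal py4j gateway, so treat it as an assumption about internals):

```python
# Ask the driver JVM which Hadoop version Spark bundles; the hadoop-aws
# artifact in spark.jars.packages should use the same version string.
hadoop_version = (spark.sparkContext._jvm
                  .org.apache.hadoop.util.VersionInfo.getVersion())
print(hadoop_version)  # e.g. '3.3.1' -> use org.apache.hadoop:hadoop-aws:3.3.1
```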
@Deegue Thanks for circling back. We just tested on our side (Spark 3.3.2 + Ray 2.4.0 + RayDP master custom build) and this is fixed! I will close this issue.
For reference, the Ivy resolution output from the successful run:

```
Ivy Default Cache set to: /home/ubuntu/.ivy2/cache
The jars for the packages stored in: /home/ubuntu/.ivy2/jars
org.apache.hadoop#hadoop-aws added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-b015f014-fa76-426b-a3b9-e17cee6f7b26;1.0
	confs: [default]
	found org.apache.hadoop#hadoop-aws;3.3.1 in central
	found com.amazonaws#aws-java-sdk-bundle;1.11.901 in central
	found org.wildfly.openssl#wildfly-openssl;1.0.7.Final in central
downloading https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.1/hadoop-aws-3.3.1.jar ...
	[SUCCESSFUL ] org.apache.hadoop#hadoop-aws;3.3.1!hadoop-aws.jar (100ms)
downloading https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.901/aws-java-sdk-bundle-1.11.901.jar ...
	[SUCCESSFUL ] com.amazonaws#aws-java-sdk-bundle;1.11.901!aws-java-sdk-bundle.jar (1500ms)
downloading https://repo1.maven.org/maven2/org/wildfly/openssl/wildfly-openssl/1.0.7.Final/wildfly-openssl-1.0.7.Final.jar ...
	[SUCCESSFUL ] org.wildfly.openssl#wildfly-openssl;1.0.7.Final!wildfly-openssl.jar (12ms)
:: resolution report :: resolve 3492ms :: artifacts dl 1617ms
	:: modules in use:
	com.amazonaws#aws-java-sdk-bundle;1.11.901 from central in [default]
	org.apache.hadoop#hadoop-aws;3.3.1 from central in [default]
	org.wildfly.openssl#wildfly-openssl;1.0.7.Final from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   3   |   3   |   3   |   0   ||   3   |   3   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-b015f014-fa76-426b-a3b9-e17cee6f7b26
	confs: [default]
	3 artifacts copied, 0 already retrieved (189078kB/165ms)
```