spark-on-kubernetes-helm
Maven packages from spark.jars.packages aren't loaded into the executors' classpath
Hi
I'm having an issue loading Maven package dependencies while using SparkMagic and this Helm chart for Livy and Spark on k8s.
In the Spark config I set: spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1
The dependencies are downloaded into /root/.ivy2/jars but are not included in the Spark classpath, and when trying to execute an action I get the following error:
21/01/05 11:22:15 INFO DAGScheduler: ShuffleMapStage 1 (take at
Do you have any suggestions ?
Thanks
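For reference, the setup described above can be sketched as a Livy create-session payload. The session kind and the Livy endpoint in the comment are assumptions; the conf entry is the one from this report:

```python
import json

# Sketch of the Livy create-session body described above.
# "pyspark" as the session kind is an assumption; the
# spark.jars.packages coordinate is the one from the report.
livy_payload = {
    "kind": "pyspark",
    "conf": {
        "spark.jars.packages":
            "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1",
    },
}
body = json.dumps(livy_payload)

# This body would be POSTed to Livy's /sessions endpoint, e.g.:
# requests.post("http://<livy-host>:8998/sessions", data=body,
#               headers={"Content-Type": "application/json"})
```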
Hi @BenMizrahiPlarium , one question to clarify your setup: do you configure spark.jars.packages in code through SparkConf? Or how do you do that?
Hi @jahstreet,
I did the setup using the config section of the Livy create-session request. In other use cases I think the jar artifacts downloaded from Maven are shared via HDFS and available to the driver and workers; in this use case they are only available in the driver's local Maven repository.
I see that the artifacts are downloaded, but they are not available on the executors' classpath.
If you have any idea, it would be very helpful :)
Currently running into the same issue.
I'm trying the session parameters below (which work with Spark 2).
I can see in the driver log messages that the package is downloaded from maven but it's not passed to the executors.
session_params = {
    "kind": "pyspark",
    "driverCores": 2,
    "driverMemory": "12g",
    "executorCores": 7,
    "executorMemory": "42g",
    "numExecutors": 1,
    "conf": {
        "spark.jars.packages": "mysql:mysql-connector-java",
    },
}
The jar is also not uploaded to spark.kubernetes.file.upload.path (using an S3 bucket), but I don't know if this is the expected behavior.
Passing the HTTP link to the file in the jars parameter doesn't work either.
session_params = {
    "kind": "pyspark",
    "driverCores": 2,
    "driverMemory": "12g",
    "executorCores": 7,
    "executorMemory": "42g",
    "numExecutors": 1,
    "jars": ["https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.48/mysql-connector-java-5.1.48.jar"]
}
I really think it's because Spark has no shared file system between workers and the driver.
As a workaround for now, I solved it using gcsfuse: mounting a Google Cloud Storage bucket in the driver and workers, pointing both local Maven repositories to that folder, and adding the folder to Spark's extra classpath.
So finally the bucket is mounted at /etc/google on both the driver and the executors, the Maven jars point to /etc/google/jars, and the Spark config includes:
spark.driver.extraClassPath=/etc/google/jars/* spark.executor.extraClassPath=/etc/google/jars/*
It works, but my problem with this is that it's not temporary: it's persistent across all sessions, and it's not isolated per user.
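For clarity, the workaround above boils down to the following session conf fragment (a sketch: the /etc/google mount path mirrors the comment, and the property names are the standard Spark ones):

```python
# Conf fragment for the gcsfuse workaround: the GCS bucket is mounted at
# /etc/google on driver and executors, the resolved Maven jars live in
# /etc/google/jars, and both classpaths point at that shared directory.
workaround_conf = {
    "spark.driver.extraClassPath": "/etc/google/jars/*",
    "spark.executor.extraClassPath": "/etc/google/jars/*",
}
```

Because the classpath entries are static, every session sees the same shared jars directory, which is exactly the "persistent, not isolated per user" drawback noted above.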
I understood from the documentation that spark.kubernetes.file.upload.path should be used for sharing the files between driver and executors.
Indeed, this works for the Livy-related jars.
I set my spark.kubernetes.file.upload.path to s3a://somosdigital-datascience/spark3/
These are the Livy logs when creating a new session:
21/01/18 20:26:20 INFO LineBufferedStream: 21/01/18 20:26:20 INFO KubernetesUtils: Uploading file: /opt/livy/rsc-jars/asm-5.0.4.jar to dest: s3a://somosdigital-datascience/spark3//spark-upload-6bb1a42f-082e-4621-8b6c-75e99ac92a76/asm-5.0.4.jar...
21/01/18 20:26:20 INFO LineBufferedStream: 21/01/18 20:26:20 INFO KubernetesUtils: Uploading file: /opt/livy/rsc-jars/livy-api-0.8.0-incubating-SNAPSHOT.jar to dest: s3a://somosdigital-datascience/spark3//spark-upload-b3a531b6-8c66-4e30-9b7a-30ffbf27dd8f/livy-api-0.8.0-incubating-SNAPSHOT.jar...
21/01/18 20:26:21 INFO LineBufferedStream: 21/01/18 20:26:21 INFO KubernetesUtils: Uploading file: /opt/livy/rsc-jars/livy-rsc-0.8.0-incubating-SNAPSHOT.jar to dest: s3a://somosdigital-datascience/spark3//spark-upload-738f5605-e3e0-499c-971a-7f8068be346d/livy-rsc-0.8.0-incubating-SNAPSHOT.jar...
21/01/18 20:26:21 INFO LineBufferedStream: 21/01/18 20:26:21 INFO KubernetesUtils: Uploading file: /opt/livy/rsc-jars/livy-thriftserver-session-0.8.0-incubating-SNAPSHOT.jar to dest: s3a://somosdigital-datascience/spark3//spark-upload-0f275439-d128-4388-b59e-e66da063deb7/livy-thriftserver-session-0.8.0-incubating-SNAPSHOT.jar...
21/01/18 20:26:22 INFO LineBufferedStream: 21/01/18 20:26:22 INFO KubernetesUtils: Uploading file: /opt/livy/rsc-jars/minlog-1.3.0.jar to dest: s3a://somosdigital-datascience/spark3//spark-upload-6d2c2679-8e89-4096-97d0-4dae83ada91c/minlog-1.3.0.jar...
21/01/18 20:26:22 INFO LineBufferedStream: 21/01/18 20:26:22 INFO KubernetesUtils: Uploading file: /opt/livy/rsc-jars/netty-all-4.1.47.Final.jar to dest: s3a://somosdigital-datascience/spark3//spark-upload-7ee64aca-6e94-43c3-8131-36d62bca562a/netty-all-4.1.47.Final.jar...
21/01/18 20:26:22 INFO LineBufferedStream: 21/01/18 20:26:22 INFO KubernetesUtils: Uploading file: /opt/livy/rsc-jars/objenesis-2.5.1.jar to dest: s3a://somosdigital-datascience/spark3//spark-upload-82bb5496-d3aa-47cd-8a4e-58178e3d8b36/objenesis-2.5.1.jar...
21/01/18 20:26:23 INFO LineBufferedStream: 21/01/18 20:26:23 INFO KubernetesUtils: Uploading file: /opt/livy/rsc-jars/reflectasm-1.11.3.jar to dest: s3a://somosdigital-datascience/spark3//spark-upload-4ff31636-583b-485e-b59d-a16006798b17/reflectasm-1.11.3.jar...
21/01/18 20:26:23 INFO LineBufferedStream: 21/01/18 20:26:23 INFO KubernetesUtils: Uploading file: /opt/livy/repl_2.12-jars/commons-codec-1.9.jar to dest: s3a://somosdigital-datascience/spark3//spark-upload-5f9c3fba-8357-4773-b8da-a350f9314f02/commons-codec-1.9.jar...
21/01/18 20:26:24 INFO LineBufferedStream: 21/01/18 20:26:24 INFO KubernetesUtils: Uploading file: /opt/livy/repl_2.12-jars/livy-core_2.12-0.8.0-incubating-SNAPSHOT.jar to dest: s3a://somosdigital-datascience/spark3//spark-upload-6dc3f8a0-147d-4f71-ac57-2cdbb1c76c00/livy-core_2.12-0.8.0-incubating-SNAPSHOT.jar...
21/01/18 20:26:24 INFO LineBufferedStream: 21/01/18 20:26:24 INFO KubernetesUtils: Uploading file: /opt/livy/repl_2.12-jars/livy-repl_2.12-0.8.0-incubating-SNAPSHOT.jar to dest: s3a://somosdigital-datascience/spark3//spark-upload-4ed754bc-55ed-41f5-84df-9a846589ba89/livy-repl_2.12-0.8.0-incubating-SNAPSHOT.jar...
This is the relevant snippet from the driver ConfigMap. I can also see from the driver and executor logs that both of them download the jar files:
spark.kubernetes.file.upload.path=s3a\://somosdigital-datascience/spark3/
spark.jars=s3a\://somosdigital-datascience/spark3//spark-upload-6bb1a42f-082e-4621-8b6c-75e99ac92a76/asm-5.0.4.jar,s3a\://somosdigital-datascience/spark3//spark-upload-b3a531b6-8c66-4e30-9b7a-30ffbf27dd8f/livy-api-0.8.0-incubating-SNAPSHOT.jar,s3a\://somosdigital-datascience/spark3//spark-upload-738f5605-e3e0-499c-971a-7f8068be346d/livy-rsc-0.8.0-incubating-SNAPSHOT.jar,s3a\://somosdigital-datascience/spark3//spark-upload-0f275439-d128-4388-b59e-e66da063deb7/livy-thriftserver-session-0.8.0-incubating-SNAPSHOT.jar,s3a\://somosdigital-datascience/spark3//spark-upload-6d2c2679-8e89-4096-97d0-4dae83ada91c/minlog-1.3.0.jar,s3a\://somosdigital-datascience/spark3//spark-upload-7ee64aca-6e94-43c3-8131-36d62bca562a/netty-all-4.1.47.Final.jar,s3a\://somosdigital-datascience/spark3//spark-upload-82bb5496-d3aa-47cd-8a4e-58178e3d8b36/objenesis-2.5.1.jar,s3a\://somosdigital-datascience/spark3//spark-upload-4ff31636-583b-485e-b59d-a16006798b17/reflectasm-1.11.3.jar,s3a\://somosdigital-datascience/spark3//spark-upload-5f9c3fba-8357-4773-b8da-a350f9314f02/commons-codec-1.9.jar,s3a\://somosdigital-datascience/spark3//spark-upload-6dc3f8a0-147d-4f71-ac57-2cdbb1c76c00/livy-core_2.12-0.8.0-incubating-SNAPSHOT.jar,s3a\://somosdigital-datascience/spark3//spark-upload-4ed754bc-55ed-41f5-84df-9a846589ba89/livy-repl_2.12-0.8.0-incubating-SNAPSHOT.jar
but the jars passed via the jars session parameter or the spark.jars.packages config don't get uploaded.
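Assuming the package resolution step is the part that never reaches the upload path, one possible workaround is to compute the Maven Central URL for the coordinate yourself and stage the jar somewhere every pod can read, then pass that URL instead of the bare coordinate. The helper below is hypothetical, not part of Livy or Spark:

```python
def maven_central_url(coordinate: str) -> str:
    """Translate a 'group:artifact:version' coordinate into its Maven
    Central download URL, so the jar can be pre-fetched and staged
    (e.g. under the spark.kubernetes.file.upload.path bucket) instead
    of relying on spark.jars.packages resolution."""
    group, artifact, version = coordinate.split(":")
    return (
        "https://repo1.maven.org/maven2/"
        f"{group.replace('.', '/')}/{artifact}/{version}/"
        f"{artifact}-{version}.jar"
    )
```

For example, maven_central_url("mysql:mysql-connector-java:5.1.48") yields the same URL that was tried in the jars parameter above.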
I don't think spark.kubernetes.file.upload.path is related to external Maven dependencies; it's related to the Spark application jars uploaded to S3.
Spark downloads Maven dependencies into the local repository and uses them on the classpath.
As far as I can see, the jars are loaded into the driver, and everything done by the driver itself works fine, but when the actual task is executed on one of the executors it fails with a ClassNotFound exception, which means the jar isn't available on the executor's classpath.
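One way to confirm the driver-side half of that diagnosis is to check on the driver pod whether the resolved jar actually landed in the Ivy cache. This sketch assumes Spark's default Ivy retrieval layout (organization_artifact-version.jar under ~/.ivy2/jars, matching the /root/.ivy2/jars path from the first comment); the helper itself is hypothetical:

```python
import os

def ivy_jar_path(coordinate: str, ivy_home: str = "/root/.ivy2") -> str:
    """Expected cache location of a spark.jars.packages artifact on the
    driver, assuming Spark's default Ivy retrieval pattern:
    <ivy_home>/jars/<organization>_<artifact>-<version>.jar"""
    group, artifact, version = coordinate.split(":")
    return f"{ivy_home}/jars/{group}_{artifact}-{version}.jar"

# e.g. run on the driver pod:
# os.path.exists(ivy_jar_path("org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1"))
```

If the file exists on the driver but the same check fails inside an executor pod, that matches the behavior described above: resolution succeeds locally, but nothing distributes the jar to the executors.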