spark-operator
java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem
Hi, I get the following error when I submit my Spark job to a k8s cluster using the Spark Operator.
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem
My YAML config is below. I've made sure "hadoop-aws" and "aws-java-sdk" have compatible versions. I was able to run the job successfully for the "pi.py" script available at "/opt/spark/examples/src/main/python/pi.py" in the Spark Operator image. However, when Spark in my Python script tries to read a CSV file from an AWS S3 bucket, I get the error shown above. I've tried many different versions of hadoop-aws, and none of them resolved the issue. Could you please help me out?
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: test-spark-hadoop-aws-3.2.3
  namespace: default
spec:
  deps:
    repositories:
      - https://repo.maven.apache.org/maven2/
    packages:
      - org.apache.hadoop:hadoop-aws:3.2.3
      - org.apache.hadoop:hadoop-common:3.2.3
      - com.amazonaws:aws-java-sdk:1.11.901
  sparkConf:
    spark.driver.extraJavaOptions: "-Divy.cache.dir=/tmp -Divy.home=/tmp -Dcom.amazonaws.services.s3.enableV4=true"
    spark.executor.extraJavaOptions: "-Dcom.amazonaws.services.s3.enableV4=true"
  hadoopConf:
    fs.s3a.aws.credentials.provider: "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider"
    fs.s3a.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
    fs.s3a.access.key: "my AWS access key"
    fs.s3a.secret.key: "my AWS secret key"
    fs.s3a.endpoint: "s3.us-east-1.amazonaws.com"
  type: Python
  pythonVersion: "3"
  mode: cluster
  image: "gcr.io/spark-operator/spark-py:v3.1.1-hadoop3"
  imagePullPolicy: Always
  mainApplicationFile: s3a://mybucket/script.py
  arguments: ['s3a://mybucket/test.csv']
  sparkVersion: "3.1.1"
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "4G"
    labels:
      version: 3.1.1
    serviceAccount: my-serviceAccount
  executor:
    cores: 1
    instances: 1
    memory: "4G"
    labels:
      version: 3.1.1
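For context, the failing part of script.py is just a standard PySpark CSV read over an s3a:// path; a rough sketch of what it does (the header option and exact calls are my best guess, not the real script):

import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3a-csv-read").getOrCreate()

# Reading any s3a:// path makes Hadoop instantiate org.apache.hadoop.fs.s3a.S3AFileSystem,
# which is the class the ClassNotFoundException complains about when hadoop-aws is missing.
df = spark.read.option("header", "true").csv(sys.argv[1])  # s3a://mybucket/test.csv
df.show(5)

spark.stop()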
Following up on my previous comment, I should mention that the kubectl logs show the following download statistics:
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 253 | 251 | 251 | 2 || 251 | 251 |
---------------------------------------------------------------------
It also shows the following lines:
:: problems summary ::
:::: ERRORS
SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/hadoop/hadoop-main/3.2.3/hadoop-main-3.2.3.jar
SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/hadoop/hadoop-project/3.2.3/hadoop-project-3.2.3.jar
SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/com/amazonaws/aws-java-sdk-pom/1.11.901/aws-java-sdk-pom-1.11.901.jar
SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/com/amazonaws/aws-java-sdk/1.11.901/aws-java-sdk-1.11.901-sources.jar
SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/com/amazonaws/aws-java-sdk/1.11.901/aws-java-sdk-1.11.901-src.jar
SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/com/amazonaws/aws-java-sdk/1.11.901/aws-java-sdk-1.11.901-javadoc.jar
SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/apache/13/apache-13.jar
SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/commons/commons-parent/28/commons-parent-28.jar
SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/apache/21/apache-21.jar
SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/httpcomponents/httpcomponents-parent/11/httpcomponents-parent-11.jar
SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/httpcomponents/httpcomponents-client/4.5.13/httpcomponents-client-4.5.13.jar
SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/httpcomponents/httpcomponents-core/4.4.13/httpcomponents-core-4.4.13.jar
SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/apache/18/apache-18.jar
SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/commons/commons-parent/42/commons-parent-42.jar
SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/com/fasterxml/oss-parent/24/oss-parent-24.jar
SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/com/fasterxml/jackson/jackson-parent/2.6.2/jackson-parent-2.6.2.jar
SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/com/fasterxml/jackson/core/jackson-databind/2.6.7.3/jackson-databind-2.6.7.3-javadoc.jar
SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/com/fasterxml/oss-parent/23/oss-parent-23.jar
SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/com/fasterxml/jackson/jackson-parent/2.6.1/jackson-parent-2.6.1.jar
SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/sonatype/oss/oss-parent/9/oss-parent-9.jar
SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/io/netty/netty-parent/4.1.48.Final/netty-parent-4.1.48.Final.jar
SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/com/amazonaws/aws-java-sdk-models/1.11.901/aws-java-sdk-models-1.11.901-javadoc.jar
SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/com/amazonaws/aws-java-sdk-pom/1.11.22/aws-java-sdk-pom-1.11.22.jar
:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS ::
retrieving :: org.apache.spark#spark-submit-parent-0d835d60-447d-4a98-941b-e22aaa69903c
confs: [default]
251 artifacts copied, 0 already retrieved (374746kB/445ms)
22/04/09 21:34:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/04/09 21:34:56 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
This might be related?
JFrog to Shut down JCenter and Bintray https://www.infoq.com/news/2021/02/jfrog-jcenter-bintray-closure/
Thanks for pointing me to the link about Bintray. What should I put in my YAML file so that nothing is downloaded from JCenter/Bintray? In my YAML file I set "https://repo.maven.apache.org/maven2/" for ".spec.deps.repositories"; I was hoping that with this setting no reference to Bintray would be made.
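From what I can tell, .spec.deps.repositories only adds repositories on top of Spark's built-in Ivy resolvers (which still include the old spark-packages Bintray repo), so one option might be to supply a custom Ivy settings file through spark.jars.ivySettings instead. A rough, untested sketch, assuming the settings file is baked into the driver image at the path shown:

sparkConf:
  spark.jars.ivySettings: "/opt/spark/conf/ivysettings.xml"

# /opt/spark/conf/ivysettings.xml -- resolve everything from Maven Central only
<ivysettings>
  <settings defaultResolver="central"/>
  <resolvers>
    <ibiblio name="central" m2compatible="true" root="https://repo.maven.apache.org/maven2/"/>
  </resolvers>
</ivysettings>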
Have you solved this problem yet? I'm hitting a similar problem. Class org.apache.hadoop.fs.s3a.S3AFileSystem is found now, but I get java.lang.ClassNotFoundException: org.apache.hadoop.fs.StreamCapabilities.
@Zhang-Aoqi Same issue here, and I finally figured it out! That error does not come from your Spark app; it happens in your spark-operator pod. In my case, my Spark app depends on Hadoop 3.2, but the spark-operator pod I installed using Helm ships Hadoop 2.7 jar files. Please check whether yours match :)
Oh, I just solved this problem too. I swapped out the Spark version running in the pod. It seems we are working on similar tasks; if you run into any problems, feel free to reach out.
@Zhang-Aoqi Good to know, thanks! So did you change your Spark version?
@hyungryuk I used Spark-3.0.0-bin-hadoop-3.2 as the image in the pod, and added aws-java-sdk-bundle-1.11.375.jar and hadoop-aws-3.2.0.jar. I'm a newbie, so I hope this makes sense.
@Zhang-Aoqi Alright! If this error happens again, try building your spark-operator with this image:
gcr.io/spark-operator/spark:v3.1.1-hadoop3
A similar issue is tracked here: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/1334
@hyungryuk Thanks for providing your comments. I'll check with our engineer who set up the Kubernetes cluster and installed the Spark pod, and I'll let everyone in this thread know about the result. I hope your solution resolves my issue.
@Zhang-Aoqi Thanks for your comments. I'll work on the resolutions suggested by @hyungryuk and will let you know the outcome.
@hyungryuk I've noticed that we used the "spark-operator-chart-1.1.19" Helm chart to install spark-operator on the k8s cluster. How can I find the version of the Hadoop jar files shipped in the image that this Helm chart deploys?
@farshadsm just exec into your spark-operator pod and run the following command :)
ls /opt/spark/jars | grep hadoop
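For example (the namespace and deployment name below are placeholders; use whatever your Helm release created):

# List the Hadoop jars bundled in the operator image; the version you see here
# is the version your hadoop-aws jar has to match.
kubectl -n spark-operator exec -it deploy/spark-operator -- ls /opt/spark/jars | grep hadoop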
Ok, thank you
I met the same error.
My solution:
I built our own Spark image based on spark-3.2.1-bin-hadoop3.2.tgz,
then added these jars under $SPARK_HOME/jars/:
- hadoop-aws-3.3.1.jar
- aws-java-sdk-core-1.12.150.jar
- aws-java-sdk-dynamodb-1.12.187.jar
- aws-java-sdk-s3-1.12.150.jar
At present it is still in the test stage, but it works; no problems have been found so far.
Hope it helps
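A minimal Dockerfile sketch of the same idea, using the gcr.io spark-py image from the original manifest as the base (the tag, jar versions, and the assumption that curl exists in the base image are mine; hadoop-aws must match the Hadoop version bundled in the image, and aws-java-sdk-bundle must match what that hadoop-aws release was built against):

FROM gcr.io/spark-operator/spark-py:v3.1.1-hadoop3

ENV SPARK_HOME /opt/spark

# The official images run as a non-root user; switch to root to write into /opt/spark/jars.
USER root

# Spark 3.1.1 "hadoop3" builds should bundle Hadoop 3.2.x, so pair hadoop-aws 3.2.0
# with the aws-java-sdk-bundle version it was built against (1.11.375).
RUN curl -fL https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.2.0/hadoop-aws-3.2.0.jar \
      -o ${SPARK_HOME}/jars/hadoop-aws-3.2.0.jar && \
    curl -fL https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.375/aws-java-sdk-bundle-1.11.375.jar \
      -o ${SPARK_HOME}/jars/aws-java-sdk-bundle-1.11.375.jar

# Drop back to the non-root Spark user used by the official images.
USER 185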
Hi, would you mind sharing the Docker image? I tried copying the jars as suggested, but I still encounter the issue :(
I created a public image :) registry.gitlab.com/upmkuhn/spark-operator:v3-2-hadoop3-aws-2
@upMKuhn allenhaozi/base-pyspark-3.2.1-py-v3.8:v0.1.0
It works in my testing. I also built my own image, but it's quite heavy. Could you share your Dockerfile for reference? Thanks a lot.
@Xxxxxyd Here are my Dockerfile and docs: https://github.com/Wh1isper/spark-build. Try: wh1isper/spark-executor:3.4.1
I am also developing sparglim as a tool to configure PySpark quickly and conveniently for PySpark-based apps; see: https://github.com/Wh1isper/sparglim
@Zhang-Aoqi Good to know! Thanx. So Did you change your spark version?
I'm having the same error but I couldn't resolve it. Could you help, please?
I had the same problem (org.apache.hadoop.fs.s3a.S3AFileSystem not found) when I tried:
deps:
  files:
    - "s3a://k8s-3c172e28d7da2e-bucket/test.jar"
Even adding the jar files inside image: "image-registry/spark-base-image" did not work. But I fixed the problem when I added the necessary jars inside the SPARK-OPERATOR pod. You can rebuild your Docker image by adding the jars. I rebuilt it:
FROM ghcr.io/googlecloudplatform/spark-operator:v1beta2-1.3.7-3.1.1
ENV SPARK_HOME /opt/spark
RUN curl https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.4/hadoop-aws-2.7.4.jar -o ${SPARK_HOME}/jars/hadoop-aws-2.7.4.jar
RUN curl https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar -o ${SPARK_HOME}/jars/aws-java-sdk-1.7.4.jar
The spark-operator image ships Hadoop 2.7 internally, so all dependencies must match exactly that version (see https://mvnrepository.com/).
First, for testing, I went inside the spark-operator pod with the command
kubectl exec -it spark-operator-fb8f779cb-gt657 -n spark-operator -- bash
Then, inside the spark-operator pod, I went to /opt/spark/jars and downloaded the jars (for example, curl https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.4/hadoop-aws-2.7.4.jar).
Then I applied my manifest with deps.files and it worked.
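For reference, the quick in-pod test described above looks roughly like this (the pod name is just an example, and jars added this way are lost as soon as the pod restarts):

kubectl exec -it spark-operator-fb8f779cb-gt657 -n spark-operator -- bash
# then, inside the pod:
cd /opt/spark/jars
curl -O https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.4/hadoop-aws-2.7.4.jar
curl -O https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar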