spark-operator

Add support for Spark 3.3.0

Open josecsotomorales opened this issue 2 years ago • 11 comments

Spark 3.3.0 was released, I have a custom image of 3.2.1 running with the operator without any issues, and will experiment with 3.3.0 this week, will post the results here.

josecsotomorales avatar Jun 28 '22 01:06 josecsotomorales

@josecsotomorales Awesome. Looking forward to this.

dcoliversun avatar Jun 29 '22 09:06 dcoliversun

Hi, it seems that in Spark 3.3.0 a validation was added to check that the executor pod name prefix is no longer than 47 characters.

We've seen that for scheduled applications the operator adds a long timestamp plus an ID before the "exec-id", so the prefix ends up longer than 47 characters and the validation fails pod creation. For example, "some-application-1657634168668185553-300e8181f2b26a1a-exec-1" results in this exception:

java.lang.IllegalArgumentException: 'some-application-1657634168668185553-300e8181f2b26a1a' in spark.kubernetes.executor.podNamePrefix is invalid. must conform https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#dns-label-names and the value length <= 47
    at org.apache.spark.internal.config.TypedConfigBuilder.$anonfun$checkValue$1(ConfigBuilder.scala:108) ~[spark-core_2.12-3.3.0.jar:3.3.0]
    at org.apache.spark.internal.config.TypedConfigBuilder.$anonfun$transform$1(ConfigBuilder.scala:101) ~[spark-core_2.12-3.3.0.jar:3.3.0]
    at scala.Option.map(Option.scala:230) ~[scala-library-2.12.12.jar:?]

shlomi-beinhoren avatar Jul 12 '22 17:07 shlomi-beinhoren

Hi @josecsotomorales! Are you able to use spark 3.3.0 without any issues?

devender-yadav avatar Aug 30 '22 12:08 devender-yadav

@devender-yadav everything worked fine, except for the issue noted above with scheduled Spark apps and pod creation failing on long names.

josecsotomorales avatar Aug 30 '22 13:08 josecsotomorales

@josecsotomorales how did you create the Spark 3.3.0 image so that it is compatible with gcr.io/spark-operator/spark:v3.1.1?

devender-yadav avatar Aug 30 '22 14:08 devender-yadav

Something like this would work; you just need to adjust this Dockerfile to add your Spark application jar (if applicable) and any required libraries:

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
ARG java_image_tag=11-jre-slim

FROM openjdk:${java_image_tag} as base

# ARG spark_uid=185

# Before building the docker image, first build and make a Spark distribution following
# the instructions in http://spark.apache.org/docs/latest/building-spark.html.
# If this docker file is being used in the context of building your images from a Spark
# distribution, the docker build command should be invoked from the top level directory
# of the Spark distribution. E.g.:
# docker build -t spark:latest -f kubernetes/dockerfiles/spark/Dockerfile .

RUN set -ex && \
    sed -i 's/http:\/\/deb.\(.*\)/https:\/\/deb.\1/g' /etc/apt/sources.list && \
    apt-get update && \
    ln -s /lib /lib64 && \
    apt install -y bash tini libc6 libpam-modules krb5-user libnss3 wget && \
    rm /bin/sh && \
    ln -sv /bin/bash /bin/sh && \
    echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su && \
    chgrp root /etc/passwd && chmod ug+rw /etc/passwd && \
    rm -rf /var/cache/apt/*

FROM base as spark
### Download Spark Distribution ###
WORKDIR /opt
RUN wget https://downloads.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.2.tgz
RUN tar xvf spark-3.3.0-bin-hadoop3.2.tgz

FROM spark as build
### Create target directories ###
RUN mkdir -p /opt/spark/jars

### Set Spark dir ARG for use Docker build context on root project dir ###
FROM base as final
ARG spark_dir=/opt/spark-3.3.0-bin-hadoop3.2

WORKDIR /opt/spark/work-dir
ENV SPARK_HOME /opt/spark

WORKDIR /opt/spark/work-dir
RUN chmod g+w /opt/spark/work-dir
RUN chmod a+x /opt/decom.sh

ENTRYPOINT [ "/opt/entrypoint.sh" ]

# Specify the User that the actual main process will run as
# USER ${spark_uid}

josecsotomorales avatar Sep 01 '22 16:09 josecsotomorales

hi,

Does anybody know if this pod name issue can be managed by any spark-operator parameter, or is the name template hardcoded? https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/api-docs.md

There seems to be a podName property within the DriverSpec, but it looks like it is just the whole name, not a template.

jmoyano-koa avatar Sep 05 '22 15:09 jmoyano-koa

@josecsotomorales I built the image according to this Dockerfile and it fails at this step. What script is this?

chmod: cannot access '/opt/decom.sh': No such file or directory

jiangjian0920 avatar Sep 21 '22 08:09 jiangjian0920

@jiangjian0920 try this one:

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
ARG java_image_tag=11-jre-slim

FROM openjdk:${java_image_tag} as base

# ARG spark_uid=185

# Before building the docker image, first build and make a Spark distribution following
# the instructions in http://spark.apache.org/docs/latest/building-spark.html.
# If this docker file is being used in the context of building your images from a Spark
# distribution, the docker build command should be invoked from the top level directory
# of the Spark distribution. E.g.:
# docker build -t spark:latest -f kubernetes/dockerfiles/spark/Dockerfile .

RUN set -ex && \
    sed -i 's/http:\/\/deb.\(.*\)/https:\/\/deb.\1/g' /etc/apt/sources.list && \
    apt-get update && \
    ln -s /lib /lib64 && \
    apt install -y bash tini libc6 libpam-modules krb5-user libnss3 wget && \
    rm /bin/sh && \
    ln -sv /bin/bash /bin/sh && \
    echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su && \
    chgrp root /etc/passwd && chmod ug+rw /etc/passwd && \
    rm -rf /var/cache/apt/*

FROM base as spark
### Download Spark Distribution ###
WORKDIR /opt
RUN wget https://archive.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
RUN tar xvf spark-3.3.0-bin-hadoop3.tgz

FROM spark as build
### Create target directories ###
RUN mkdir -p /opt/spark/jars

### Set Spark dir ARG for use Docker build context on root project dir ###
FROM base as final
ARG spark_dir=/opt/spark-3.3.0-bin-hadoop3

### Copy files from the build image ###
COPY --from=build ${spark_dir}/jars /opt/spark/jars
COPY --from=build ${spark_dir}/bin /opt/spark/bin
COPY --from=build ${spark_dir}/sbin /opt/spark/sbin
COPY --from=build ${spark_dir}/kubernetes/dockerfiles/spark/entrypoint.sh /opt/
COPY --from=build ${spark_dir}/kubernetes/dockerfiles/spark/decom.sh /opt/
COPY --from=build ${spark_dir}/examples /opt/spark/examples
COPY --from=build ${spark_dir}/kubernetes/tests /opt/spark/tests
COPY --from=build ${spark_dir}/data /opt/spark/data

WORKDIR /opt/spark/work-dir
ENV SPARK_HOME /opt/spark

WORKDIR /opt/spark/work-dir
RUN chmod g+w /opt/spark/work-dir
RUN chmod a+x /opt/decom.sh

ENTRYPOINT [ "/opt/entrypoint.sh" ]

# Specify the User that the actual main process will run as
# USER ${spark_uid}

Note: You can also use the original one as a base to build your custom image: https://github.com/apache/spark/blob/branch-3.3/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile

josecsotomorales avatar Sep 21 '22 20:09 josecsotomorales

@josecsotomorales thanks for that, I want to build an image that supports version 3.1.2, I'll try again.

jiangjian0920 avatar Sep 22 '22 01:09 jiangjian0920

@jiangjian0920 just use the one I provided and change the Spark version, and you should be good to go 👍🏻

josecsotomorales avatar Sep 22 '22 01:09 josecsotomorales

@josecsotomorales Have you ever tried changing the user of the Spark image? I changed the user of the 2.4.0 image and then the following error is reported. Have you ever encountered it?

spark-error.log:
Exception in thread "main" java.lang.NoClassDefFoundError: io/fabric8/kubernetes/client/KubernetesClient

jiangjian0920 avatar Oct 17 '22 00:10 jiangjian0920

@shlomi-beinhoren Hi,

I want to know how to set the executor pod name prefix via the Spark Operator. https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/api-docs.md#executorspec I didn't find anything in the ExecutorSpec page related to the executor pod name prefix. Or did you use Spark native properties? I want to match my driver and executors but am having trouble. Thanks!

harryzhang2016 avatar Nov 02 '22 06:11 harryzhang2016

Hi @harryzhang2016, I did it with Spark native properties under the sparkConf section described here: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/user-guide.md#specifying-spark-configuration, adding the property "spark.kubernetes.executor.podNamePrefix".

shlomi-beinhoren avatar Nov 02 '22 07:11 shlomi-beinhoren
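
For reference, that workaround amounts to setting the property under the sparkConf section of the SparkApplication spec. A minimal sketch of the relevant fragment (the "my-app" value is a placeholder prefix and must itself stay within the 47-character limit):

spec:
  sparkConf:
    # Placeholder prefix; keep it short enough to satisfy Spark 3.3's length check
    "spark.kubernetes.executor.podNamePrefix": "my-app"

Entries under sparkConf are passed through to spark-submit as --conf options, so an explicit prefix like this should take precedence over the long generated one.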

Spark 3.3.0 testing updates:

I built a custom Spark 3.3 image and I'm using it with the latest operator image; no issues so far except for the 47-character pod name prefix limitation.

josecsotomorales avatar Nov 14 '22 13:11 josecsotomorales

This is the Dockerfile for the custom Spark 3.3 image, if you folks want to use it:

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
ARG java_image_tag=17-jre

FROM eclipse-temurin:${java_image_tag} as base

# ARG spark_uid=185

# Before building the docker image, first build and make a Spark distribution following
# the instructions in http://spark.apache.org/docs/latest/building-spark.html.
# If this docker file is being used in the context of building your images from a Spark
# distribution, the docker build command should be invoked from the top level directory
# of the Spark distribution. E.g.:
# docker build -t spark:latest -f kubernetes/dockerfiles/spark/Dockerfile .

RUN set -ex && \
    apt-get update && \
    ln -s /lib /lib64 && \
    apt install -y bash tini libc6 libpam-modules krb5-user libnss3 procps net-tools && \
    rm /bin/sh && \
    ln -sv /bin/bash /bin/sh && \
    echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su && \
    chgrp root /etc/passwd && chmod ug+rw /etc/passwd && \
    rm -rf /var/cache/apt/* && rm -rf /var/lib/apt/lists/*

FROM base as spark
### Download Spark Distribution ###
WORKDIR /opt
RUN wget https://archive.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
RUN tar xvf spark-3.3.0-bin-hadoop3.tgz

FROM spark as build
### Create target directories ###
RUN mkdir -p /opt/spark/jars

### Set Spark dir ARG for use Docker build context on root project dir ###
FROM base as final
ARG spark_dir=/opt/spark-3.3.0-bin-hadoop3

### Copy files from the build image ###
COPY --from=build ${spark_dir}/jars /opt/spark/jars
COPY --from=build /opt/spark/jars /opt/spark/jars
COPY --from=build ${spark_dir}/bin /opt/spark/bin
COPY --from=build ${spark_dir}/sbin /opt/spark/sbin
COPY --from=build ${spark_dir}/kubernetes/dockerfiles/spark/entrypoint.sh /opt/
COPY --from=build ${spark_dir}/kubernetes/dockerfiles/spark/decom.sh /opt/
COPY --from=build ${spark_dir}/examples /opt/spark/examples
COPY --from=build ${spark_dir}/kubernetes/tests /opt/spark/tests
COPY --from=build ${spark_dir}/data /opt/spark/data

WORKDIR /opt/spark/work-dir
ENV SPARK_HOME /opt/spark

WORKDIR /opt/spark/work-dir
RUN chmod g+w /opt/spark/work-dir
RUN chmod a+x /opt/decom.sh

ENTRYPOINT [ "/opt/entrypoint.sh" ]

# Specify the User that the actual main process will run as
# USER ${spark_uid}

josecsotomorales avatar Nov 14 '22 13:11 josecsotomorales

@josecsotomorales What would be the steps to make it work? Just build this image and set it in the SparkApplication YAML?

renanxx1 avatar Nov 14 '22 20:11 renanxx1

@renanxx1 that's it :)

josecsotomorales avatar Nov 14 '22 21:11 josecsotomorales
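
For anyone following along, a minimal SparkApplication manifest along those lines might look roughly like this (the image name, namespace, and service account are placeholders for your own setup):

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
spec:
  type: Scala
  mode: cluster
  # Custom Spark 3.3.0 image built from the Dockerfile above (placeholder registry)
  image: "my-registry/spark:3.3.0"
  imagePullPolicy: IfNotPresent
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.3.0.jar"
  sparkVersion: "3.3.0"
  driver:
    cores: 1
    memory: "512m"
    serviceAccount: spark
  executor:
    cores: 1
    instances: 2
    memory: "512m"

Apply it with kubectl apply -f and the operator should launch the driver and executors with the custom image.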

@josecsotomorales I used the image gcr.io/spark-operator/spark-py:v3.1.1-hadoop3 as a base, built the image, and used it in the Spark YAML. But when I try to run the Spark job I'm getting the error:

from pyspark.sql import SparkSession
ModuleNotFoundError: No module named 'pyspark'

What could I be missing?

renanxx1 avatar Nov 14 '22 22:11 renanxx1

Any reason there hasn't been an official image with an updated Spark version for over a year now? Is the spark operator properly maintained?

assaf-xm avatar Dec 08 '22 06:12 assaf-xm

As far as I can tell, this repo is not actively maintained by the GoogleCloudPlatform team. I pinged them on LinkedIn once to suggest that they collaborate with, or gift it to, the Apache Software Foundation, because the source code has also started to diverge from the upstream apache/spark k8s resource manager, judging from the PR I opened half a year ago (still no review on that PR).

tafaust avatar Dec 08 '22 07:12 tafaust

I'm considering forking this repo and maintaining it; I already asked on the Spark Operator Slack channel what the next steps are.

josecsotomorales avatar Dec 08 '22 13:12 josecsotomorales

Sorry to bother, but after building the custom Spark 3.3 image from the Dockerfile above and using it with the spark operator, the pod keeps restarting with the following message:

"[FATAL tini (16)] exec driver-py failed: No such file or directory" with spark-py:v3.0.0 container image

How could I solve this? I changed Spark to 3.3.1.

matheus-rossi avatar Feb 04 '23 01:02 matheus-rossi

Hey @matheus-rossi, how's it going? I'm using this one, you can try it.

FROM apache/spark:v3.3.1

USER 185
WORKDIR /
USER 0

RUN apt-get update && apt install -y python3 python3-pip && pip3 install --upgrade pip setuptools && rm -r /root/.cache && rm -rf /var/cache/apt/*
RUN pip3 install pyspark==3.3.1

WORKDIR /opt/spark/work-dir

ENTRYPOINT ["/opt/entrypoint.sh"]

ARG spark_uid=185
USER 185

renanxx1 avatar Feb 04 '23 01:02 renanxx1
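
With a pyspark-capable image like that, a Python application can reference it in the same way. A rough sketch (image name, script path, and service account are placeholders; the script is assumed to be baked into the image):

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: pyspark-example
spec:
  type: Python
  pythonVersion: "3"
  mode: cluster
  # Image built from the Dockerfile above (placeholder registry)
  image: "my-registry/spark-py:3.3.1"
  # Hypothetical PySpark script copied into the image
  mainApplicationFile: "local:///opt/spark/work-dir/my_app.py"
  sparkVersion: "3.3.1"
  driver:
    cores: 1
    memory: "512m"
    serviceAccount: spark
  executor:
    cores: 1
    instances: 1
    memory: "512m"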

Thanks Renan, that works for me

matheus-rossi avatar Feb 09 '23 16:02 matheus-rossi

@josecsotomorales any news on the fork? It is hilarious that Google can't update Spark to even 3.2, despite having working code from contributors, a year after the 3.2 release...

j-adamczyk avatar Feb 14 '23 10:02 j-adamczyk

Also looking for an update on this. @josecsotomorales did you decide to fork?

csawtelle avatar May 13 '23 15:05 csawtelle

+1 on this issue

mrendi29 avatar May 17 '23 03:05 mrendi29

+1

andreyolv avatar Oct 22 '23 17:10 andreyolv