official-images
Add Apache Spark Docker Official Image
This patch adds the Apache Spark Docker Official Image.
Checklist for Review
NOTE: This checklist is intended for the use of the Official Images maintainers both to track the status of your PR and to help inform you and others of where we're at. As such, please leave the "checking" of items to the repository maintainers. If there is a point below for which you would like to provide additional information or note completion, please do so by commenting on the PR. Thanks! (and thanks for staying patient with us :heart:)
- [x] associated with or contacted upstream? VOTE: passed in Apache Spark Community
- [x] available under an OSI-approved license? Apache License 2.0
- [x] does it fit into one of the common categories? ("service", "language stack", "base distribution") service
- [x] is it reasonably popular, or does it solve a particular use case well? https://spark.apache.org/
- [x] does a documentation PR exist? (should be reviewed and merged at roughly the same time so that we don't have an empty image page on the Hub for very long) https://github.com/docker-library/docs/pull/2200
- [ ] official-images maintainer dockerization review for best practices and cache gotchas/improvements (ala the official review guidelines)?
- [ ] 2+ official-images maintainer dockerization review?
- [x] existing official images have been considered as a base? (ie, if `foobar` needs Node.js, has `FROM node:...` instead of grabbing `node` via other means been considered?) eclipse-temurin
- ~~[ ] if `FROM scratch`, tarballs only exist in a single commit within the associated history?~~
- [x] passes current tests? any simple new tests that might be appropriate to add? (https://github.com/docker-library/official-images/tree/master/test) Extra integration test in repo
Finally, we will move from `Yikun/spark-docker` to `apache/spark-docker` as suggested in https://github.com/docker-library/official-images/issues/13057.
Currently, the Dockerfile is maintained by Apache Spark. This Dockerfile (https://github.com/Yikun/spark-docker/tree/master/3.3.0/scala2.12-java11-focal) was adapted from the linked one.
We have made some improvements as requested by the official-images guidelines, but we don't know how many gaps remain. As a next step, we will have further discussions in the Spark community to make it happen.
So, could you help with a preliminary review of the Dockerfile and entrypoint? @tianon
Friendly ping @tianon @yosifkit
- associated with or contacted upstream? [DISCUSS] SPIP: Support Docker Official Image for Spark.
- available under an OSI-approved license? Apache 2.0
- does it fit into one of the common categories? service
- is it reasonably popular, or does it solve a particular use case well? https://spark.apache.org/
- does a documentation PR exist? https://github.com/docker-library/docs/pull/2200
- existing official images have been considered as a base? Yes, eclipse-temurin
A quick update (TL;DR: docker official image vote passed in Apache Spark community):
- Sep. 2022: The vote for Apache Spark DOI support passed in the Apache Spark community.
- Oct. 2022: The apache/spark-docker repo has been created, and this PR has been updated to point to the official Spark repo.
@tianon @yosifkit So, it's ready for review now.
I know approval of new images can be slow; would you mind giving a rough idea of when the review might start? We (the Apache Spark community) will spare no effort to help with DOI support, many thanks!
also cc @HyukjinKwon @zhengruifeng
Quick update (TL;DR: added more tests and scripts to ensure the quality of the official image Dockerfiles):
- Added a Spark on K8s integration test to ensure Docker image quality: https://github.com/apache/spark-docker/pull/9
- Used a template to generate the Dockerfiles to avoid low-level typos: https://github.com/apache/spark-docker/pull/12
- Applied some suggestions according to https://github.com/docker-library/official-images#review-guidelines
I believe we are ready. @tianon @yosifkit
BTW, review note: https://github.com/apache/spark-docker/tree/master/3.3.0/scala2.12-java11-python3-r-ubuntu is the main Dockerfile; the remaining three Dockerfiles are just subsets of it.
+1 for an official Docker image for Apache Spark!
Diff for c0667979b3caa153ef19e55d9da3fcc0a498989e:
diff --git a/_bashbrew-cat b/_bashbrew-cat
index bdfae4a..675d31c 100644
--- a/_bashbrew-cat
+++ b/_bashbrew-cat
@@ -1 +1,22 @@
-Maintainers: New Image! :D (@docker-library-bot)
+Maintainers: Apache Spark Developers <[email protected]> (@ApacheSpark)
+GitRepo: https://github.com/apache/spark-docker.git
+
+Tags: 3.3.0-scala2.12-java11-python3-r-ubuntu
+Architectures: amd64, arm64v8
+GitCommit: f488d732d254caa78c1e1a2ef74958e6c867dad6
+Directory: 3.3.0/scala2.12-java11-python3-r-ubuntu
+
+Tags: 3.3.0-scala2.12-java11-python3-ubuntu, 3.3.0-python3, python3, 3.3.0, latest
+Architectures: amd64, arm64v8
+GitCommit: f488d732d254caa78c1e1a2ef74958e6c867dad6
+Directory: 3.3.0/scala2.12-java11-python3-ubuntu
+
+Tags: 3.3.0-scala2.12-java11-r-ubuntu, 3.3.0-r, r
+Architectures: amd64, arm64v8
+GitCommit: f488d732d254caa78c1e1a2ef74958e6c867dad6
+Directory: 3.3.0/scala2.12-java11-r-ubuntu
+
+Tags: 3.3.0-scala2.12-java11-ubuntu, 3.3.0-scala, scala
+Architectures: amd64, arm64v8
+GitCommit: f488d732d254caa78c1e1a2ef74958e6c867dad6
+Directory: 3.3.0/scala2.12-java11-ubuntu
diff --git a/_bashbrew-list b/_bashbrew-list
index e69de29..4a492fe 100644
--- a/_bashbrew-list
+++ b/_bashbrew-list
@@ -0,0 +1,12 @@
+spark:3.3.0
+spark:3.3.0-python3
+spark:3.3.0-r
+spark:3.3.0-scala
+spark:3.3.0-scala2.12-java11-python3-r-ubuntu
+spark:3.3.0-scala2.12-java11-python3-ubuntu
+spark:3.3.0-scala2.12-java11-r-ubuntu
+spark:3.3.0-scala2.12-java11-ubuntu
+spark:latest
+spark:python3
+spark:r
+spark:scala
diff --git a/spark_3.3.0-scala2.12-java11-python3-r-ubuntu/Dockerfile b/spark_3.3.0-scala2.12-java11-python3-r-ubuntu/Dockerfile
new file mode 100644
index 0000000..fb48b80
--- /dev/null
+++ b/spark_3.3.0-scala2.12-java11-python3-r-ubuntu/Dockerfile
@@ -0,0 +1,86 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+FROM eclipse-temurin:11-jre-focal
+
+ARG spark_uid=185
+
+RUN groupadd --system --gid=${spark_uid} spark && \
+ useradd --system --uid=${spark_uid} --gid=spark spark
+
+RUN set -ex && \
+ apt-get update && \
+ ln -s /lib /lib64 && \
+ apt install -y gnupg2 wget bash tini libc6 libpam-modules krb5-user libnss3 procps net-tools gosu && \
+ apt install -y python3 python3-pip && \
+ apt install -y r-base r-base-dev && \
+ mkdir -p /opt/spark && \
+ mkdir /opt/spark/python && \
+ mkdir -p /opt/spark/examples && \
+ mkdir -p /opt/spark/work-dir && \
+ touch /opt/spark/RELEASE && \
+ chown -R spark:spark /opt/spark && \
+ rm /bin/sh && \
+ ln -sv /bin/bash /bin/sh && \
+ echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su && \
+ chgrp root /etc/passwd && chmod ug+rw /etc/passwd && \
+ rm -rf /var/cache/apt/* && \
+ rm -rf /var/lib/apt/lists/*
+
+# Install Apache Spark
+# https://downloads.apache.org/spark/KEYS
+ENV SPARK_TGZ_URL=https://archive.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz \
+ SPARK_TGZ_ASC_URL=https://archive.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz.asc \
+ GPG_KEY=80FB8EBE8EBA68504989703491B5DC815DBF10D3
+
+RUN set -ex; \
+ export SPARK_TMP="$(mktemp -d)"; \
+ cd $SPARK_TMP; \
+ wget -nv -O spark.tgz "$SPARK_TGZ_URL"; \
+ wget -nv -O spark.tgz.asc "$SPARK_TGZ_ASC_URL"; \
+ export GNUPGHOME="$(mktemp -d)"; \
+ gpg --keyserver hkps://keys.openpgp.org --recv-key "$GPG_KEY" || \
+ gpg --keyserver hkps://keyserver.ubuntu.com --recv-keys "$GPG_KEY"; \
+ gpg --batch --verify spark.tgz.asc spark.tgz; \
+ gpgconf --kill all; \
+ rm -rf "$GNUPGHOME" spark.tgz.asc; \
+ \
+ tar -xf spark.tgz --strip-components=1; \
+ chown -R spark:spark .; \
+ mv jars /opt/spark/; \
+ mv bin /opt/spark/; \
+ mv sbin /opt/spark/; \
+ mv kubernetes/dockerfiles/spark/decom.sh /opt/; \
+ mv examples /opt/spark/; \
+ mv kubernetes/tests /opt/spark/; \
+ mv data /opt/spark/; \
+ mv python/pyspark /opt/spark/python/pyspark/; \
+ mv python/lib /opt/spark/python/lib/; \
+ mv R /opt/spark/; \
+ cd ..; \
+ rm -rf "$SPARK_TMP";
+
+COPY entrypoint.sh /opt/
+
+ENV SPARK_HOME /opt/spark
+ENV R_HOME /usr/lib/R
+
+WORKDIR /opt/spark/work-dir
+RUN chmod g+w /opt/spark/work-dir
+RUN chmod a+x /opt/decom.sh
+RUN chmod a+x /opt/entrypoint.sh
+
+ENTRYPOINT [ "/opt/entrypoint.sh" ]
diff --git a/spark_3.3.0-scala2.12-java11-python3-r-ubuntu/entrypoint.sh b/spark_3.3.0-scala2.12-java11-python3-r-ubuntu/entrypoint.sh
new file mode 100644
index 0000000..4bb1557
--- /dev/null
+++ b/spark_3.3.0-scala2.12-java11-python3-r-ubuntu/entrypoint.sh
@@ -0,0 +1,114 @@
+#!/bin/bash
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Check whether there is a passwd entry for the container UID
+myuid=$(id -u)
+mygid=$(id -g)
+# turn off -e for getent because it will return error code in anonymous uid case
+set +e
+uidentry=$(getent passwd $myuid)
+set -e
+
+# If there is no passwd entry for the container UID, attempt to create one
+if [ -z "$uidentry" ] ; then
+ if [ -w /etc/passwd ] ; then
+ echo "$myuid:x:$myuid:$mygid:${SPARK_USER_NAME:-anonymous uid}:$SPARK_HOME:/bin/false" >> /etc/passwd
+ else
+ echo "Container ENTRYPOINT failed to add passwd entry for anonymous UID"
+ fi
+fi
+
+if [ -z "$JAVA_HOME" ]; then
+ JAVA_HOME=$(java -XshowSettings:properties -version 2>&1 > /dev/null | grep 'java.home' | awk '{print $3}')
+fi
+
+SPARK_CLASSPATH="$SPARK_CLASSPATH:${SPARK_HOME}/jars/*"
+env | grep SPARK_JAVA_OPT_ | sort -t_ -k4 -n | sed 's/[^=]*=\(.*\)/\1/g' > /tmp/java_opts.txt
+readarray -t SPARK_EXECUTOR_JAVA_OPTS < /tmp/java_opts.txt
+
+if [ -n "$SPARK_EXTRA_CLASSPATH" ]; then
+ SPARK_CLASSPATH="$SPARK_CLASSPATH:$SPARK_EXTRA_CLASSPATH"
+fi
+
+if ! [ -z ${PYSPARK_PYTHON+x} ]; then
+ export PYSPARK_PYTHON
+fi
+if ! [ -z ${PYSPARK_DRIVER_PYTHON+x} ]; then
+ export PYSPARK_DRIVER_PYTHON
+fi
+
+# If HADOOP_HOME is set and SPARK_DIST_CLASSPATH is not set, set it here so Hadoop jars are available to the executor.
+# It does not set SPARK_DIST_CLASSPATH if already set, to avoid overriding customizations of this value from elsewhere e.g. Docker/K8s.
+if [ -n "${HADOOP_HOME}" ] && [ -z "${SPARK_DIST_CLASSPATH}" ]; then
+ export SPARK_DIST_CLASSPATH="$($HADOOP_HOME/bin/hadoop classpath)"
+fi
+
+if ! [ -z ${HADOOP_CONF_DIR+x} ]; then
+ SPARK_CLASSPATH="$HADOOP_CONF_DIR:$SPARK_CLASSPATH";
+fi
+
+if ! [ -z ${SPARK_CONF_DIR+x} ]; then
+ SPARK_CLASSPATH="$SPARK_CONF_DIR:$SPARK_CLASSPATH";
+elif ! [ -z ${SPARK_HOME+x} ]; then
+ SPARK_CLASSPATH="$SPARK_HOME/conf:$SPARK_CLASSPATH";
+fi
+
+case "$1" in
+ driver)
+ shift 1
+ CMD=(
+ "$SPARK_HOME/bin/spark-submit"
+ --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS"
+ --deploy-mode client
+ "$@"
+ )
+ ;;
+ executor)
+ shift 1
+ CMD=(
+ ${JAVA_HOME}/bin/java
+ "${SPARK_EXECUTOR_JAVA_OPTS[@]}"
+ -Xms$SPARK_EXECUTOR_MEMORY
+ -Xmx$SPARK_EXECUTOR_MEMORY
+ -cp "$SPARK_CLASSPATH:$SPARK_DIST_CLASSPATH"
+ org.apache.spark.scheduler.cluster.k8s.KubernetesExecutorBackend
+ --driver-url $SPARK_DRIVER_URL
+ --executor-id $SPARK_EXECUTOR_ID
+ --cores $SPARK_EXECUTOR_CORES
+ --app-id $SPARK_APPLICATION_ID
+ --hostname $SPARK_EXECUTOR_POD_IP
+ --resourceProfileId $SPARK_RESOURCE_PROFILE_ID
+ --podName $SPARK_EXECUTOR_POD_NAME
+ )
+ ;;
+
+ *)
+ # Non-spark-on-k8s command provided, proceeding in pass-through mode...
+ CMD=("$@")
+ ;;
+esac
+
+# Switch to spark if no USER specified (root by default) otherwise use USER directly
+switch_spark_if_root() {
+ if [ $(id -u) -eq 0 ]; then
+ echo gosu spark
+ fi
+}
+
+# Execute the container CMD under tini for better hygiene
+exec $(switch_spark_if_root) /usr/bin/tini -s -- "${CMD[@]}"
diff --git a/spark_latest/Dockerfile b/spark_latest/Dockerfile
new file mode 100644
index 0000000..1b6a02c
--- /dev/null
+++ b/spark_latest/Dockerfile
@@ -0,0 +1,83 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+FROM eclipse-temurin:11-jre-focal
+
+ARG spark_uid=185
+
+RUN groupadd --system --gid=${spark_uid} spark && \
+ useradd --system --uid=${spark_uid} --gid=spark spark
+
+RUN set -ex && \
+ apt-get update && \
+ ln -s /lib /lib64 && \
+ apt install -y gnupg2 wget bash tini libc6 libpam-modules krb5-user libnss3 procps net-tools gosu && \
+ apt install -y python3 python3-pip && \
+ mkdir -p /opt/spark && \
+ mkdir /opt/spark/python && \
+ mkdir -p /opt/spark/examples && \
+ mkdir -p /opt/spark/work-dir && \
+ touch /opt/spark/RELEASE && \
+ chown -R spark:spark /opt/spark && \
+ rm /bin/sh && \
+ ln -sv /bin/bash /bin/sh && \
+ echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su && \
+ chgrp root /etc/passwd && chmod ug+rw /etc/passwd && \
+ rm -rf /var/cache/apt/* && \
+ rm -rf /var/lib/apt/lists/*
+
+# Install Apache Spark
+# https://downloads.apache.org/spark/KEYS
+ENV SPARK_TGZ_URL=https://archive.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz \
+ SPARK_TGZ_ASC_URL=https://archive.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz.asc \
+ GPG_KEY=80FB8EBE8EBA68504989703491B5DC815DBF10D3
+
+RUN set -ex; \
+ export SPARK_TMP="$(mktemp -d)"; \
+ cd $SPARK_TMP; \
+ wget -nv -O spark.tgz "$SPARK_TGZ_URL"; \
+ wget -nv -O spark.tgz.asc "$SPARK_TGZ_ASC_URL"; \
+ export GNUPGHOME="$(mktemp -d)"; \
+ gpg --keyserver hkps://keys.openpgp.org --recv-key "$GPG_KEY" || \
+ gpg --keyserver hkps://keyserver.ubuntu.com --recv-keys "$GPG_KEY"; \
+ gpg --batch --verify spark.tgz.asc spark.tgz; \
+ gpgconf --kill all; \
+ rm -rf "$GNUPGHOME" spark.tgz.asc; \
+ \
+ tar -xf spark.tgz --strip-components=1; \
+ chown -R spark:spark .; \
+ mv jars /opt/spark/; \
+ mv bin /opt/spark/; \
+ mv sbin /opt/spark/; \
+ mv kubernetes/dockerfiles/spark/decom.sh /opt/; \
+ mv examples /opt/spark/; \
+ mv kubernetes/tests /opt/spark/; \
+ mv data /opt/spark/; \
+ mv python/pyspark /opt/spark/python/pyspark/; \
+ mv python/lib /opt/spark/python/lib/; \
+ cd ..; \
+ rm -rf "$SPARK_TMP";
+
+COPY entrypoint.sh /opt/
+
+ENV SPARK_HOME /opt/spark
+
+WORKDIR /opt/spark/work-dir
+RUN chmod g+w /opt/spark/work-dir
+RUN chmod a+x /opt/decom.sh
+RUN chmod a+x /opt/entrypoint.sh
+
+ENTRYPOINT [ "/opt/entrypoint.sh" ]
diff --git a/spark_latest/entrypoint.sh b/spark_latest/entrypoint.sh
new file mode 100644
index 0000000..4bb1557
--- /dev/null
+++ b/spark_latest/entrypoint.sh
@@ -0,0 +1,114 @@
+#!/bin/bash
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Check whether there is a passwd entry for the container UID
+myuid=$(id -u)
+mygid=$(id -g)
+# turn off -e for getent because it will return error code in anonymous uid case
+set +e
+uidentry=$(getent passwd $myuid)
+set -e
+
+# If there is no passwd entry for the container UID, attempt to create one
+if [ -z "$uidentry" ] ; then
+ if [ -w /etc/passwd ] ; then
+ echo "$myuid:x:$myuid:$mygid:${SPARK_USER_NAME:-anonymous uid}:$SPARK_HOME:/bin/false" >> /etc/passwd
+ else
+ echo "Container ENTRYPOINT failed to add passwd entry for anonymous UID"
+ fi
+fi
+
+if [ -z "$JAVA_HOME" ]; then
+ JAVA_HOME=$(java -XshowSettings:properties -version 2>&1 > /dev/null | grep 'java.home' | awk '{print $3}')
+fi
+
+SPARK_CLASSPATH="$SPARK_CLASSPATH:${SPARK_HOME}/jars/*"
+env | grep SPARK_JAVA_OPT_ | sort -t_ -k4 -n | sed 's/[^=]*=\(.*\)/\1/g' > /tmp/java_opts.txt
+readarray -t SPARK_EXECUTOR_JAVA_OPTS < /tmp/java_opts.txt
+
+if [ -n "$SPARK_EXTRA_CLASSPATH" ]; then
+ SPARK_CLASSPATH="$SPARK_CLASSPATH:$SPARK_EXTRA_CLASSPATH"
+fi
+
+if ! [ -z ${PYSPARK_PYTHON+x} ]; then
+ export PYSPARK_PYTHON
+fi
+if ! [ -z ${PYSPARK_DRIVER_PYTHON+x} ]; then
+ export PYSPARK_DRIVER_PYTHON
+fi
+
+# If HADOOP_HOME is set and SPARK_DIST_CLASSPATH is not set, set it here so Hadoop jars are available to the executor.
+# It does not set SPARK_DIST_CLASSPATH if already set, to avoid overriding customizations of this value from elsewhere e.g. Docker/K8s.
+if [ -n "${HADOOP_HOME}" ] && [ -z "${SPARK_DIST_CLASSPATH}" ]; then
+ export SPARK_DIST_CLASSPATH="$($HADOOP_HOME/bin/hadoop classpath)"
+fi
+
+if ! [ -z ${HADOOP_CONF_DIR+x} ]; then
+ SPARK_CLASSPATH="$HADOOP_CONF_DIR:$SPARK_CLASSPATH";
+fi
+
+if ! [ -z ${SPARK_CONF_DIR+x} ]; then
+ SPARK_CLASSPATH="$SPARK_CONF_DIR:$SPARK_CLASSPATH";
+elif ! [ -z ${SPARK_HOME+x} ]; then
+ SPARK_CLASSPATH="$SPARK_HOME/conf:$SPARK_CLASSPATH";
+fi
+
+case "$1" in
+ driver)
+ shift 1
+ CMD=(
+ "$SPARK_HOME/bin/spark-submit"
+ --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS"
+ --deploy-mode client
+ "$@"
+ )
+ ;;
+ executor)
+ shift 1
+ CMD=(
+ ${JAVA_HOME}/bin/java
+ "${SPARK_EXECUTOR_JAVA_OPTS[@]}"
+ -Xms$SPARK_EXECUTOR_MEMORY
+ -Xmx$SPARK_EXECUTOR_MEMORY
+ -cp "$SPARK_CLASSPATH:$SPARK_DIST_CLASSPATH"
+ org.apache.spark.scheduler.cluster.k8s.KubernetesExecutorBackend
+ --driver-url $SPARK_DRIVER_URL
+ --executor-id $SPARK_EXECUTOR_ID
+ --cores $SPARK_EXECUTOR_CORES
+ --app-id $SPARK_APPLICATION_ID
+ --hostname $SPARK_EXECUTOR_POD_IP
+ --resourceProfileId $SPARK_RESOURCE_PROFILE_ID
+ --podName $SPARK_EXECUTOR_POD_NAME
+ )
+ ;;
+
+ *)
+ # Non-spark-on-k8s command provided, proceeding in pass-through mode...
+ CMD=("$@")
+ ;;
+esac
+
+# Switch to spark if no USER specified (root by default) otherwise use USER directly
+switch_spark_if_root() {
+ if [ $(id -u) -eq 0 ]; then
+ echo gosu spark
+ fi
+}
+
+# Execute the container CMD under tini for better hygiene
+exec $(switch_spark_if_root) /usr/bin/tini -s -- "${CMD[@]}"
diff --git a/spark_r/Dockerfile b/spark_r/Dockerfile
new file mode 100644
index 0000000..0b49aef
--- /dev/null
+++ b/spark_r/Dockerfile
@@ -0,0 +1,82 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+FROM eclipse-temurin:11-jre-focal
+
+ARG spark_uid=185
+
+RUN groupadd --system --gid=${spark_uid} spark && \
+ useradd --system --uid=${spark_uid} --gid=spark spark
+
+RUN set -ex && \
+ apt-get update && \
+ ln -s /lib /lib64 && \
+ apt install -y gnupg2 wget bash tini libc6 libpam-modules krb5-user libnss3 procps net-tools gosu && \
+ apt install -y r-base r-base-dev && \
+ mkdir -p /opt/spark && \
+ mkdir -p /opt/spark/examples && \
+ mkdir -p /opt/spark/work-dir && \
+ touch /opt/spark/RELEASE && \
+ chown -R spark:spark /opt/spark && \
+ rm /bin/sh && \
+ ln -sv /bin/bash /bin/sh && \
+ echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su && \
+ chgrp root /etc/passwd && chmod ug+rw /etc/passwd && \
+ rm -rf /var/cache/apt/* && \
+ rm -rf /var/lib/apt/lists/*
+
+# Install Apache Spark
+# https://downloads.apache.org/spark/KEYS
+ENV SPARK_TGZ_URL=https://archive.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz \
+ SPARK_TGZ_ASC_URL=https://archive.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz.asc \
+ GPG_KEY=80FB8EBE8EBA68504989703491B5DC815DBF10D3
+
+RUN set -ex; \
+ export SPARK_TMP="$(mktemp -d)"; \
+ cd $SPARK_TMP; \
+ wget -nv -O spark.tgz "$SPARK_TGZ_URL"; \
+ wget -nv -O spark.tgz.asc "$SPARK_TGZ_ASC_URL"; \
+ export GNUPGHOME="$(mktemp -d)"; \
+ gpg --keyserver hkps://keys.openpgp.org --recv-key "$GPG_KEY" || \
+ gpg --keyserver hkps://keyserver.ubuntu.com --recv-keys "$GPG_KEY"; \
+ gpg --batch --verify spark.tgz.asc spark.tgz; \
+ gpgconf --kill all; \
+ rm -rf "$GNUPGHOME" spark.tgz.asc; \
+ \
+ tar -xf spark.tgz --strip-components=1; \
+ chown -R spark:spark .; \
+ mv jars /opt/spark/; \
+ mv bin /opt/spark/; \
+ mv sbin /opt/spark/; \
+ mv kubernetes/dockerfiles/spark/decom.sh /opt/; \
+ mv examples /opt/spark/; \
+ mv kubernetes/tests /opt/spark/; \
+ mv data /opt/spark/; \
+ mv R /opt/spark/; \
+ cd ..; \
+ rm -rf "$SPARK_TMP";
+
+COPY entrypoint.sh /opt/
+
+ENV SPARK_HOME /opt/spark
+ENV R_HOME /usr/lib/R
+
+WORKDIR /opt/spark/work-dir
+RUN chmod g+w /opt/spark/work-dir
+RUN chmod a+x /opt/decom.sh
+RUN chmod a+x /opt/entrypoint.sh
+
+ENTRYPOINT [ "/opt/entrypoint.sh" ]
diff --git a/spark_r/entrypoint.sh b/spark_r/entrypoint.sh
new file mode 100644
index 0000000..4bb1557
--- /dev/null
+++ b/spark_r/entrypoint.sh
@@ -0,0 +1,114 @@
+#!/bin/bash
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Check whether there is a passwd entry for the container UID
+myuid=$(id -u)
+mygid=$(id -g)
+# turn off -e for getent because it will return error code in anonymous uid case
+set +e
+uidentry=$(getent passwd $myuid)
+set -e
+
+# If there is no passwd entry for the container UID, attempt to create one
+if [ -z "$uidentry" ] ; then
+ if [ -w /etc/passwd ] ; then
+ echo "$myuid:x:$myuid:$mygid:${SPARK_USER_NAME:-anonymous uid}:$SPARK_HOME:/bin/false" >> /etc/passwd
+ else
+ echo "Container ENTRYPOINT failed to add passwd entry for anonymous UID"
+ fi
+fi
+
+if [ -z "$JAVA_HOME" ]; then
+ JAVA_HOME=$(java -XshowSettings:properties -version 2>&1 > /dev/null | grep 'java.home' | awk '{print $3}')
+fi
+
+SPARK_CLASSPATH="$SPARK_CLASSPATH:${SPARK_HOME}/jars/*"
+env | grep SPARK_JAVA_OPT_ | sort -t_ -k4 -n | sed 's/[^=]*=\(.*\)/\1/g' > /tmp/java_opts.txt
+readarray -t SPARK_EXECUTOR_JAVA_OPTS < /tmp/java_opts.txt
+
+if [ -n "$SPARK_EXTRA_CLASSPATH" ]; then
+ SPARK_CLASSPATH="$SPARK_CLASSPATH:$SPARK_EXTRA_CLASSPATH"
+fi
+
+if ! [ -z ${PYSPARK_PYTHON+x} ]; then
+ export PYSPARK_PYTHON
+fi
+if ! [ -z ${PYSPARK_DRIVER_PYTHON+x} ]; then
+ export PYSPARK_DRIVER_PYTHON
+fi
+
+# If HADOOP_HOME is set and SPARK_DIST_CLASSPATH is not set, set it here so Hadoop jars are available to the executor.
+# It does not set SPARK_DIST_CLASSPATH if already set, to avoid overriding customizations of this value from elsewhere e.g. Docker/K8s.
+if [ -n "${HADOOP_HOME}" ] && [ -z "${SPARK_DIST_CLASSPATH}" ]; then
+ export SPARK_DIST_CLASSPATH="$($HADOOP_HOME/bin/hadoop classpath)"
+fi
+
+if ! [ -z ${HADOOP_CONF_DIR+x} ]; then
+ SPARK_CLASSPATH="$HADOOP_CONF_DIR:$SPARK_CLASSPATH";
+fi
+
+if ! [ -z ${SPARK_CONF_DIR+x} ]; then
+ SPARK_CLASSPATH="$SPARK_CONF_DIR:$SPARK_CLASSPATH";
+elif ! [ -z ${SPARK_HOME+x} ]; then
+ SPARK_CLASSPATH="$SPARK_HOME/conf:$SPARK_CLASSPATH";
+fi
+
+case "$1" in
+ driver)
+ shift 1
+ CMD=(
+ "$SPARK_HOME/bin/spark-submit"
+ --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS"
+ --deploy-mode client
+ "$@"
+ )
+ ;;
+ executor)
+ shift 1
+ CMD=(
+ ${JAVA_HOME}/bin/java
+ "${SPARK_EXECUTOR_JAVA_OPTS[@]}"
+ -Xms$SPARK_EXECUTOR_MEMORY
+ -Xmx$SPARK_EXECUTOR_MEMORY
+ -cp "$SPARK_CLASSPATH:$SPARK_DIST_CLASSPATH"
+ org.apache.spark.scheduler.cluster.k8s.KubernetesExecutorBackend
+ --driver-url $SPARK_DRIVER_URL
+ --executor-id $SPARK_EXECUTOR_ID
+ --cores $SPARK_EXECUTOR_CORES
+ --app-id $SPARK_APPLICATION_ID
+ --hostname $SPARK_EXECUTOR_POD_IP
+ --resourceProfileId $SPARK_RESOURCE_PROFILE_ID
+ --podName $SPARK_EXECUTOR_POD_NAME
+ )
+ ;;
+
+ *)
+ # Non-spark-on-k8s command provided, proceeding in pass-through mode...
+ CMD=("$@")
+ ;;
+esac
+
+# Switch to spark if no USER specified (root by default) otherwise use USER directly
+switch_spark_if_root() {
+ if [ $(id -u) -eq 0 ]; then
+ echo gosu spark
+ fi
+}
+
+# Execute the container CMD under tini for better hygiene
+exec $(switch_spark_if_root) /usr/bin/tini -s -- "${CMD[@]}"
diff --git a/spark_scala/Dockerfile b/spark_scala/Dockerfile
new file mode 100644
index 0000000..cbdf6c5
--- /dev/null
+++ b/spark_scala/Dockerfile
@@ -0,0 +1,79 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+FROM eclipse-temurin:11-jre-focal
+
+ARG spark_uid=185
+
+RUN groupadd --system --gid=${spark_uid} spark && \
+ useradd --system --uid=${spark_uid} --gid=spark spark
+
+RUN set -ex && \
+ apt-get update && \
+ ln -s /lib /lib64 && \
+ apt install -y gnupg2 wget bash tini libc6 libpam-modules krb5-user libnss3 procps net-tools gosu && \
+ mkdir -p /opt/spark && \
+ mkdir -p /opt/spark/examples && \
+ mkdir -p /opt/spark/work-dir && \
+ touch /opt/spark/RELEASE && \
+ chown -R spark:spark /opt/spark && \
+ rm /bin/sh && \
+ ln -sv /bin/bash /bin/sh && \
+ echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su && \
+ chgrp root /etc/passwd && chmod ug+rw /etc/passwd && \
+ rm -rf /var/cache/apt/* && \
+ rm -rf /var/lib/apt/lists/*
+
+# Install Apache Spark
+# https://downloads.apache.org/spark/KEYS
+ENV SPARK_TGZ_URL=https://archive.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz \
+ SPARK_TGZ_ASC_URL=https://archive.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz.asc \
+ GPG_KEY=80FB8EBE8EBA68504989703491B5DC815DBF10D3
+
+RUN set -ex; \
+ export SPARK_TMP="$(mktemp -d)"; \
+ cd $SPARK_TMP; \
+ wget -nv -O spark.tgz "$SPARK_TGZ_URL"; \
+ wget -nv -O spark.tgz.asc "$SPARK_TGZ_ASC_URL"; \
+ export GNUPGHOME="$(mktemp -d)"; \
+ gpg --keyserver hkps://keys.openpgp.org --recv-key "$GPG_KEY" || \
+ gpg --keyserver hkps://keyserver.ubuntu.com --recv-keys "$GPG_KEY"; \
+ gpg --batch --verify spark.tgz.asc spark.tgz; \
+ gpgconf --kill all; \
+ rm -rf "$GNUPGHOME" spark.tgz.asc; \
+ \
+ tar -xf spark.tgz --strip-components=1; \
+ chown -R spark:spark .; \
+ mv jars /opt/spark/; \
+ mv bin /opt/spark/; \
+ mv sbin /opt/spark/; \
+ mv kubernetes/dockerfiles/spark/decom.sh /opt/; \
+ mv examples /opt/spark/; \
+ mv kubernetes/tests /opt/spark/; \
+ mv data /opt/spark/; \
+ cd ..; \
+ rm -rf "$SPARK_TMP";
+
+COPY entrypoint.sh /opt/
+
+ENV SPARK_HOME /opt/spark
+
+WORKDIR /opt/spark/work-dir
+RUN chmod g+w /opt/spark/work-dir
+RUN chmod a+x /opt/decom.sh
+RUN chmod a+x /opt/entrypoint.sh
+
+ENTRYPOINT [ "/opt/entrypoint.sh" ]
diff --git a/spark_scala/entrypoint.sh b/spark_scala/entrypoint.sh
new file mode 100644
index 0000000..4bb1557
--- /dev/null
+++ b/spark_scala/entrypoint.sh
@@ -0,0 +1,114 @@
+#!/bin/bash
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Check whether there is a passwd entry for the container UID
+myuid=$(id -u)
+mygid=$(id -g)
+# turn off -e for getent because it will return error code in anonymous uid case
+set +e
+uidentry=$(getent passwd $myuid)
+set -e
+
+# If there is no passwd entry for the container UID, attempt to create one
+if [ -z "$uidentry" ] ; then
+ if [ -w /etc/passwd ] ; then
+ echo "$myuid:x:$myuid:$mygid:${SPARK_USER_NAME:-anonymous uid}:$SPARK_HOME:/bin/false" >> /etc/passwd
+ else
+ echo "Container ENTRYPOINT failed to add passwd entry for anonymous UID"
+ fi
+fi
+
+if [ -z "$JAVA_HOME" ]; then
+ JAVA_HOME=$(java -XshowSettings:properties -version 2>&1 > /dev/null | grep 'java.home' | awk '{print $3}')
+fi
+
+SPARK_CLASSPATH="$SPARK_CLASSPATH:${SPARK_HOME}/jars/*"
+env | grep SPARK_JAVA_OPT_ | sort -t_ -k4 -n | sed 's/[^=]*=\(.*\)/\1/g' > /tmp/java_opts.txt
+readarray -t SPARK_EXECUTOR_JAVA_OPTS < /tmp/java_opts.txt
+
+if [ -n "$SPARK_EXTRA_CLASSPATH" ]; then
+ SPARK_CLASSPATH="$SPARK_CLASSPATH:$SPARK_EXTRA_CLASSPATH"
+fi
+
+if ! [ -z ${PYSPARK_PYTHON+x} ]; then
+ export PYSPARK_PYTHON
+fi
+if ! [ -z ${PYSPARK_DRIVER_PYTHON+x} ]; then
+ export PYSPARK_DRIVER_PYTHON
+fi
+
+# If HADOOP_HOME is set and SPARK_DIST_CLASSPATH is not set, set it here so Hadoop jars are available to the executor.
+# It does not set SPARK_DIST_CLASSPATH if already set, to avoid overriding customizations of this value from elsewhere e.g. Docker/K8s.
+if [ -n "${HADOOP_HOME}" ] && [ -z "${SPARK_DIST_CLASSPATH}" ]; then
+ export SPARK_DIST_CLASSPATH="$($HADOOP_HOME/bin/hadoop classpath)"
+fi
+
+if ! [ -z ${HADOOP_CONF_DIR+x} ]; then
+ SPARK_CLASSPATH="$HADOOP_CONF_DIR:$SPARK_CLASSPATH";
+fi
+
+if ! [ -z ${SPARK_CONF_DIR+x} ]; then
+ SPARK_CLASSPATH="$SPARK_CONF_DIR:$SPARK_CLASSPATH";
+elif ! [ -z ${SPARK_HOME+x} ]; then
+ SPARK_CLASSPATH="$SPARK_HOME/conf:$SPARK_CLASSPATH";
+fi
+
+case "$1" in
+ driver)
+ shift 1
+ CMD=(
+ "$SPARK_HOME/bin/spark-submit"
+ --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS"
+ --deploy-mode client
+ "$@"
+ )
+ ;;
+ executor)
+ shift 1
+ CMD=(
+ ${JAVA_HOME}/bin/java
+ "${SPARK_EXECUTOR_JAVA_OPTS[@]}"
+ -Xms$SPARK_EXECUTOR_MEMORY
+ -Xmx$SPARK_EXECUTOR_MEMORY
+ -cp "$SPARK_CLASSPATH:$SPARK_DIST_CLASSPATH"
+ org.apache.spark.scheduler.cluster.k8s.KubernetesExecutorBackend
+ --driver-url $SPARK_DRIVER_URL
+ --executor-id $SPARK_EXECUTOR_ID
+ --cores $SPARK_EXECUTOR_CORES
+ --app-id $SPARK_APPLICATION_ID
+ --hostname $SPARK_EXECUTOR_POD_IP
+ --resourceProfileId $SPARK_RESOURCE_PROFILE_ID
+ --podName $SPARK_EXECUTOR_POD_NAME
+ )
+ ;;
+
+ *)
+ # Non-spark-on-k8s command provided, proceeding in pass-through mode...
+ CMD=("$@")
+ ;;
+esac
+
+# Switch to spark if no USER specified (root by default) otherwise use USER directly
+switch_spark_if_root() {
+ if [ $(id -u) -eq 0 ]; then
+ echo gosu spark
+ fi
+}
+
+# Execute the container CMD under tini for better hygiene
+exec $(switch_spark_if_root) /usr/bin/tini -s -- "${CMD[@]}"
Any ETA on when this could be reviewed and merged? An official image for Apache Spark is much needed!
Hello! :sparkles:
Thanks for your interest in contributing to the official images program. :thought_balloon:
As you may have noticed, we've usually got a pretty decently sized queue of new images (not to mention image updates and maintenance of images under @docker-library which are maintained by the core official images team). As such, it may be some time before we get to reviewing this image (image updates get priority both because users expect them and because reviewing new images is a more involved process than reviewing updates), so we apologize in advance! Please be patient with us -- rest assured, we've seen your PR and it's in the queue. :heart:
We do try to proactively add and update the "new image checklist" on each PR, so if you haven't looked at it yet, that's a good use of time while you wait. :umbrella:
Thanks! :sparkling_heart: :blue_heart: :green_heart: :heart:
@yosifkit Thank you very much for your reply! We (the Apache Spark community) will actively address comments once we get a review. Looking forward to it!
@yosifkit Today we published the latest release, Spark 3.4.0: https://spark.apache.org/releases/spark-release-3-4-0.html. I am wondering when we can have our own official Docker image?
Ok, I finally have some feedback ready🎉😻. Sorry for the delay 🙇 Let me know if you have any questions. And thank you for your patience 🙇🥰💖
- Would it be useful to save space by sharing layers by having one image from another? 🤔 Something like the `*java11-ubuntu` as the "base" with `r` and `python` variants `FROM` that, and the `r-python` being `FROM`, probably, the larger one of those?
  - Rough example Dockerfiles:

        FROM eclipse-temurin:11-jre-focal
        # user stuff, install common deps, etc ...
        # download/extract spark (maybe keeping python and R files too? they seem relatively small compared to the rest)

        # other images in separate Dockerfiles
        FROM spark:3.3.0-scala2.12-java11-ubuntu
        # get "/opt/spark/{python,R}/" contents if not kept in base
        # install python or R (and things like R_HOME)
`rm /bin/sh && ln -sv /bin/bash /bin/sh`
- Does spark expect `/bin/sh` to be bash (in posix mode) and not just a generic posix compliant shell? Is there a bug with using `dash`? Minimally, this should do a `dpkg --divert` to ensure that users in dependent images don't accidentally revert it with package updates.
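A rough sketch of what that diversion could look like inside the image's `RUN` step (my sketch, not the maintainers' exact wording; see `dpkg-divert(1)` for the flags used):

```bash
# Divert /bin/sh so that package upgrades in dependent images cannot
# silently restore the dash symlink, then point it at bash.
dpkg-divert --local --rename --add /bin/sh   # moves /bin/sh aside (to /bin/sh.distrib) and records the diversion
ln -sv /bin/bash /bin/sh
```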
echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su &&
- I am unsure what this is for? 😕 As far as I can tell, this means that only members of the administrative group
wheel
(or0
if there is no wheel) can switch to another user using the su command. That might make sense on a regular multi-user system, but I am unsure why it would matter for a container.
`chgrp root /etc/passwd && chmod ug+rw /etc/passwd`
- Wider permissions on `/etc/passwd` is concerning. What use case is broken if the running user id doesn't exist?
`echo ... >> /etc/passwd`
- Having the entrypoint itself modify `/etc/passwd` is fragile. Are there features that are broken if the user doesn't exist in `/etc/passwd` (like PostgreSQL's `initdb` that refuses to run)? Minimally, this should probably use `useradd` and `usermod` rather than hand editing.
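For illustration, a minimal sketch of how the entrypoint's passwd handling could use `useradd` instead of appending to `/etc/passwd` by hand (my sketch, assuming the running user is permitted to modify the account database, e.g. because `/etc/passwd` was made group-writable as in this image):

```bash
#!/usr/bin/env bash
set -euo pipefail

myuid="$(id -u)"
mygid="$(id -g)"

# Only create an entry if the current UID has none yet
if ! getent passwd "$myuid" > /dev/null 2>&1; then
    useradd \
        --no-create-home \
        --no-user-group \
        --uid "$myuid" \
        --gid "$mygid" \
        --home-dir "${SPARK_HOME:-/opt/spark}" \
        --shell /bin/false \
        "${SPARK_USER_NAME:-anonymous}" \
    || echo "WARNING: could not add passwd entry for UID $myuid" >&2
fi
```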
`env | grep SPARK_JAVA_OPT_ |...`
- This is susceptible to a few bugs, particularly around newlines in values. I see two ways around it:
  - ensuring the matching name is at the start, `-E '^SPARK_JAVA_OPT_'`, and running all the commands with null-terminated input and output: `env -0` and `-z` on the other commands
  - or bash variable prefix expansion:

        for v in "${!SPARK_JAVA_OPT_@}"; do
            SPARK_EXECUTOR_JAVA_OPTS+=( "${!v}" )
        done
- `switch_spark_if_root` and `tini` should only happen when running a spark `driver` or `executor` and not for something like `docker run -it spark bash`.
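A rough sketch of how the entrypoint's final `exec` could be restricted that way (my sketch, not the actual entrypoint; it reuses the existing `CMD` construction):

```bash
case "$1" in
  driver|executor)
    # ... build CMD=( ... ) exactly as the current entrypoint does ...
    if [ "$(id -u)" -eq 0 ]; then
      # running as root: drop to the spark user and keep tini in front
      exec gosu spark /usr/bin/tini -s -- "${CMD[@]}"
    else
      exec /usr/bin/tini -s -- "${CMD[@]}"
    fi
    ;;
  *)
    # e.g. `docker run -it spark bash`: run the given command as-is
    exec "$@"
    ;;
esac
```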
- To minimize duplication across layers, `chmod`'s should be done in the layer that creates the file/folder (or in the case of a file from the context via `COPY`, it should have the `+x` committed to git).
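Concretely (a small sketch of the two cases, not from the thread):

```bash
# 1) do the chmod in the same RUN layer that creates the directory:
mkdir -p /opt/spark/work-dir && chmod g+w /opt/spark/work-dir

# 2) mark the COPY'd script executable in git, so no extra chmod layer is needed:
git update-index --chmod=+x entrypoint.sh
```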
non-blocking:
- using `set -ex` means you can use `;` instead of `&&` (really only matters for complex expressions, like the `||` in the later `RUN` that does use `;`)
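A tiny illustration of that point (my example, not from the thread):

```bash
set -ex
# with `set -e`, a failing command aborts the whole RUN, so `;` already gives
# stop-on-first-error sequencing, just like `&&` would:
apt-get update; \
apt-get install -y wget
# ...except where a fallback is intentional, which still needs an explicit `||`:
gpg --keyserver hkps://keys.openpgp.org --recv-key "$GPG_KEY" || \
gpg --keyserver hkps://keyserver.ubuntu.com --recv-keys "$GPG_KEY"
```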
@yosifkit Thanks for your reply! We will address all comments as soon as possible!
- Would it be useful to save space by sharing layers by having one image from another? 🤔 Something like the *java11-ubuntu as the "base" with r and python variants FROM that and the r-python being FROM, probably, the larger one of those?
Good suggestion.
- Should we put the base image and the r/python images into separate PRs? (The base image first, then python and r, because after that change the r and python images depend on the base.)
- We are also considering supporting Java 17. Just to be sure: java11 and java17 should be two separate base images, right?
Fixing in: https://github.com/apache/spark-docker/pull/36
- Does spark expect /bin/sh to be bash (in posix mode) and not just a generic posix compliant shell? Is there a bug with using dash? Minimally, this should do a dpkg --divert to ensure that users in dependent images don't accidentally revert it with package updates.
It was introduced together with question 6 (https://github.com/apache-spark-on-k8s/spark/pull/444/files#r134075892); we will try to fix it after question 6 is addressed.
Question 6: This is susceptible to a few bugs particularly around newlines in values. I see two ways around it.
    export SPARK_JAVA_OPT_0="foo=bar"
    export SPARK_JAVA_OPT_1="foo1=bar1"
    env -0 | grep -z -E "^SPARK_JAVA_OPT_" | sort -z -t_ -k4 -n | sed -z 's/[^=]*=\(.*\)/\1/g' > /tmp/java_opts.txt
    # readarray does not seem to support `\0`; even with `-d '\0'` it still does not work
    readarray -t SPARK_EXECUTOR_JAVA_OPTS < /tmp/java_opts.txt
    for v in ${SPARK_EXECUTOR_JAVA_OPTS[@]}; do
        echo $v
    done
    # foo=bar
For bash variable prefix expansion, it works, but can it also support `sh`? (I tried it and it works, but I'm not sure whether it's the best way or not.)
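As an aside (my note, not from the thread): in bash 4.4+ `readarray` can consume NUL-delimited input, but the delimiter must be passed as an empty string, `-d ''`, rather than the literal `'\0'` (whose first character is a backslash). A minimal sketch combining it with the null-terminated pipeline above:

```bash
export SPARK_JAVA_OPT_0="foo=bar"
export SPARK_JAVA_OPT_1=$'foo1=bar1\nstill one value'

# -d '' makes readarray split on NUL bytes (bash >= 4.4)
readarray -t -d '' SPARK_EXECUTOR_JAVA_OPTS < <(
    env -0 | grep -z -E '^SPARK_JAVA_OPT_' | sort -z -t_ -k4 -n | sed -z 's/[^=]*=//'
)

printf '%s\n---\n' "${SPARK_EXECUTOR_JAVA_OPTS[@]}"
```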
- As far as I can tell, this means that only members of the administrative group wheel (or 0 if there is no wheel) can switch to another user using the su command. That might make sense on a regular multi-user system, but I am unsure why it would matter for a container.
I found it was introduced in SPARK-25275 as a security improvement. It is probably also for OpenShift, so if it's not a blocker, we prefer to keep it.
Question 4: Wider permissions on /etc/passwd is concerning. What use case is broken if the running user id doesn't exist? Question 5: Having the entrypoint itself modify /etc/passwd is fragile.
It was introduced for OpenShift: https://github.com/apache-spark-on-k8s/spark/pull/404. ~~But I'm not sure it's still a problem after 6 years?~~ cc @erikerlandson. If it's not a blocker, we prefer to keep it.
BTW, a possibly related case: would you mind sharing some ideas about how we could switch the username in entrypoint.sh, as in https://github.com/apache/spark/pull/40831, to meet the Spark on K8s user-switch requirement? For example, something like `usermod -l`; is that allowed in DOI? (How do we change the username when we have already done `useradd` in the Dockerfile?)
switch_spark_if_root and tini should only happen when running a spark driver or executor and not for something like docker run -it spark bash.
Actually, I tried other Apache official images like solr / flink:
# docker run -ti solr bash
solr@0d356d1b66f7:/opt/solr-9.2.1$
# docker run -ti flink bash
flink@89bf6453d82e:~$
They use the specific username; did I miss something?
Questions 8 and 9: we will address them soon!
- https://github.com/apache/spark-docker/pull/38
- https://github.com/apache/spark-docker/pull/37
Regarding the modification of `/etc/passwd`: OpenShift runs its pods with an anonymous random uid, and so you cannot assume any particular uid in the image. And there will be no such entry in the passwd file, which Spark used to require to run.
Does Spark still require an entry to exist in `/etc/passwd`? If that is no longer required, you may be able to remove this logic. Otherwise, it would still be needed for OpenShift.
see: https://cloud.redhat.com/blog/a-guide-to-openshift-and-uids
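For local testing, a rough way to emulate that arbitrary-UID behavior (my sketch; the image tag is just an example, and whether a passwd entry exists will depend on what the entrypoint does):

```bash
# run with an arbitrary non-root uid and gid 0, the way OpenShift schedules pods
docker run --rm --user 123456:0 spark:3.4.0-scala2.12-java11-ubuntu \
    bash -c 'id; getent passwd "$(id -u)" || echo "no /etc/passwd entry for this uid"'
```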
@erikerlandson Thanks very much for your reply. We added `useradd`, so it seems this is not a problem now?
https://github.com/apache/spark-docker/blob/fe05e38f0ffad271edccd6ae40a77d5f14f3eef7/3.4.0/scala2.12-java11-ubuntu/Dockerfile#L21
And is there any more background on question 3 (`/etc/pam.d/su`, SPARK-25275)? Is it also for OpenShift?
@Yikun I don't think that will work. It adds an entry for uid=185, but the uid will be different when the image is run in the pod, and you will never be able to predict what the uid is beforehand.
Regarding SPARK-25275, I no longer recall what the underlying issue was. Unless you can run CI testing on OpenShift, I'd recommend you leave these things in.
Diff for 124747344c8db41387d57ea898b3626b8641f849:
diff --git a/_bashbrew-arches b/_bashbrew-arches
index 8b13789..e85a97f 100644
--- a/_bashbrew-arches
+++ b/_bashbrew-arches
@@ -1 +1,2 @@
-
+amd64
+arm64v8
diff --git a/_bashbrew-cat b/_bashbrew-cat
index bdfae4a..2c6e990 100644
--- a/_bashbrew-cat
+++ b/_bashbrew-cat
@@ -1 +1,22 @@
-Maintainers: New Image! :D (@docker-library-bot)
+Maintainers: Apache Spark Developers <[email protected]> (@ApacheSpark)
+GitRepo: https://github.com/apache/spark-docker.git
+
+Tags: 3.4.0-scala2.12-java11-python3-r-ubuntu
+Architectures: amd64, arm64v8
+GitCommit: 7f9b414de48639d69c64acfd81e6792517b86f61
+Directory: 3.4.0/scala2.12-java11-python3-r-ubuntu
+
+Tags: 3.4.0-scala2.12-java11-python3-ubuntu, 3.4.0-python3, python3, 3.4.0, latest
+Architectures: amd64, arm64v8
+GitCommit: 7f9b414de48639d69c64acfd81e6792517b86f61
+Directory: 3.4.0/scala2.12-java11-python3-ubuntu
+
+Tags: 3.4.0-scala2.12-java11-r-ubuntu, 3.4.0-r, r
+Architectures: amd64, arm64v8
+GitCommit: 7f9b414de48639d69c64acfd81e6792517b86f61
+Directory: 3.4.0/scala2.12-java11-r-ubuntu
+
+Tags: 3.4.0-scala2.12-java11-ubuntu, 3.4.0-scala, scala
+Architectures: amd64, arm64v8
+GitCommit: 7f9b414de48639d69c64acfd81e6792517b86f61
+Directory: 3.4.0/scala2.12-java11-ubuntu
diff --git a/_bashbrew-list b/_bashbrew-list
index e69de29..f72d637 100644
--- a/_bashbrew-list
+++ b/_bashbrew-list
@@ -0,0 +1,12 @@
+spark:3.4.0
+spark:3.4.0-python3
+spark:3.4.0-r
+spark:3.4.0-scala
+spark:3.4.0-scala2.12-java11-python3-r-ubuntu
+spark:3.4.0-scala2.12-java11-python3-ubuntu
+spark:3.4.0-scala2.12-java11-r-ubuntu
+spark:3.4.0-scala2.12-java11-ubuntu
+spark:latest
+spark:python3
+spark:r
+spark:scala
diff --git a/_bashbrew-list-build-order b/_bashbrew-list-build-order
index e69de29..66d8919 100644
--- a/_bashbrew-list-build-order
+++ b/_bashbrew-list-build-order
@@ -0,0 +1,4 @@
+spark:3.4.0-scala2.12-java11-python3-r-ubuntu
+spark:latest
+spark:r
+spark:scala
diff --git a/spark_3.4.0-scala2.12-java11-python3-r-ubuntu/Dockerfile b/spark_3.4.0-scala2.12-java11-python3-r-ubuntu/Dockerfile
new file mode 100644
index 0000000..12c7a4f
--- /dev/null
+++ b/spark_3.4.0-scala2.12-java11-python3-r-ubuntu/Dockerfile
@@ -0,0 +1,27 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+ARG BASE_IMAGE=spark:3.4.0-scala2.12-java11-ubuntu
+FROM $BASE_IMAGE
+
+RUN set -ex; \
+ apt-get update; \
+ apt install -y python3 python3-pip; \
+ apt install -y r-base r-base-dev; \
+ rm -rf /var/cache/apt/*; \
+ rm -rf /var/lib/apt/lists/*
+
+ENV R_HOME /usr/lib/R
diff --git a/spark_latest/Dockerfile b/spark_latest/Dockerfile
new file mode 100644
index 0000000..1f0dd1f
--- /dev/null
+++ b/spark_latest/Dockerfile
@@ -0,0 +1,24 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+ARG BASE_IMAGE=spark:3.4.0-scala2.12-java11-ubuntu
+FROM $BASE_IMAGE
+
+RUN set -ex; \
+ apt-get update; \
+ apt install -y python3 python3-pip; \
+ rm -rf /var/cache/apt/*; \
+ rm -rf /var/lib/apt/lists/*
diff --git a/spark_r/Dockerfile b/spark_r/Dockerfile
new file mode 100644
index 0000000..53647b2
--- /dev/null
+++ b/spark_r/Dockerfile
@@ -0,0 +1,26 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+ARG BASE_IMAGE=spark:3.4.0-scala2.12-java11-ubuntu
+FROM $BASE_IMAGE
+
+RUN set -ex; \
+ apt-get update; \
+ apt install -y r-base r-base-dev; \
+ rm -rf /var/cache/apt/*; \
+ rm -rf /var/lib/apt/lists/*
+
+ENV R_HOME /usr/lib/R
diff --git a/spark_scala/Dockerfile b/spark_scala/Dockerfile
new file mode 100644
index 0000000..11f997f
--- /dev/null
+++ b/spark_scala/Dockerfile
@@ -0,0 +1,82 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+FROM eclipse-temurin:11-jre-focal
+
+ARG spark_uid=185
+
+RUN groupadd --system --gid=${spark_uid} spark && \
+ useradd --system --uid=${spark_uid} --gid=spark spark
+
+RUN set -ex; \
+ apt-get update; \
+ ln -s /lib /lib64; \
+ apt install -y gnupg2 wget bash tini libc6 libpam-modules krb5-user libnss3 procps net-tools gosu; \
+ mkdir -p /opt/spark; \
+ mkdir /opt/spark/python; \
+ mkdir -p /opt/spark/examples; \
+ mkdir -p /opt/spark/work-dir; \
+ chmod g+w /opt/spark/work-dir; \
+ touch /opt/spark/RELEASE; \
+ chown -R spark:spark /opt/spark; \
+ rm /bin/sh; \
+ ln -sv /bin/bash /bin/sh; \
+ echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su; \
+ chgrp root /etc/passwd && chmod ug+rw /etc/passwd; \
+ rm -rf /var/cache/apt/*; \
+ rm -rf /var/lib/apt/lists/*
+
+# Install Apache Spark
+# https://downloads.apache.org/spark/KEYS
+ENV SPARK_TGZ_URL=https://archive.apache.org/dist/spark/spark-3.4.0/spark-3.4.0-bin-hadoop3.tgz \
+ SPARK_TGZ_ASC_URL=https://archive.apache.org/dist/spark/spark-3.4.0/spark-3.4.0-bin-hadoop3.tgz.asc \
+ GPG_KEY=CC68B3D16FE33A766705160BA7E57908C7A4E1B1
+
+RUN set -ex; \
+ export SPARK_TMP="$(mktemp -d)"; \
+ cd $SPARK_TMP; \
+ wget -nv -O spark.tgz "$SPARK_TGZ_URL"; \
+ wget -nv -O spark.tgz.asc "$SPARK_TGZ_ASC_URL"; \
+ export GNUPGHOME="$(mktemp -d)"; \
+ gpg --keyserver hkps://keys.openpgp.org --recv-key "$GPG_KEY" || \
+ gpg --keyserver hkps://keyserver.ubuntu.com --recv-keys "$GPG_KEY"; \
+ gpg --batch --verify spark.tgz.asc spark.tgz; \
+ gpgconf --kill all; \
+ rm -rf "$GNUPGHOME" spark.tgz.asc; \
+ \
+ tar -xf spark.tgz --strip-components=1; \
+ chown -R spark:spark .; \
+ mv jars /opt/spark/; \
+ mv bin /opt/spark/; \
+ mv sbin /opt/spark/; \
+ mv kubernetes/dockerfiles/spark/decom.sh /opt/; \
+ mv examples /opt/spark/; \
+ mv kubernetes/tests /opt/spark/; \
+ mv data /opt/spark/; \
+ mv python/pyspark /opt/spark/python/pyspark/; \
+ mv python/lib /opt/spark/python/lib/; \
+ mv R /opt/spark/; \
+ chmod a+x /opt/decom.sh; \
+ cd ..; \
+ rm -rf "$SPARK_TMP";
+
+COPY entrypoint.sh /opt/
+
+ENV SPARK_HOME /opt/spark
+
+WORKDIR /opt/spark/work-dir
+
+ENTRYPOINT [ "/opt/entrypoint.sh" ]
diff --git a/spark_scala/entrypoint.sh b/spark_scala/entrypoint.sh
new file mode 100755
index 0000000..4bb1557
--- /dev/null
+++ b/spark_scala/entrypoint.sh
@@ -0,0 +1,114 @@
+#!/bin/bash
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Check whether there is a passwd entry for the container UID
+myuid=$(id -u)
+mygid=$(id -g)
+# turn off -e for getent because it will return error code in anonymous uid case
+set +e
+uidentry=$(getent passwd $myuid)
+set -e
+
+# If there is no passwd entry for the container UID, attempt to create one
+if [ -z "$uidentry" ] ; then
+ if [ -w /etc/passwd ] ; then
+ echo "$myuid:x:$myuid:$mygid:${SPARK_USER_NAME:-anonymous uid}:$SPARK_HOME:/bin/false" >> /etc/passwd
+ else
+ echo "Container ENTRYPOINT failed to add passwd entry for anonymous UID"
+ fi
+fi
+
+if [ -z "$JAVA_HOME" ]; then
+ JAVA_HOME=$(java -XshowSettings:properties -version 2>&1 > /dev/null | grep 'java.home' | awk '{print $3}')
+fi
+
+SPARK_CLASSPATH="$SPARK_CLASSPATH:${SPARK_HOME}/jars/*"
+env | grep SPARK_JAVA_OPT_ | sort -t_ -k4 -n | sed 's/[^=]*=\(.*\)/\1/g' > /tmp/java_opts.txt
+readarray -t SPARK_EXECUTOR_JAVA_OPTS < /tmp/java_opts.txt
+
+if [ -n "$SPARK_EXTRA_CLASSPATH" ]; then
+ SPARK_CLASSPATH="$SPARK_CLASSPATH:$SPARK_EXTRA_CLASSPATH"
+fi
+
+if ! [ -z ${PYSPARK_PYTHON+x} ]; then
+ export PYSPARK_PYTHON
+fi
+if ! [ -z ${PYSPARK_DRIVER_PYTHON+x} ]; then
+ export PYSPARK_DRIVER_PYTHON
+fi
+
+# If HADOOP_HOME is set and SPARK_DIST_CLASSPATH is not set, set it here so Hadoop jars are available to the executor.
+# It does not set SPARK_DIST_CLASSPATH if already set, to avoid overriding customizations of this value from elsewhere e.g. Docker/K8s.
+if [ -n "${HADOOP_HOME}" ] && [ -z "${SPARK_DIST_CLASSPATH}" ]; then
+ export SPARK_DIST_CLASSPATH="$($HADOOP_HOME/bin/hadoop classpath)"
+fi
+
+if ! [ -z ${HADOOP_CONF_DIR+x} ]; then
+ SPARK_CLASSPATH="$HADOOP_CONF_DIR:$SPARK_CLASSPATH";
+fi
+
+if ! [ -z ${SPARK_CONF_DIR+x} ]; then
+ SPARK_CLASSPATH="$SPARK_CONF_DIR:$SPARK_CLASSPATH";
+elif ! [ -z ${SPARK_HOME+x} ]; then
+ SPARK_CLASSPATH="$SPARK_HOME/conf:$SPARK_CLASSPATH";
+fi
+
+case "$1" in
+ driver)
+ shift 1
+ CMD=(
+ "$SPARK_HOME/bin/spark-submit"
+ --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS"
+ --deploy-mode client
+ "$@"
+ )
+ ;;
+ executor)
+ shift 1
+ CMD=(
+ ${JAVA_HOME}/bin/java
+ "${SPARK_EXECUTOR_JAVA_OPTS[@]}"
+ -Xms$SPARK_EXECUTOR_MEMORY
+ -Xmx$SPARK_EXECUTOR_MEMORY
+ -cp "$SPARK_CLASSPATH:$SPARK_DIST_CLASSPATH"
+ org.apache.spark.scheduler.cluster.k8s.KubernetesExecutorBackend
+ --driver-url $SPARK_DRIVER_URL
+ --executor-id $SPARK_EXECUTOR_ID
+ --cores $SPARK_EXECUTOR_CORES
+ --app-id $SPARK_APPLICATION_ID
+ --hostname $SPARK_EXECUTOR_POD_IP
+ --resourceProfileId $SPARK_RESOURCE_PROFILE_ID
+ --podName $SPARK_EXECUTOR_POD_NAME
+ )
+ ;;
+
+ *)
+ # Non-spark-on-k8s command provided, proceeding in pass-through mode...
+ CMD=("$@")
+ ;;
+esac
+
+# Switch to spark if no USER specified (root by default) otherwise use USER directly
+switch_spark_if_root() {
+ if [ $(id -u) -eq 0 ]; then
+ echo gosu spark
+ fi
+}
+
+# Execute the container CMD under tini for better hygiene
+exec $(switch_spark_if_root) /usr/bin/tini -s -- "${CMD[@]}"
@yosifkit Questions 1, 8, and 9 are resolved; questions 3, 4, 5, 6, and 7 await your reply. Thanks!
should we put the base image and r, python image into separate PR? (first base and left python and r, because after change that, r, python depends on base).
The CI failed because the python/r/all images depend on the base image, so I guess the first PR should be the base image, right? @yosifkit
Because our build system has to be able to query/calculate inter-image dependencies, we unfortunately do not support parameterized `FROM` values.
@tianon So the suggestion @yosifkit mentioned before (question 1) is to first add the base image and then add pyspark/sparkr.
When the version is bumped in the future, should that also be two separate PRs (base first, then pyspark/sparkr)?
Diff for 8d0e8535a523d04b7a55841be2dcff7beba8152a:
diff --git a/_bashbrew-arches b/_bashbrew-arches
index 8b13789..e85a97f 100644
--- a/_bashbrew-arches
+++ b/_bashbrew-arches
@@ -1 +1,2 @@
-
+amd64
+arm64v8
diff --git a/_bashbrew-cat b/_bashbrew-cat
index bdfae4a..f319ede 100644
--- a/_bashbrew-cat
+++ b/_bashbrew-cat
@@ -1 +1,7 @@
-Maintainers: New Image! :D (@docker-library-bot)
+Maintainers: Apache Spark Developers <[email protected]> (@ApacheSpark)
+GitRepo: https://github.com/apache/spark-docker.git
+
+Tags: 3.4.0-scala2.12-java11-ubuntu, 3.4.0-scala, scala
+Architectures: amd64, arm64v8
+GitCommit: 7f9b414de48639d69c64acfd81e6792517b86f61
+Directory: 3.4.0/scala2.12-java11-ubuntu
diff --git a/_bashbrew-list b/_bashbrew-list
index e69de29..4f0221f 100644
--- a/_bashbrew-list
+++ b/_bashbrew-list
@@ -0,0 +1,3 @@
+spark:3.4.0-scala
+spark:3.4.0-scala2.12-java11-ubuntu
+spark:scala
diff --git a/_bashbrew-list-build-order b/_bashbrew-list-build-order
index e69de29..479f94b 100644
--- a/_bashbrew-list-build-order
+++ b/_bashbrew-list-build-order
@@ -0,0 +1 @@
+spark:scala
diff --git a/spark_scala/Dockerfile b/spark_scala/Dockerfile
new file mode 100644
index 0000000..11f997f
--- /dev/null
+++ b/spark_scala/Dockerfile
@@ -0,0 +1,82 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+FROM eclipse-temurin:11-jre-focal
+
+ARG spark_uid=185
+
+RUN groupadd --system --gid=${spark_uid} spark && \
+ useradd --system --uid=${spark_uid} --gid=spark spark
+
+RUN set -ex; \
+ apt-get update; \
+ ln -s /lib /lib64; \
+ apt install -y gnupg2 wget bash tini libc6 libpam-modules krb5-user libnss3 procps net-tools gosu; \
+ mkdir -p /opt/spark; \
+ mkdir /opt/spark/python; \
+ mkdir -p /opt/spark/examples; \
+ mkdir -p /opt/spark/work-dir; \
+ chmod g+w /opt/spark/work-dir; \
+ touch /opt/spark/RELEASE; \
+ chown -R spark:spark /opt/spark; \
+ rm /bin/sh; \
+ ln -sv /bin/bash /bin/sh; \
+ echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su; \
+ chgrp root /etc/passwd && chmod ug+rw /etc/passwd; \
+ rm -rf /var/cache/apt/*; \
+ rm -rf /var/lib/apt/lists/*
+
+# Install Apache Spark
+# https://downloads.apache.org/spark/KEYS
+ENV SPARK_TGZ_URL=https://archive.apache.org/dist/spark/spark-3.4.0/spark-3.4.0-bin-hadoop3.tgz \
+ SPARK_TGZ_ASC_URL=https://archive.apache.org/dist/spark/spark-3.4.0/spark-3.4.0-bin-hadoop3.tgz.asc \
+ GPG_KEY=CC68B3D16FE33A766705160BA7E57908C7A4E1B1
+
+RUN set -ex; \
+ export SPARK_TMP="$(mktemp -d)"; \
+ cd $SPARK_TMP; \
+ wget -nv -O spark.tgz "$SPARK_TGZ_URL"; \
+ wget -nv -O spark.tgz.asc "$SPARK_TGZ_ASC_URL"; \
+ export GNUPGHOME="$(mktemp -d)"; \
+ gpg --keyserver hkps://keys.openpgp.org --recv-key "$GPG_KEY" || \
+ gpg --keyserver hkps://keyserver.ubuntu.com --recv-keys "$GPG_KEY"; \
+ gpg --batch --verify spark.tgz.asc spark.tgz; \
+ gpgconf --kill all; \
+ rm -rf "$GNUPGHOME" spark.tgz.asc; \
+ \
+ tar -xf spark.tgz --strip-components=1; \
+ chown -R spark:spark .; \
+ mv jars /opt/spark/; \
+ mv bin /opt/spark/; \
+ mv sbin /opt/spark/; \
+ mv kubernetes/dockerfiles/spark/decom.sh /opt/; \
+ mv examples /opt/spark/; \
+ mv kubernetes/tests /opt/spark/; \
+ mv data /opt/spark/; \
+ mv python/pyspark /opt/spark/python/pyspark/; \
+ mv python/lib /opt/spark/python/lib/; \
+ mv R /opt/spark/; \
+ chmod a+x /opt/decom.sh; \
+ cd ..; \
+ rm -rf "$SPARK_TMP";
+
+COPY entrypoint.sh /opt/
+
+ENV SPARK_HOME /opt/spark
+
+WORKDIR /opt/spark/work-dir
+
+ENTRYPOINT [ "/opt/entrypoint.sh" ]
diff --git a/spark_scala/entrypoint.sh b/spark_scala/entrypoint.sh
new file mode 100755
index 0000000..4bb1557
--- /dev/null
+++ b/spark_scala/entrypoint.sh
@@ -0,0 +1,114 @@
+#!/bin/bash
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Check whether there is a passwd entry for the container UID
+myuid=$(id -u)
+mygid=$(id -g)
+# turn off -e for getent because it will return error code in anonymous uid case
+set +e
+uidentry=$(getent passwd $myuid)
+set -e
+
+# If there is no passwd entry for the container UID, attempt to create one
+if [ -z "$uidentry" ] ; then
+ if [ -w /etc/passwd ] ; then
+ echo "$myuid:x:$myuid:$mygid:${SPARK_USER_NAME:-anonymous uid}:$SPARK_HOME:/bin/false" >> /etc/passwd
+ else
+ echo "Container ENTRYPOINT failed to add passwd entry for anonymous UID"
+ fi
+fi
+
+if [ -z "$JAVA_HOME" ]; then
+ JAVA_HOME=$(java -XshowSettings:properties -version 2>&1 > /dev/null | grep 'java.home' | awk '{print $3}')
+fi
+
+SPARK_CLASSPATH="$SPARK_CLASSPATH:${SPARK_HOME}/jars/*"
+env | grep SPARK_JAVA_OPT_ | sort -t_ -k4 -n | sed 's/[^=]*=\(.*\)/\1/g' > /tmp/java_opts.txt
+readarray -t SPARK_EXECUTOR_JAVA_OPTS < /tmp/java_opts.txt
+
+if [ -n "$SPARK_EXTRA_CLASSPATH" ]; then
+ SPARK_CLASSPATH="$SPARK_CLASSPATH:$SPARK_EXTRA_CLASSPATH"
+fi
+
+if ! [ -z ${PYSPARK_PYTHON+x} ]; then
+ export PYSPARK_PYTHON
+fi
+if ! [ -z ${PYSPARK_DRIVER_PYTHON+x} ]; then
+ export PYSPARK_DRIVER_PYTHON
+fi
+
+# If HADOOP_HOME is set and SPARK_DIST_CLASSPATH is not set, set it here so Hadoop jars are available to the executor.
+# It does not set SPARK_DIST_CLASSPATH if already set, to avoid overriding customizations of this value from elsewhere e.g. Docker/K8s.
+if [ -n "${HADOOP_HOME}" ] && [ -z "${SPARK_DIST_CLASSPATH}" ]; then
+ export SPARK_DIST_CLASSPATH="$($HADOOP_HOME/bin/hadoop classpath)"
+fi
+
+if ! [ -z ${HADOOP_CONF_DIR+x} ]; then
+ SPARK_CLASSPATH="$HADOOP_CONF_DIR:$SPARK_CLASSPATH";
+fi
+
+if ! [ -z ${SPARK_CONF_DIR+x} ]; then
+ SPARK_CLASSPATH="$SPARK_CONF_DIR:$SPARK_CLASSPATH";
+elif ! [ -z ${SPARK_HOME+x} ]; then
+ SPARK_CLASSPATH="$SPARK_HOME/conf:$SPARK_CLASSPATH";
+fi
+
+case "$1" in
+ driver)
+ shift 1
+ CMD=(
+ "$SPARK_HOME/bin/spark-submit"
+ --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS"
+ --deploy-mode client
+ "$@"
+ )
+ ;;
+ executor)
+ shift 1
+ CMD=(
+ ${JAVA_HOME}/bin/java
+ "${SPARK_EXECUTOR_JAVA_OPTS[@]}"
+ -Xms$SPARK_EXECUTOR_MEMORY
+ -Xmx$SPARK_EXECUTOR_MEMORY
+ -cp "$SPARK_CLASSPATH:$SPARK_DIST_CLASSPATH"
+ org.apache.spark.scheduler.cluster.k8s.KubernetesExecutorBackend
+ --driver-url $SPARK_DRIVER_URL
+ --executor-id $SPARK_EXECUTOR_ID
+ --cores $SPARK_EXECUTOR_CORES
+ --app-id $SPARK_APPLICATION_ID
+ --hostname $SPARK_EXECUTOR_POD_IP
+ --resourceProfileId $SPARK_RESOURCE_PROFILE_ID
+ --podName $SPARK_EXECUTOR_POD_NAME
+ )
+ ;;
+
+ *)
+ # Non-spark-on-k8s command provided, proceeding in pass-through mode...
+ CMD=("$@")
+ ;;
+esac
+
+# Switch to spark if no USER specified (root by default) otherwise use USER directly
+switch_spark_if_root() {
+ if [ $(id -u) -eq 0 ]; then
+ echo gosu spark
+ fi
+}
+
+# Execute the container CMD under tini for better hygiene
+exec $(switch_spark_if_root) /usr/bin/tini -s -- "${CMD[@]}"
Related to 1:
Should we put the base image and the R/Python images into separate PRs?
There is no need for them to be separate PRs. The `bashbrew` tool figures out build-order dependencies between images (both `eclipse-temurin:xyz` -> `spark` and `spark:foo` -> `spark:bar`). They just can't have variables in the `FROM` values, and all images referenced need to be a current official image or part of the changes.
We are also considering supporting Java 17. Just want to make sure: `java11` and `java17` should be two separate base images, right?
Yes, that's fine. They can't really share the same base since they need different versions of Java and so will need a different `FROM eclipse-temurin:[version]`.
Related to 2 and 6:
It was introduced with question 6, https://github.com/apache-spark-on-k8s/spark/pull/444/files#r134075892
Oh, if that's the place using it, then just setting `SHELL ["bash", "-c"]` will fix that (and it will be used for any `RUN` lines after setting `SHELL`). Though I'd personally recommend just putting all of that `CMD` into a proper shell script (which seems to have already been done?).
For bash variable prefix expansion: it works, but can it also be supported under sh? (I tried it and it works, but I'm not sure whether it's the best way.)
It looks like your script is already running as bash. Are there instances where it doesn't run as bash? The image controls the environment, so you can ensure that `bash` is always available. It will not work with POSIX-compliant `sh`.
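For reference, here's a minimal sketch of the bash-only prefix expansion being discussed (essentially what the updated entrypoint further down does); it relies on bash indirect expansion and has no POSIX `sh` equivalent:

```bash
#!/usr/bin/env bash
# Collect every SPARK_JAVA_OPT_* environment variable into an array.
# "${!SPARK_JAVA_OPT_@}" expands to the *names* of all variables with that
# prefix; "${!v}" then uses indirect expansion to fetch each variable's value.
SPARK_EXECUTOR_JAVA_OPTS=()
for v in "${!SPARK_JAVA_OPT_@}"; do
  SPARK_EXECUTOR_JAVA_OPTS+=( "${!v}" )
done
printf '%s\n' "${SPARK_EXECUTOR_JAVA_OPTS[@]}"
```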
Related to 3:
I found it was introduced in SPARK-25275 as a security improvement. It is also needed for OpenShift, so if it's not a blocker, we would prefer to keep it.
Yeah, it is extra, but harmless. A stronger guarantee against privilege escalation would be to recommend that users set `--security-opt=no-new-privileges` (or `allowPrivilegeEscalation: false` in Kubernetes).
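For illustration, this is roughly how a user would apply that recommendation when launching the image (the tag and command here are just examples):

```bash
# Docker: forbid the container (and anything it execs, e.g. setuid binaries)
# from gaining additional privileges.
docker run --rm --security-opt=no-new-privileges spark:scala /opt/spark/bin/spark-submit --version

# Kubernetes equivalent, kept as a comment so this block stays plain shell:
#   securityContext:
#     allowPrivilegeEscalation: false
```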
Related to 4 and 5:
It was introduced for OpenShift, apache-spark-on-k8s/spark#404.
Reading into that issue, it seems that Spark will crash if it can't find a passwd entry (maybe it doesn't now?), so that sounds a lot like what we have with PostgreSQL's initdb, which requires the user to exist (https://github.com/docker-library/postgres/pull/448). Perhaps something similar with `libnss_wrapper` would work for Spark?
BTW, a possibly related case: would you mind sharing some ideas on how we could switch the username in entrypoint.sh, as in apache/spark#40831, to meet the Spark-on-K8s user-switching requirement?
:thinking: Perhaps the `libnss_wrapper` trick would also work for this?
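Roughly what that trick looks like, modeled on the PostgreSQL entrypoint linked above (and close to what the later apache/spark-docker entrypoint ends up doing); the fake user name and GECOS field below are placeholders:

```bash
# If the current UID has no passwd entry (e.g. OpenShift's random UID),
# preload libnss_wrapper and point it at temporary passwd/group files that
# contain a fake entry for this UID, so getpwuid() lookups succeed.
myuid="$(id -u)"; mygid="$(id -g)"
if ! getent passwd "$myuid" > /dev/null 2>&1; then
  for wrapper in {/usr,}/lib{/*,}/libnss_wrapper.so; do
    if [ -s "$wrapper" ]; then
      NSS_WRAPPER_PASSWD="$(mktemp)"
      NSS_WRAPPER_GROUP="$(mktemp)"
      export LD_PRELOAD="$wrapper" NSS_WRAPPER_PASSWD NSS_WRAPPER_GROUP
      echo "spark:x:$myuid:$mygid:anonymous uid:$SPARK_HOME:/bin/false" > "$NSS_WRAPPER_PASSWD"
      echo "spark:x:$mygid:" > "$NSS_WRAPPER_GROUP"
      break
    fi
  done
fi
```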
Related to 7:
I tried other Apache official images like `solr` / `flink`.
The `solr` image uses a `USER` directive in the Dockerfile, so it can be overridden with `--user` on `docker run`. Unfortunately, it seems `flink` uses an ungated entrypoint user drop-down that we missed in review 😔; that can be frustrating for new users who see that they aren't root, try to switch, and then find that `docker run --user root [image] bash` doesn't work as expected (without also overriding the entrypoint).
@yosifkit
Related to 1:
https://github.com/apache/spark-docker/pull/39 removed the parameterized `FROM`.
Related to 2 and 6:
2 resolved in https://github.com/apache/spark-docker/pull/41 by restoring `sh`.
6 resolved in https://github.com/apache/spark-docker/pull/42 by applying the suggested changes.
Are there instances where it doesn't run as bash?
I think no; my earlier question was because I thought DOI also required `sh` support. For Spark, bash support should be enough.
Related to 7:
Thanks for the explanation, and one more question: which default user (root or spark) does DOI recommend specifying in the Dockerfile?
For Spark, we were using `spark` / `185` as the default user (specified via `USER` in the Dockerfile).
- `docker run --user spark -ti spark bash`: the `spark` user will be used.
- `docker run --user root -ti spark bash`: the `root` user will be used.
- `docker run -ti spark bash`: the `spark` user will be used (rather than `root`). (Is this the expected behavior?)
If the above behaviors are right, question 7 will be resolved by https://github.com/apache/spark-docker/pull/43; but if `root` as the default user is the more reasonable choice, then the fix should be https://github.com/apache/spark-docker/pull/44.
Related to 4 and 5:
We will take a deeper look. So the suggestion is to use the `libnss_wrapper` fake user when the driver/executor is starting, but skip the `libnss_wrapper` step for other commands?
Thanks for the explanation, and one more question: which default user (root or spark) does DOI recommend specifying in the Dockerfile?
We don't yet have specific requirements on when to use `USER` vs. a step-down in an entrypoint script. I recommend that if the entrypoint needs to do setup work as `root` before stepping down, then keeping the default `USER root` is appropriate (like `mysql`); otherwise, `USER` set to an app-specific user in the Dockerfile is better from a security perspective. If the app doesn't really have a user, we just leave it up to the person running it (like `python`).
Also, making sure that it still works (as much as possible) when it is run as an unexpected/random user is helpful, both for things like OpenShift and for local development where users might run as their own user id in order to bind-mount files or directories into the container.
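For example, a quick way to exercise that locally (an illustrative smoke test, not something from the repo):

```bash
# Run as an arbitrary UID that has no /etc/passwd entry (GID 0 mimics
# OpenShift's default supplemental group) and check the image still starts.
docker run --rm --user 100001:0 spark:scala /opt/spark/bin/spark-submit --version
```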
We will take a deeper look. So the suggestion is to use the `libnss_wrapper` fake user when the driver/executor is starting, but skip the `libnss_wrapper` step for other commands?
Yeah, I'd recommend using `libnss_wrapper` only for the Spark driver/executor. Hopefully it is a relatively simple drop-in.
@yosifkit Thanks for your patient review!
- https://github.com/apache/spark-docker/pull/43: we now use the `spark` USER by default and also addressed your inline comments.
- https://github.com/apache/spark-docker/pull/45: here is the libnss_wrapper PR to resolve the `/etc/passwd` wide-permissions issue (the last one!).
Now all issues have been addressed! I believe DOI for Spark is almost ready!
Diff for d46b6dfd93ad8f470e8df11846dce14a49465ae2:
diff --git a/_bashbrew-arches b/_bashbrew-arches
index 8b13789..e85a97f 100644
--- a/_bashbrew-arches
+++ b/_bashbrew-arches
@@ -1 +1,2 @@
-
+amd64
+arm64v8
diff --git a/_bashbrew-cat b/_bashbrew-cat
index bdfae4a..9c6c450 100644
--- a/_bashbrew-cat
+++ b/_bashbrew-cat
@@ -1 +1,22 @@
-Maintainers: New Image! :D (@docker-library-bot)
+Maintainers: Apache Spark Developers <[email protected]> (@ApacheSpark)
+GitRepo: https://github.com/apache/spark-docker.git
+
+Tags: 3.4.0-scala2.12-java11-python3-r-ubuntu
+Architectures: amd64, arm64v8
+GitCommit: c07ae18355678370fd270bedb8b39ab2aceb5ac2
+Directory: 3.4.0/scala2.12-java11-python3-r-ubuntu
+
+Tags: 3.4.0-scala2.12-java11-python3-ubuntu, 3.4.0-python3, python3, 3.4.0, latest
+Architectures: amd64, arm64v8
+GitCommit: c07ae18355678370fd270bedb8b39ab2aceb5ac2
+Directory: 3.4.0/scala2.12-java11-python3-ubuntu
+
+Tags: 3.4.0-scala2.12-java11-r-ubuntu, 3.4.0-r, r
+Architectures: amd64, arm64v8
+GitCommit: c07ae18355678370fd270bedb8b39ab2aceb5ac2
+Directory: 3.4.0/scala2.12-java11-r-ubuntu
+
+Tags: 3.4.0-scala2.12-java11-ubuntu, 3.4.0-scala, scala
+Architectures: amd64, arm64v8
+GitCommit: c07ae18355678370fd270bedb8b39ab2aceb5ac2
+Directory: 3.4.0/scala2.12-java11-ubuntu
diff --git a/_bashbrew-list b/_bashbrew-list
index e69de29..f72d637 100644
--- a/_bashbrew-list
+++ b/_bashbrew-list
@@ -0,0 +1,12 @@
+spark:3.4.0
+spark:3.4.0-python3
+spark:3.4.0-r
+spark:3.4.0-scala
+spark:3.4.0-scala2.12-java11-python3-r-ubuntu
+spark:3.4.0-scala2.12-java11-python3-ubuntu
+spark:3.4.0-scala2.12-java11-r-ubuntu
+spark:3.4.0-scala2.12-java11-ubuntu
+spark:latest
+spark:python3
+spark:r
+spark:scala
diff --git a/_bashbrew-list-build-order b/_bashbrew-list-build-order
index e69de29..4970f5e 100644
--- a/_bashbrew-list-build-order
+++ b/_bashbrew-list-build-order
@@ -0,0 +1,4 @@
+spark:scala
+spark:3.4.0-scala2.12-java11-python3-r-ubuntu
+spark:latest
+spark:r
diff --git a/spark_3.4.0-scala2.12-java11-python3-r-ubuntu/Dockerfile b/spark_3.4.0-scala2.12-java11-python3-r-ubuntu/Dockerfile
new file mode 100644
index 0000000..0f1962f
--- /dev/null
+++ b/spark_3.4.0-scala2.12-java11-python3-r-ubuntu/Dockerfile
@@ -0,0 +1,30 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+FROM spark:3.4.0-scala2.12-java11-ubuntu
+
+USER root
+
+RUN set -ex; \
+ apt-get update; \
+ apt install -y python3 python3-pip; \
+ apt install -y r-base r-base-dev; \
+ rm -rf /var/cache/apt/*; \
+ rm -rf /var/lib/apt/lists/*
+
+ENV R_HOME /usr/lib/R
+
+USER spark
diff --git a/spark_latest/Dockerfile b/spark_latest/Dockerfile
new file mode 100644
index 0000000..258d806
--- /dev/null
+++ b/spark_latest/Dockerfile
@@ -0,0 +1,27 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+FROM spark:3.4.0-scala2.12-java11-ubuntu
+
+USER root
+
+RUN set -ex; \
+ apt-get update; \
+ apt install -y python3 python3-pip; \
+ rm -rf /var/cache/apt/*; \
+ rm -rf /var/lib/apt/lists/*
+
+USER spark
diff --git a/spark_r/Dockerfile b/spark_r/Dockerfile
new file mode 100644
index 0000000..4c928c6
--- /dev/null
+++ b/spark_r/Dockerfile
@@ -0,0 +1,29 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+FROM spark:3.4.0-scala2.12-java11-ubuntu
+
+USER root
+
+RUN set -ex; \
+ apt-get update; \
+ apt install -y r-base r-base-dev; \
+ rm -rf /var/cache/apt/*; \
+ rm -rf /var/lib/apt/lists/*
+
+ENV R_HOME /usr/lib/R
+
+USER spark
diff --git a/spark_scala/Dockerfile b/spark_scala/Dockerfile
new file mode 100644
index 0000000..aa754b7
--- /dev/null
+++ b/spark_scala/Dockerfile
@@ -0,0 +1,81 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+FROM eclipse-temurin:11-jre-focal
+
+ARG spark_uid=185
+
+RUN groupadd --system --gid=${spark_uid} spark && \
+ useradd --system --uid=${spark_uid} --gid=spark spark
+
+RUN set -ex; \
+ apt-get update; \
+ ln -s /lib /lib64; \
+ apt install -y gnupg2 wget bash tini libc6 libpam-modules krb5-user libnss3 procps net-tools gosu libnss-wrapper; \
+ mkdir -p /opt/spark; \
+ mkdir /opt/spark/python; \
+ mkdir -p /opt/spark/examples; \
+ mkdir -p /opt/spark/work-dir; \
+ chmod g+w /opt/spark/work-dir; \
+ touch /opt/spark/RELEASE; \
+ chown -R spark:spark /opt/spark; \
+ echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su; \
+ rm -rf /var/cache/apt/*; \
+ rm -rf /var/lib/apt/lists/*
+
+# Install Apache Spark
+# https://downloads.apache.org/spark/KEYS
+ENV SPARK_TGZ_URL=https://archive.apache.org/dist/spark/spark-3.4.0/spark-3.4.0-bin-hadoop3.tgz \
+ SPARK_TGZ_ASC_URL=https://archive.apache.org/dist/spark/spark-3.4.0/spark-3.4.0-bin-hadoop3.tgz.asc \
+ GPG_KEY=CC68B3D16FE33A766705160BA7E57908C7A4E1B1
+
+RUN set -ex; \
+ export SPARK_TMP="$(mktemp -d)"; \
+ cd $SPARK_TMP; \
+ wget -nv -O spark.tgz "$SPARK_TGZ_URL"; \
+ wget -nv -O spark.tgz.asc "$SPARK_TGZ_ASC_URL"; \
+ export GNUPGHOME="$(mktemp -d)"; \
+ gpg --keyserver hkps://keys.openpgp.org --recv-key "$GPG_KEY" || \
+ gpg --keyserver hkps://keyserver.ubuntu.com --recv-keys "$GPG_KEY"; \
+ gpg --batch --verify spark.tgz.asc spark.tgz; \
+ gpgconf --kill all; \
+ rm -rf "$GNUPGHOME" spark.tgz.asc; \
+ \
+ tar -xf spark.tgz --strip-components=1; \
+ chown -R spark:spark .; \
+ mv jars /opt/spark/; \
+ mv bin /opt/spark/; \
+ mv sbin /opt/spark/; \
+ mv kubernetes/dockerfiles/spark/decom.sh /opt/; \
+ mv examples /opt/spark/; \
+ mv kubernetes/tests /opt/spark/; \
+ mv data /opt/spark/; \
+ mv python/pyspark /opt/spark/python/pyspark/; \
+ mv python/lib /opt/spark/python/lib/; \
+ mv R /opt/spark/; \
+ chmod a+x /opt/decom.sh; \
+ cd ..; \
+ rm -rf "$SPARK_TMP";
+
+COPY entrypoint.sh /opt/
+
+ENV SPARK_HOME /opt/spark
+
+WORKDIR /opt/spark/work-dir
+
+USER spark
+
+ENTRYPOINT [ "/opt/entrypoint.sh" ]
diff --git a/spark_scala/entrypoint.sh b/spark_scala/entrypoint.sh
new file mode 100755
index 0000000..08fc925
--- /dev/null
+++ b/spark_scala/entrypoint.sh
@@ -0,0 +1,123 @@
+#!/bin/bash
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+attempt_setup_fake_passwd_entry() {
+ # Check whether there is a passwd entry for the container UID
+ local myuid; myuid="$(id -u)"
+ # If there is no passwd entry for the container UID, attempt to fake one
+ # You can also refer to the https://github.com/docker-library/official-images/pull/13089#issuecomment-1534706523
+ # It's to resolve OpenShift random UID case.
+ # See also: https://github.com/docker-library/postgres/pull/448
+ if ! getent passwd "$myuid" &> /dev/null; then
+ local wrapper
+ for wrapper in {/usr,}/lib{/*,}/libnss_wrapper.so; do
+ if [ -s "$wrapper" ]; then
+ NSS_WRAPPER_PASSWD="$(mktemp)"
+ NSS_WRAPPER_GROUP="$(mktemp)"
+ export LD_PRELOAD="$wrapper" NSS_WRAPPER_PASSWD NSS_WRAPPER_GROUP
+ local mygid; mygid="$(id -g)"
+ printf 'spark:x:%s:%s:${SPARK_USER_NAME:-anonymous uid}:%s:/bin/false\n' "$myuid" "$mygid" "$SPARK_HOME" > "$NSS_WRAPPER_PASSWD"
+ printf 'spark:x:%s:\n' "$mygid" > "$NSS_WRAPPER_GROUP"
+ break
+ fi
+ done
+ fi
+}
+
+if [ -z "$JAVA_HOME" ]; then
+ JAVA_HOME=$(java -XshowSettings:properties -version 2>&1 > /dev/null | grep 'java.home' | awk '{print $3}')
+fi
+
+SPARK_CLASSPATH="$SPARK_CLASSPATH:${SPARK_HOME}/jars/*"
+for v in "${!SPARK_JAVA_OPT_@}"; do
+ SPARK_EXECUTOR_JAVA_OPTS+=( "${!v}" )
+done
+
+if [ -n "$SPARK_EXTRA_CLASSPATH" ]; then
+ SPARK_CLASSPATH="$SPARK_CLASSPATH:$SPARK_EXTRA_CLASSPATH"
+fi
+
+if ! [ -z ${PYSPARK_PYTHON+x} ]; then
+ export PYSPARK_PYTHON
+fi
+if ! [ -z ${PYSPARK_DRIVER_PYTHON+x} ]; then
+ export PYSPARK_DRIVER_PYTHON
+fi
+
+# If HADOOP_HOME is set and SPARK_DIST_CLASSPATH is not set, set it here so Hadoop jars are available to the executor.
+# It does not set SPARK_DIST_CLASSPATH if already set, to avoid overriding customizations of this value from elsewhere e.g. Docker/K8s.
+if [ -n "${HADOOP_HOME}" ] && [ -z "${SPARK_DIST_CLASSPATH}" ]; then
+ export SPARK_DIST_CLASSPATH="$($HADOOP_HOME/bin/hadoop classpath)"
+fi
+
+if ! [ -z ${HADOOP_CONF_DIR+x} ]; then
+ SPARK_CLASSPATH="$HADOOP_CONF_DIR:$SPARK_CLASSPATH";
+fi
+
+if ! [ -z ${SPARK_CONF_DIR+x} ]; then
+ SPARK_CLASSPATH="$SPARK_CONF_DIR:$SPARK_CLASSPATH";
+elif ! [ -z ${SPARK_HOME+x} ]; then
+ SPARK_CLASSPATH="$SPARK_HOME/conf:$SPARK_CLASSPATH";
+fi
+
+# Switch to spark if no USER specified (root by default) otherwise use USER directly
+switch_spark_if_root() {
+ if [ $(id -u) -eq 0 ]; then
+ echo gosu spark
+ fi
+}
+
+case "$1" in
+ driver)
+ shift 1
+ CMD=(
+ "$SPARK_HOME/bin/spark-submit"
+ --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS"
+ --deploy-mode client
+ "$@"
+ )
+ attempt_setup_fake_passwd_entry
+ # Execute the container CMD under tini for better hygiene
+ exec $(switch_spark_if_root) /usr/bin/tini -s -- "${CMD[@]}"
+ ;;
+ executor)
+ shift 1
+ CMD=(
+ ${JAVA_HOME}/bin/java
+ "${SPARK_EXECUTOR_JAVA_OPTS[@]}"
+ -Xms$SPARK_EXECUTOR_MEMORY
+ -Xmx$SPARK_EXECUTOR_MEMORY
+ -cp "$SPARK_CLASSPATH:$SPARK_DIST_CLASSPATH"
+ org.apache.spark.scheduler.cluster.k8s.KubernetesExecutorBackend
+ --driver-url $SPARK_DRIVER_URL
+ --executor-id $SPARK_EXECUTOR_ID
+ --cores $SPARK_EXECUTOR_CORES
+ --app-id $SPARK_APPLICATION_ID
+ --hostname $SPARK_EXECUTOR_POD_IP
+ --resourceProfileId $SPARK_RESOURCE_PROFILE_ID
+ --podName $SPARK_EXECUTOR_POD_NAME
+ )
+ attempt_setup_fake_passwd_entry
+ # Execute the container CMD under tini for better hygiene
+ exec $(switch_spark_if_root) /usr/bin/tini -s -- "${CMD[@]}"
+ ;;
+
+ *)
+ # Non-spark-on-k8s command provided, proceeding in pass-through mode...
+ exec "$@"
+ ;;
+esac
Just a few minor bits of feedback.
- Have you considered a `set -eo pipefail` in the entrypoint script to help prevent any errors from being silently ignored?
- Some of the environment variables used, like `SPARK_EXECUTOR_MEMORY`, `SPARK_DRIVER_BIND_ADDRESS`, and `SPARK_DRIVER_URL` (maybe others), don't have a default in the entrypoint script and I can't find documentation on them. Will those be added to the Docker Hub readme or maybe your Kubernetes page or configuration page? (A rough sketch touching on both points follows this list.)
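An illustration of what that could look like near the top of the entrypoint (the defaults shown are placeholders, not suggested values):

```bash
#!/bin/bash
# Fail fast: abort on errors, including failures inside pipelines.
set -eo pipefail

# Illustrative only: give otherwise-undocumented variables an explicit default,
# or fail early with a clear message when a required one is unset.
: "${SPARK_EXECUTOR_MEMORY:=1g}"            # placeholder default
: "${SPARK_DRIVER_BIND_ADDRESS:=0.0.0.0}"   # placeholder default
: "${SPARK_DRIVER_URL:?SPARK_DRIVER_URL must be set by the launching driver}"
```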
I've got a few questions on the tagging.
Is it going to be common for users to actually want Python or R but not both? Is the size difference significant? (They're both just installed directly from distro packages, right? Is their usage common enough to warrant an explicit variant, or should they rather be part of some generic documentation about how to extend the image with "additional tools" or something similar? That'd be a ~2-line `Dockerfile` :sweat_smile:)
You should also consider swapping `ubuntu` in your tags for the more explicit `focal`, so that when you inevitably upgrade to `jammy`, users have a way to easily refer to the older version if things break for them. :eyes: