spark-operator
Could we reconsider including the Spark history server in this repo?
Background: issue #164
Reason:
The Spark history server has had no stable, maintained chart since the helm/charts repo was archived. As a user of spark-on-k8s-operator, you will very likely need the Spark history server, because the web UI of the Spark driver becomes inaccessible once the executors have completed.
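For context, the history server works by replaying event logs that applications persist to shared storage, so each job also has to enable event logging; e.g. in a SparkApplication spec (the bucket path is a placeholder):
sparkConf:
  "spark.eventLog.enabled": "true"
  "spark.eventLog.dir": "s3a://my-bucket/spark-events"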
Of course, if you have a better Spark history visualization solution, you could also suggest it. I really could not find this information in the documentation.
What do you think?
There is a hosted option: https://github.com/datamechanics/delight
Thanks for your suggestion. This is a great Spark history visualization solution (the visual style is very modern and the configuration is simple and easy to use), but its limitations are also obvious: the visualization page is only available as a hosted service and cannot be deployed by ourselves, so it is not suitable for our scenario.
You could write your own deployment and run the history server using ./sbin/start-history-server.sh. I have found an alternative hosting of the charts at https://artifacthub.io/packages/helm/spot/spark-history-server, which might help.
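For example, something like this should work (the repository URL is listed on that Artifact Hub page, and the value keys are assumptions carried over from the archived stable/spark-history-server chart):
helm repo add spark-history <repo-url-from-artifacthub>
helm install spark-history-server spark-history/spark-history-server \
  --set s3.enableS3=true \
  --set s3.logDirectory=s3a://my-bucket/spark-events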
I am writing Spark events to S3, so I built a new Docker image that adds a couple of jars and changes the entry point to run the Spark history server:
FROM gcr.io/spark-operator/spark:v3.1.1-hadoop3
USER root
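# Add the Hadoop S3A connector and a matching AWS SDK so the history server can read s3a:// event logs.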
ADD https://xxx.com/artifactory/apixio-spark/org/apache/hadoop/hadoop-aws/2.7.4/hadoop-aws-2.7.4.jar $SPARK_HOME/jars/
ADD https://xxx.com/artifactory/apixio-spark/com/amazonaws/aws-java-sdk-bundle/1.7.4.2/aws-java-sdk-1.7.4.2.jar $SPARK_HOME/jars/
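# Run the HistoryServer class in this container; the trailing 1 is spark-daemon.sh's instance number.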
ENTRYPOINT bash /opt/spark/sbin/spark-daemon.sh start org.apache.spark.deploy.history.HistoryServer 1
Then just a deployment yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: spark-hs-custom
    version: 3.1.1
  name: spark-hs-custom
spec:
  replicas: 1
  selector:
    matchLabels:
      app: spark-hs-custom
      version: 3.1.1
  template:
    metadata:
      labels:
        app: spark-hs-custom
        version: 3.1.1
    spec:
      containers:
      - env:
        - name: SPARK_NO_DAEMONIZE
          value: "false"
        - name: SPARK_HISTORY_OPTS
          value: -Dspark.history.fs.logDirectory=s3a://my-bucket-name/eventLogFolder
        image: xxx.dkr.ecr.us-west-2.amazonaws.com/xxx-spark-hs:v0.0.4
        imagePullPolicy: IfNotPresent
        name: spark-hs-custom
        ports:
        - containerPort: 18080
          name: http
          protocol: TCP
        resources:
          requests:
            cpu: "2"
            memory: 10Gi
          limits:
            cpu: "2"
            memory: 10Gi
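You would also want a Service in front of the Deployment to reach the UI; a minimal sketch matching the labels above:
apiVersion: v1
kind: Service
metadata:
  name: spark-hs-custom
spec:
  selector:
    app: spark-hs-custom
  ports:
  - name: http
    port: 18080
    targetPort: 18080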
@jdonnelly-apixio : I tried your solution (with a USER other than root) and I get this error in the logs of the Spark history server:
Exception in thread "main" java.io.IOException: failure to login
    at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:841)
I think that a user must be created in the Docker image.
@stephbat Yea, I think I hit that issue as well when I was running as someone other than root. Google's base image uses the user 185, but I wasn't able to get it to work with that one. Can you post what you did if you figure out a solution?
yea, confirmed I get that exception when I try to do a USER 185
21/07/13 22:00:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" org.apache.hadoop.security.KerberosAuthException: failure to login: javax.security.auth.login.LoginException: java.lang.NullPointerException: invalid null input: name
    at jdk.security.auth/com.sun.security.auth.UnixPrincipal.<init>(Unknown Source)
    at jdk.security.auth/com.sun.security.auth.module.UnixLoginModule.login(Unknown Source)
    at java.base/javax.security.auth.login.LoginContext.invoke(Unknown Source)
    at java.base/javax.security.auth.login.LoginContext$4.run(Unknown Source)
    at java.base/javax.security.auth.login.LoginContext$4.run(Unknown Source)
    at java.base/java.security.AccessController.doPrivileged(Native Method)
    at java.base/javax.security.auth.login.LoginContext.invokePriv(Unknown Source)
    at java.base/javax.security.auth.login.LoginContext.login(Unknown Source)
    at org.apache.hadoop.security.UserGroupInformation$HadoopLoginContext.login(UserGroupInformation.java:1926)
    at org.apache.hadoop.security.UserGroupInformation.doSubjectLogin(UserGroupInformation.java:1837)
    at org.apache.hadoop.security.UserGroupInformation.createLoginUser(UserGroupInformation.java:710)
    at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:660)
    at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:571)
    at org.apache.spark.util.Utils$.$anonfun$getCurrentUserName$1(Utils.scala:2476)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2476)
    at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:79)
    at org.apache.spark.deploy.history.HistoryServer$.createSecurityManager(HistoryServer.scala:333)
    at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:294)
    at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
    at org.apache.hadoop.security.UserGroupInformation.doSubjectLogin(UserGroupInformation.java:1847)
    at org.apache.hadoop.security.UserGroupInformation.createLoginUser(UserGroupInformation.java:710)
    at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:660)
    at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:571)
    at org.apache.spark.util.Utils$.$anonfun$getCurrentUserName$1(Utils.scala:2476)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2476)
    at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:79)
    at org.apache.spark.deploy.history.HistoryServer$.createSecurityManager(HistoryServer.scala:333)
    at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:294)
    at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
Caused by: javax.security.auth.login.LoginException: java.lang.NullPointerException: invalid null input: name
    at jdk.security.auth/com.sun.security.auth.UnixPrincipal.<init>(Unknown Source)
    at jdk.security.auth/com.sun.security.auth.module.UnixLoginModule.login(Unknown Source)
    at java.base/javax.security.auth.login.LoginContext.invoke(Unknown Source)
    at java.base/javax.security.auth.login.LoginContext$4.run(Unknown Source)
    at java.base/javax.security.auth.login.LoginContext$4.run(Unknown Source)
    at java.base/java.security.AccessController.doPrivileged(Native Method)
    at java.base/javax.security.auth.login.LoginContext.invokePriv(Unknown Source)
    at java.base/javax.security.auth.login.LoginContext.login(Unknown Source)
    at org.apache.hadoop.security.UserGroupInformation$HadoopLoginContext.login(UserGroupInformation.java:1926)
    at org.apache.hadoop.security.UserGroupInformation.doSubjectLogin(UserGroupInformation.java:1837)
    at org.apache.hadoop.security.UserGroupInformation.createLoginUser(UserGroupInformation.java:710)
    at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:660)
    at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:571)
    at org.apache.spark.util.Utils$.$anonfun$getCurrentUserName$1(Utils.scala:2476)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2476)
    at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:79)
    at org.apache.spark.deploy.history.HistoryServer$.createSecurityManager(HistoryServer.scala:333)
    at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:294)
    at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
    at java.base/javax.security.auth.login.LoginContext.invoke(Unknown Source)
    at java.base/javax.security.auth.login.LoginContext$4.run(Unknown Source)
    at java.base/javax.security.auth.login.LoginContext$4.run(Unknown Source)
    at java.base/java.security.AccessController.doPrivileged(Native Method)
    at java.base/javax.security.auth.login.LoginContext.invokePriv(Unknown Source)
    at java.base/javax.security.auth.login.LoginContext.login(Unknown Source)
    at org.apache.hadoop.security.UserGroupInformation$HadoopLoginContext.login(UserGroupInformation.java:1926)
    at org.apache.hadoop.security.UserGroupInformation.doSubjectLogin(UserGroupInformation.java:1837)
    ... 10 more
This works for me:
FROM gcr.io/spark-operator/spark:v3.1.1-hadoop3
USER root
ADD https://xxx.com/artifactory/apixio-spark/org/apache/hadoop/hadoop-aws/2.7.4/hadoop-aws-2.7.4.jar $SPARK_HOME/jars/
ADD https://xxx.com/artifactory/apixio-spark/com/amazonaws/aws-java-sdk-bundle/1.7.4.2/aws-java-sdk-1.7.4.2.jar $SPARK_HOME/jars/
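# Create a passwd entry for UID 185 so Hadoop's UserGroupInformation can resolve a login user.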
RUN groupadd -g 185 spark && \
useradd -u 185 -g 185 spark
USER 185
ENTRYPOINT bash /opt/spark/sbin/spark-daemon.sh start org.apache.spark.deploy.history.HistoryServer 1
@jdonnelly-apixio @indranilr thanks for your solutions.
In fact, I also deploy the Spark history server with a Deployment, just like yours.
But what I want to say is that deploying the Spark history server ourselves is completely different from having it included in this Helm repo.
I think we could simply extend the values.yaml to make this work well.
@nonpool yep, kind of agreed. It would be useful if the spark-operator supported deploying common companion services like the Spark history server, a Hive metastore, a Prometheus server, etc.
Can a Helm chart install other Helm charts? (If not, I'm not sure it would make sense to duplicate Helm's install functionality for things like a Prometheus server inside the spark-operator chart; the official prometheus-community chart should probably be used instead.) I'm not sure what the best practices are from a Helm standpoint... maybe some additional documentation that shows, or links to, how to install some common useful additional services would be a good start.
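(For reference, Helm does support this via subcharts: a chart can declare other charts as dependencies in its Chart.yaml. A sketch, with an illustrative version and condition:)
dependencies:
  - name: prometheus
    version: "15.x.x"
    repository: https://prometheus-community.github.io/helm-charts
    condition: prometheus.enabled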
Hi @stephbat @indranilr @nonpool @jdonnelly-apixio @haolixu, I am still confused about how to install the Spark History Server on Kubernetes after successfully installing the Spark Operator. Any pointers?
@chetkhatri Here is my personal experience:
1. Build an image with the following Dockerfile:
ARG SPARK_IMAGE=gcr.io/spark-operator/spark:v3.1.1-hadoop3
FROM ${SPARK_IMAGE}
USER root
# Setup dependencies for Google Cloud Storage access.
RUN rm $SPARK_HOME/jars/guava-*.jar
ADD https://repo1.maven.org/maven2/com/google/guava/guava/27.0-jre/guava-27.0-jre.jar $SPARK_HOME/jars
RUN chmod 644 $SPARK_HOME/jars/guava-27.0-jre.jar
# Add the connector jar needed to access Google Cloud Storage using the Hadoop FileSystem API.
ADD https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop3.jar $SPARK_HOME/jars
RUN chmod 644 $SPARK_HOME/jars/gcs-connector-latest-hadoop3.jar
# Declare the UID build arg (185 is the spark user in the base image) so USER resolves.
ARG spark_uid=185
USER ${spark_uid}
ENTRYPOINT ${SPARK_HOME}/sbin/start-history-server.sh
I've added some stuff related to GCP. Feel free to add your own for AWS/Azure whatever...
2. Create a new chart with helm create.
3. Change the following in templates/deployment.yaml:
   - spec.template.spec.containers[0].ports[0].containerPort -> 18080
   - spec.template.spec.containers[0].env ->
env:
  - name: SPARK_NO_DAEMONIZE
    value: "false"
  - name: SPARK_CONF_DIR
    value: {{ .Values.config.sparkConfDirectory }}
  - name: SPARK_HISTORY_OPTS
    value: "-Dspark.history.fs.logDirectory={{ .Values.config.historyLogDirectory }}"
4. Add a config section to values.yaml:
config:
  sparkConfDirectory: "/opt/spark/conf"
  historyLogDirectory: ""
5. Fill in the values.yaml sections according to your needs (services, ingresses, etc.).
6. Install your chart (see the example command below).
7. Done. I hope I didn't forget anything.
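For step 6, the install command would look something like this (chart directory, release name, and bucket are placeholders):
helm install spark-history-server ./spark-history-server \
  --set config.historyLogDirectory=gs://my-bucket/spark-events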
Awesome, the Dockerfile above works for me too, but I have no idea how it works.
Hi everyone, I've one question: are you also able to collect the driver/executor stdout/stderr logs that you can see through kubectl logs?
Has anyone tried to do this more recently? I'm trying with the apache/spark:3.5.0 container image and have lost a day to dependency hell. A Spark History helm chart in this repo or documentation on how to get this up and running would be very welcome indeed.
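An untested sketch that may help, assuming the Hadoop 3.3.4 bundled with Spark 3.5.0 (the jar versions must match the image's Hadoop, which is exactly where the dependency hell comes from); set SPARK_HISTORY_OPTS at deploy time as in the Deployment earlier in this thread:
FROM apache/spark:3.5.0
USER root
# hadoop-aws must match the Hadoop version bundled with the image (3.3.4 for Spark 3.5.0),
# and aws-java-sdk-bundle must match what that hadoop-aws release was built against.
ADD https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar $SPARK_HOME/jars/
ADD https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.262/aws-java-sdk-bundle-1.12.262.jar $SPARK_HOME/jars/
RUN chmod 644 $SPARK_HOME/jars/hadoop-aws-3.3.4.jar $SPARK_HOME/jars/aws-java-sdk-bundle-1.12.262.jar
# Keep the process in the foreground so the container does not exit after forking.
ENV SPARK_NO_DAEMONIZE=true
USER spark
ENTRYPOINT ["/opt/spark/sbin/start-history-server.sh"]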