
Could we reconsider including the Spark history server in this repo?

Open nonpool opened this issue 4 years ago • 20 comments

Background: issue #164

reason:

  1. The Spark history server no longer has a stable, maintained chart, since the helm/charts repo was archived.
  2. As a user of spark-on-k8s-operator, there is a very high probability that the Spark history server is needed, because the web UI of the Spark driver becomes inaccessible after the Spark executors complete.

Of course, if you have a better Spark history visualization solution, you could also suggest it; I really could not find this information in the documentation.

What do you think?

nonpool avatar Jul 05 '21 02:07 nonpool
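For the history server to have anything to display, applications first have to write event logs to a durable, shared location. A minimal sketch of the relevant settings on a SparkApplication resource, assuming a hypothetical s3a://my-bucket/spark-events destination and that S3A credentials and jars are configured separately:

```yaml
# Sketch only, not a definitive configuration. The bucket name and path
# are placeholders; S3A jars/credentials are assumed to be set up elsewhere.
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: my-app            # placeholder name
spec:
  sparkConf:
    "spark.eventLog.enabled": "true"
    "spark.eventLog.dir": "s3a://my-bucket/spark-events"
```

A history server pointed at the same directory (via spark.history.fs.logDirectory) can then render the UI for completed applications.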

There is an online plan: https://github.com/datamechanics/delight

haolixu avatar Jul 06 '21 01:07 haolixu

There is an online plan: https://github.com/datamechanics/delight

Thanks for your suggestion. This is a great Spark history visualization solution (the visual style is very modern and the configuration is simple and easy to use), but its limitations are also obvious: the visualization page can only be used online and cannot be self-hosted, so it is not suitable for our scenario.

nonpool avatar Jul 06 '21 07:07 nonpool

You could write your own deployment and run the history server using ./sbin/start-history-server.sh. I have located an alternative hosting of the charts at https://artifacthub.io/packages/helm/spot/spark-history-server which might help.

indranilr avatar Jul 12 '21 16:07 indranilr

I am writing Spark events to S3, so I built a new Docker image that adds a couple of jars and just changes the entry point to run the Spark history server.

FROM gcr.io/spark-operator/spark:v3.1.1-hadoop3

USER root

ADD https://xxx.com/artifactory/apixio-spark/org/apache/hadoop/hadoop-aws/2.7.4/hadoop-aws-2.7.4.jar $SPARK_HOME/jars/
ADD https://xxx.com/artifactory/apixio-spark/com/amazonaws/aws-java-sdk-bundle/1.7.4.2/aws-java-sdk-1.7.4.2.jar $SPARK_HOME/jars/

ENTRYPOINT bash /opt/spark/sbin/spark-daemon.sh start org.apache.spark.deploy.history.HistoryServer 1

Then just a deployment yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: spark-hs-custom
    version: 3.1.1
  name: spark-hs-custom
spec:
  replicas: 1
  selector:
    matchLabels:
      app: spark-hs-custom
      version: 3.1.1
  template:
    metadata:
      labels:
        app: spark-hs-custom
        version: 3.1.1
    spec:
      containers:
        - env:
          - name: SPARK_NO_DAEMONIZE
            value: "false"
          - name: SPARK_HISTORY_OPTS
            value: -Dspark.history.fs.logDirectory=s3a://my-bucket-name/eventLogFolder
          image: xxx.dkr.ecr.us-west-2.amazonaws.com/xxx-spark-hs:v0.0.4
          imagePullPolicy: IfNotPresent
          name: spark-hs-custom
          ports:
            - containerPort: 18080
              name: http
              protocol: TCP
          resources:
            requests:
              cpu: "2"
              memory: 10Gi
            limits:
              cpu: "2"
              memory: 10Gi

jdonnelly-apixio avatar Jul 12 '21 20:07 jdonnelly-apixio
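To actually reach the UI of a deployment like the one above, a Service can be pointed at the pod's 18080 port. A minimal sketch, reusing the labels from that Deployment (ClusterIP is an assumption; swap for NodePort or an Ingress as needed):

```yaml
# Minimal Service sketch for the spark-hs-custom Deployment above.
# ClusterIP is assumed; the selector must match the pod labels exactly.
apiVersion: v1
kind: Service
metadata:
  name: spark-hs-custom
spec:
  selector:
    app: spark-hs-custom
    version: 3.1.1
  ports:
    - name: http
      port: 18080
      targetPort: 18080
      protocol: TCP
```

With this in place, `kubectl port-forward svc/spark-hs-custom 18080:18080` exposes the UI at http://localhost:18080 for a quick check.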

@jdonnelly-apixio: I tried your solution (with a USER different from root) and I get this error in the logs of the Spark history server: Exception in thread "main" java.io.IOException: failure to login at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:841). I think that a user must be created in the Docker image.

stephbat avatar Jul 13 '21 12:07 stephbat

@stephbat Yea, I think I hit that issue as well when I was running as someone other than root. Google's base image uses the user 185, but I wasn't able to get it to work with that one. Can you post what you did if you figure out a solution?

jdonnelly-apixio avatar Jul 13 '21 21:07 jdonnelly-apixio

Yea, confirmed, I get that exception when I try USER 185:

21/07/13 22:00:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" org.apache.hadoop.security.KerberosAuthException: failure to login: javax.security.auth.login.LoginException: java.lang.NullPointerException: invalid null input: name
        at jdk.security.auth/com.sun.security.auth.UnixPrincipal.<init>(Unknown Source)
        at jdk.security.auth/com.sun.security.auth.module.UnixLoginModule.login(Unknown Source)
        at java.base/javax.security.auth.login.LoginContext.invoke(Unknown Source)
        at java.base/javax.security.auth.login.LoginContext$4.run(Unknown Source)
        at java.base/javax.security.auth.login.LoginContext$4.run(Unknown Source)
        at java.base/java.security.AccessController.doPrivileged(Native Method)
        at java.base/javax.security.auth.login.LoginContext.invokePriv(Unknown Source)
        at java.base/javax.security.auth.login.LoginContext.login(Unknown Source)
        at org.apache.hadoop.security.UserGroupInformation$HadoopLoginContext.login(UserGroupInformation.java:1926)
        at org.apache.hadoop.security.UserGroupInformation.doSubjectLogin(UserGroupInformation.java:1837)
        at org.apache.hadoop.security.UserGroupInformation.createLoginUser(UserGroupInformation.java:710)
        at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:660)
        at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:571)
        at org.apache.spark.util.Utils$.$anonfun$getCurrentUserName$1(Utils.scala:2476)
        at scala.Option.getOrElse(Option.scala:189)
        at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2476)
        at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:79)
        at org.apache.spark.deploy.history.HistoryServer$.createSecurityManager(HistoryServer.scala:333)
        at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:294)
        at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)

        at org.apache.hadoop.security.UserGroupInformation.doSubjectLogin(UserGroupInformation.java:1847)
        at org.apache.hadoop.security.UserGroupInformation.createLoginUser(UserGroupInformation.java:710)
        at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:660)
        at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:571)
        at org.apache.spark.util.Utils$.$anonfun$getCurrentUserName$1(Utils.scala:2476)
        at scala.Option.getOrElse(Option.scala:189)
        at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2476)
        at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:79)
        at org.apache.spark.deploy.history.HistoryServer$.createSecurityManager(HistoryServer.scala:333)
        at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:294)
        at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
Caused by: javax.security.auth.login.LoginException: java.lang.NullPointerException: invalid null input: name
        at jdk.security.auth/com.sun.security.auth.UnixPrincipal.<init>(Unknown Source)
        at jdk.security.auth/com.sun.security.auth.module.UnixLoginModule.login(Unknown Source)
        at java.base/javax.security.auth.login.LoginContext.invoke(Unknown Source)
        at java.base/javax.security.auth.login.LoginContext$4.run(Unknown Source)
        at java.base/javax.security.auth.login.LoginContext$4.run(Unknown Source)
        at java.base/java.security.AccessController.doPrivileged(Native Method)
        at java.base/javax.security.auth.login.LoginContext.invokePriv(Unknown Source)
        at java.base/javax.security.auth.login.LoginContext.login(Unknown Source)
        at org.apache.hadoop.security.UserGroupInformation$HadoopLoginContext.login(UserGroupInformation.java:1926)
        at org.apache.hadoop.security.UserGroupInformation.doSubjectLogin(UserGroupInformation.java:1837)
        at org.apache.hadoop.security.UserGroupInformation.createLoginUser(UserGroupInformation.java:710)
        at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:660)
        at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:571)
        at org.apache.spark.util.Utils$.$anonfun$getCurrentUserName$1(Utils.scala:2476)
        at scala.Option.getOrElse(Option.scala:189)
        at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2476)
        at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:79)
        at org.apache.spark.deploy.history.HistoryServer$.createSecurityManager(HistoryServer.scala:333)
        at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:294)
        at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)

        at java.base/javax.security.auth.login.LoginContext.invoke(Unknown Source)
        at java.base/javax.security.auth.login.LoginContext$4.run(Unknown Source)
        at java.base/javax.security.auth.login.LoginContext$4.run(Unknown Source)
        at java.base/java.security.AccessController.doPrivileged(Native Method)
        at java.base/javax.security.auth.login.LoginContext.invokePriv(Unknown Source)
        at java.base/javax.security.auth.login.LoginContext.login(Unknown Source)
        at org.apache.hadoop.security.UserGroupInformation$HadoopLoginContext.login(UserGroupInformation.java:1926)
        at org.apache.hadoop.security.UserGroupInformation.doSubjectLogin(UserGroupInformation.java:1837)
        ... 10 more

jdonnelly-apixio avatar Jul 13 '21 22:07 jdonnelly-apixio

This works for me:

FROM gcr.io/spark-operator/spark:v3.1.1-hadoop3

USER root

ADD https://xxx.com/artifactory/apixio-spark/org/apache/hadoop/hadoop-aws/2.7.4/hadoop-aws-2.7.4.jar $SPARK_HOME/jars/
ADD https://xxx.com/artifactory/apixio-spark/com/amazonaws/aws-java-sdk-bundle/1.7.4.2/aws-java-sdk-1.7.4.2.jar $SPARK_HOME/jars/

RUN groupadd -g 185 spark && \
    useradd -u 185 -g 185 spark

USER 185
ENTRYPOINT bash /opt/spark/sbin/spark-daemon.sh start org.apache.spark.deploy.history.HistoryServer 1

jdonnelly-apixio avatar Jul 13 '21 22:07 jdonnelly-apixio

@jdonnelly-apixio @indranilr thanks for your solutions. In fact, I also use a Deployment to deploy the Spark history server, just like yours. But what I want to say is that being able to deploy the Spark history server ourselves is completely different from having it included in this helm repo.

I think we could simply change the values.yaml to make this work well.

nonpool avatar Jul 16 '21 08:07 nonpool

@nonpool yep, kind of agreed. it would be useful if the spark-operator supported the deployment of common services like spark history server, a hive metastore, prometheus server, etc.

Can a helm chart install other helm charts? If not, I'm not sure it would make sense to duplicate helm install functionality for things like a Prometheus server inside the spark-operator helm chart; the official prometheus-community helm chart should probably be used instead. Not sure what best practices would be from a helm standpoint... maybe just some additional documentation that shows, or links to, how to install some common, useful additional services would be a good start.

jdonnelly-apixio avatar Jul 20 '21 18:07 jdonnelly-apixio
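On the question above about a chart installing other charts: Helm does support this through chart dependencies declared in Chart.yaml (apiVersion v2). A hypothetical umbrella chart bundling the operator with supporting services might look like this; the versions are placeholders and the repository URLs are illustrative:

```yaml
# Hypothetical umbrella Chart.yaml; versions are placeholders and the
# repositories shown are examples, not an endorsement of specific sources.
apiVersion: v2
name: spark-platform
description: Spark operator plus supporting services
version: 0.1.0
dependencies:
  - name: spark-operator
    version: "1.x.x"
    repository: https://googlecloudplatform.github.io/spark-on-k8s-operator
  - name: kube-prometheus-stack
    version: "45.x.x"
    repository: https://prometheus-community.github.io/helm-charts
```

Running `helm dependency update` resolves the dependencies into the chart's charts/ directory before installation, so a single `helm install` deploys the whole bundle.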

Hi @stephbat @indranilr @nonpool @jdonnelly-apixio @haolixu, I am still confused about how to install the Spark history server on Kubernetes after a successful installation of the Spark Operator. Any pointers?

chetkhatri avatar May 12 '22 17:05 chetkhatri

@chetkhatri Here is my personal experience:

  1. Build an image with the following Dockerfile:
ARG SPARK_IMAGE=gcr.io/spark-operator/spark:v3.1.1-hadoop3
FROM ${SPARK_IMAGE}

USER root

# Setup dependencies for Google Cloud Storage access.
RUN rm $SPARK_HOME/jars/guava-*.jar
ADD https://repo1.maven.org/maven2/com/google/guava/guava/27.0-jre/guava-27.0-jre.jar $SPARK_HOME/jars
RUN chmod 644 $SPARK_HOME/jars/guava-27.0-jre.jar
# Add the connector jar needed to access Google Cloud Storage using the Hadoop FileSystem API.
ADD https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop3.jar $SPARK_HOME/jars
RUN chmod 644 $SPARK_HOME/jars/gcs-connector-latest-hadoop3.jar

USER ${spark_uid}

ENTRYPOINT ${SPARK_HOME}/sbin/start-history-server.sh

I've added some stuff related to GCP. Feel free to add your own for AWS/Azure, whatever...

  2. Create a new chart with helm create.
  3. Change the following in templates/deployment.yaml:

  • spec.template.spec.containers[0].ports.containerPort -> 18080
  • spec.template.spec.containers[0].env ->
env:
  - name: SPARK_NO_DAEMONIZE
    value: "false"
  - name: SPARK_CONF_DIR
    value: {{ .Values.config.sparkConfDirectory }}
  - name: SPARK_HISTORY_OPTS
    value: "-Dspark.history.fs.logDirectory={{ .Values.config.historyLogDirectory }}"

  4. Add a config section to values.yaml:
config:
  sparkConfDirectory: "/opt/spark/conf"
  historyLogDirectory: ""

  5. Fill in the values.yaml sections according to your needs (services, ingresses, etc.).
  6. Install your chart.
  7. Done. I hope I didn't forget anything.

oleksiilopasov avatar May 13 '22 17:05 oleksiilopasov
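Following the steps above, the config section added in step 4 might be filled in like this for the GCS setup in that Dockerfile. The bucket name is a placeholder, and the service section assumes the scaffolding that `helm create` generates:

```yaml
# Hypothetical values.yaml fragment for the chart described above.
config:
  sparkConfDirectory: "/opt/spark/conf"
  historyLogDirectory: "gs://my-bucket/spark-events"   # placeholder bucket
# Assumes the default service block from `helm create`; the port should
# match the containerPort changed in step 3.
service:
  type: ClusterIP
  port: 18080
```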

This works for me:

FROM gcr.io/spark-operator/spark:v3.1.1-hadoop3

USER root

ADD https://xxx.com/artifactory/apixio-spark/org/apache/hadoop/hadoop-aws/2.7.4/hadoop-aws-2.7.4.jar $SPARK_HOME/jars/
ADD https://xxx.com/artifactory/apixio-spark/com/amazonaws/aws-java-sdk-bundle/1.7.4.2/aws-java-sdk-1.7.4.2.jar $SPARK_HOME/jars/

RUN groupadd -g 185 spark && \
    useradd -u 185 -g 185 spark

USER 185
ENTRYPOINT bash /opt/spark/sbin/spark-daemon.sh start org.apache.spark.deploy.history.HistoryServer 1

Awesome, this works, but I have no idea how it works.

liangchen-datanerd avatar Jul 26 '22 06:07 liangchen-datanerd

Hi everyone, I have one question: are you also able to collect the driver/executor stdout/stderr logs that you can see through kubectl logs?

gFazzari avatar Jan 19 '23 14:01 gFazzari

Has anyone tried to do this more recently? I'm trying with the apache/spark:3.5.0 container image and have lost a day to dependency hell. A Spark History helm chart in this repo or documentation on how to get this up and running would be very welcome indeed.

vrd83 avatar Jan 16 '24 02:01 vrd83