
Support of spark-shell

Open · jblankfeld opened this issue Sep 17 '19 · 12 comments

Hi, I would like to know if it's currently possible to run spark-shell through the Spark operator. I tried to deploy a SparkApplication using mainClass: org.apache.spark.repl.Main, but this does not work. In client mode I get:

Application State:
    Error Message:  Driver Pod not found
    State:          FAILING

and in cluster mode, I get:

Warning  SparkApplicationSubmissionFailed  2s    spark-operator  failed to submit SparkApplication spark-shell: failed to run spark-submit for SparkApplication default/spark-shell: Exception in thread "main" org.apache.spark.SparkException: Cluster deploy mode is not applicable to Spark shells.
           at org.apache.spark.deploy.SparkSubmit.error(SparkSubmit.scala:857)
           at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:292)

I think this should not be too difficult to achieve, but a piece seems to be missing here.

Thanks in advance.
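For context, a SparkApplication along these lines might look roughly as follows (a sketch assuming the operator's v1beta2 API; the image, Spark version, and resource values are illustrative placeholders, not the reporter's actual spec):

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-shell
  namespace: default
spec:
  type: Scala
  mode: cluster               # "client" fails with "Driver Pod not found";
                              # "cluster" is rejected by spark-submit for shells
  image: my-registry/spark:2.4.4        # placeholder image
  mainClass: org.apache.spark.repl.Main
  sparkVersion: "2.4.4"                 # placeholder version
  driver:
    cores: 1
    memory: 512m
    serviceAccount: spark
  executor:
    instances: 2
    cores: 1
    memory: 512m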

jblankfeld avatar Sep 17 '19 12:09 jblankfeld

The Spark operator in its current form is well suited for cluster mode jobs, but not so well for interactive client mode apps, e.g., spark-shell. We could make the operator create a pod running spark-shell and a headless service for the executors to connect to the driver. Is this something you are interested in?

liyinan926 avatar Sep 17 '19 20:09 liyinan926

Yes, this is exactly what I had in mind; it would be very handy. For the moment I'm going to go with a Helm chart, but the operator, with all its configuration options, would be great.

jblankfeld avatar Sep 18 '19 08:09 jblankfeld

Question: how do you plan to use the pod running spark-shell? By attaching to the container and running spark-shell? I'm not sure what the entry point of the driver container would be in this case.

liyinan926 avatar Sep 18 '19 21:09 liyinan926

I am using the spark-shell command as the entrypoint, and I attach to the pod with kubectl attach -it my-driver. The only downside is that I cannot detach without stopping the Spark driver. Here is the Helm template I use:

apiVersion: v1
kind: Pod
metadata:
  name: {{ include "spark-shell-k8s.fullname" . }}
  labels:
{{ include "spark-shell-k8s.labels" . | indent 4 }}
spec:
  imagePullSecrets:
    - name: {{ .Values.image.pullSecrets }}
  restartPolicy: Never
  serviceAccountName: {{ .Values.spark.serviceAccountName }}
  automountServiceAccountToken: true
  containers:
    - name: {{ .Chart.Name }}
      image: {{ .Values.image.repository }}:{{ .Values.image.tag }}
      imagePullPolicy: {{ .Values.image.pullPolicy }}
      tty: true      # keep a TTY so the shell is usable
      stdin: true    # keep stdin open for kubectl attach -it
      {{- if .Values.spark.persistentVolumeClaims }}
      volumeMounts:
      {{- range .Values.spark.persistentVolumeClaims }}
        - name: "{{- toString . }}-volume"
          mountPath: "/mnt/{{- toString . }}"
      {{- end }}
      {{- end }}
      command:
        - /opt/spark/bin/spark-shell
        - --master
        - k8s://https://$(KUBERNETES_SERVICE_HOST)
        - --name
        - {{ include "spark-shell-k8s.fullname" . }}
      args:
        # executors resolve the driver through the headless service's DNS name
        - --conf
        - spark.driver.host={{ include "spark-shell-k8s.fullname" . }}.{{ .Release.Namespace }}
        - --conf
        - spark.driver.port=7079  # must match the "driver" containerPort below
        - --conf
        - spark.executor.instances={{ .Values.spark.executor.instances }}
        - --conf
        - spark.executor.cores={{ .Values.spark.executor.cores }}
        - --conf
        - spark.kubernetes.namespace={{ .Release.Namespace }}
        - --conf
        - spark.kubernetes.container.image={{ .Values.image.repository }}:{{ .Values.image.tag }}
        - --conf
        - spark.kubernetes.container.image.pullPolicy={{ .Values.image.pullPolicy }}
        - --conf
        - spark.kubernetes.container.image.pullSecrets={{ .Values.image.pullSecrets }}
        - --conf
        - spark.kubernetes.driver.pod.name={{ include "spark-shell-k8s.fullname" . }}
        {{- range .Values.spark.persistentVolumeClaims }}
        - --conf
        - spark.kubernetes.executor.volumes.persistentVolumeClaim.{{- toString . }}-volume.options.claimName={{- toString . }}
        - --conf
        - spark.kubernetes.executor.volumes.persistentVolumeClaim.{{- toString . }}-volume.mount.path=/mnt/{{- toString . }}
        {{- end }}
      ports:
        - name: driver
          protocol: TCP
          containerPort: 7079
        - name: ui
          protocol: TCP
          containerPort: 4040
  {{- if .Values.spark.persistentVolumeClaims }}
  volumes:
  {{- range .Values.spark.persistentVolumeClaims }}
    - name: "{{- toString . }}-volume"
      persistentVolumeClaim:
        claimName: "{{- toString . }}"
  {{- end }}
  {{- end }}

And there is an associated headless service on top of that.
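That headless service might look something like the following (a sketch: the service name must match the host part of spark.driver.host above, and the spark-shell-k8s.name selector helper is an assumption, not taken from the actual chart):

apiVersion: v1
kind: Service
metadata:
  name: {{ include "spark-shell-k8s.fullname" . }}
  labels:
{{ include "spark-shell-k8s.labels" . | indent 4 }}
spec:
  clusterIP: None    # headless: the DNS name resolves directly to the pod IP
  selector:
    app.kubernetes.io/name: {{ include "spark-shell-k8s.name" . }}  # assumed selector label
  ports:
    - name: driver
      port: 7079
      targetPort: driver
    - name: ui
      port: 4040
      targetPort: ui

The two details that matter are clusterIP: None, so that executors can reach the driver pod directly by DNS name, and a service name equal to what the pod passes as spark.driver.host.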

jblankfeld avatar Sep 25 '19 08:09 jblankfeld

Cool. We can definitely add support for this by creating a driver pod that runs spark-shell, plus a headless service for the driver pod.

liyinan926 avatar Sep 25 '19 15:09 liyinan926

Any progress on this? It would be good to have something like:

sparkctl shell --<pyspark|spark> <driver_name> -n <namespace>

kodelint avatar Nov 05 '19 22:11 kodelint

Any progress on it?

arghya18 avatar Jun 16 '21 19:06 arghya18

Any progress on the issue?

RedwanAlkurdi avatar Oct 06 '21 13:10 RedwanAlkurdi

@liyinan926 any progress on this issue?

enricorotundo avatar Oct 12 '21 13:10 enricorotundo

Any progress on this issue?

hopper-signifyd avatar Apr 07 '22 20:04 hopper-signifyd

Is there any progress now?

yikouyigetudou avatar Aug 11 '22 08:08 yikouyigetudou

Any progress on this issue?

zeddit avatar Nov 20 '23 15:11 zeddit