Support of spark-shell
Hi,
I would like to know if it's currently possible to submit a spark-shell job using the spark operator.
I tried to deploy a SparkApplication with mainClass: org.apache.spark.repl.Main, but it does not work. In client mode I get:
```
Application State:
  Error Message: Driver Pod not found
  State: FAILING
```
and in cluster mode, I get:
```
Warning  SparkApplicationSubmissionFailed  2s  spark-operator  failed to submit SparkApplication spark-shell: failed to run spark-submit for SparkApplication default/spark-shell: Exception in thread "main" org.apache.spark.SparkException: Cluster deploy mode is not applicable to Spark shells.
  at org.apache.spark.deploy.SparkSubmit.error(SparkSubmit.scala:857)
  at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:292)
```
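For reference, here is a minimal sketch of the kind of manifest I tried. It assumes the v1beta2 API; the image, Spark version, and the repl jar path are illustrative, not my exact values:

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-shell
  namespace: default
spec:
  type: Scala
  mode: cluster  # "client" instead fails with "Driver Pod not found"
  image: gcr.io/spark-operator/spark:v2.4.5  # illustrative image
  mainClass: org.apache.spark.repl.Main
  # illustrative path; the repl jar ships inside the Spark image
  mainApplicationFile: local:///opt/spark/jars/spark-repl_2.11-2.4.5.jar
  sparkVersion: "2.4.5"
  driver:
    cores: 1
    serviceAccount: spark
  executor:
    instances: 2
    cores: 1
```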
I think this should not be too difficult to achieve, but a piece is missing here.
Thanks in advance.
The Spark operator in its current form is well suited for cluster-mode jobs, but not so well for interactive client-mode apps, e.g., spark-shell. We can make the operator create a pod running spark-shell and a headless service for the executors to connect to the driver. Is this something you are interested in?
Yes, this is exactly what I had in mind; it would be very handy. At the moment I'm going to go with a Helm chart, but the operator with all its configuration options would be great.
Question: how do you plan to use the pod running spark-shell? Attaching to the container and running spark-shell? I'm not sure what the entry point of the driver container would be in this case.
I am using the spark-shell command as the entrypoint and attach to the pod with kubectl attach -it my-driver. The only downside is that I cannot detach without stopping the Spark driver.
Here is the Helm template I use:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: {{ include "spark-shell-k8s.fullname" . }}
  labels:
{{ include "spark-shell-k8s.labels" . | indent 4 }}
spec:
  imagePullSecrets:
    - name: {{ .Values.image.pullSecrets }}
  restartPolicy: Never
  serviceAccountName: {{ .Values.spark.serviceAccountName }}
  automountServiceAccountToken: true
  containers:
    - name: {{ .Chart.Name }}
      image: {{ .Values.image.repository }}:{{ .Values.image.tag }}
      imagePullPolicy: {{ .Values.image.pullPolicy }}
      tty: true
      stdin: true
      {{- if .Values.spark.persistentVolumeClaims }}
      volumeMounts:
        {{- range .Values.spark.persistentVolumeClaims }}
        - name: "{{- toString . }}-volume"
          mountPath: "/mnt/{{- toString . }}"
        {{- end }}
      {{- end }}
      command:
        - /opt/spark/bin/spark-shell
        - --master
        - k8s://https://$(KUBERNETES_SERVICE_HOST)
        - --name
        - {{ include "spark-shell-k8s.fullname" . }}
      args:
        - --conf
        - spark.driver.host={{ include "spark-shell-k8s.fullname" . }}.{{ .Release.Namespace }}
        - --conf
        - spark.driver.port=7079
        - --conf
        - spark.executor.instances={{ .Values.spark.executor.instances }}
        - --conf
        - spark.executor.cores={{ .Values.spark.executor.cores }}
        - --conf
        - spark.kubernetes.namespace={{ .Release.Namespace }}
        - --conf
        - spark.kubernetes.container.image={{ .Values.image.repository }}:{{ .Values.image.tag }}
        - --conf
        - spark.kubernetes.container.image.pullPolicy={{ .Values.image.pullPolicy }}
        - --conf
        - spark.kubernetes.container.image.pullSecrets={{ .Values.image.pullSecrets }}
        - --conf
        - spark.kubernetes.driver.pod.name={{ include "spark-shell-k8s.fullname" . }}
        {{- range .Values.spark.persistentVolumeClaims }}
        - --conf
        - spark.kubernetes.executor.volumes.persistentVolumeClaim.{{- toString . }}-volume.options.claimName={{- toString . }}
        - --conf
        - spark.kubernetes.executor.volumes.persistentVolumeClaim.{{- toString . }}-volume.mount.path=/mnt/{{- toString . }}
        {{- end }}
      ports:
        - name: driver
          protocol: TCP
          containerPort: 7079
        - name: ui
          protocol: TCP
          containerPort: 4040
  {{- if .Values.spark.persistentVolumeClaims }}
  volumes:
    {{- range .Values.spark.persistentVolumeClaims }}
    - name: "{{- toString . }}-volume"
      persistentVolumeClaim:
        claimName: "{{- toString . }}"
    {{- end }}
  {{- end }}
```
And there is an associated headless service on top of that.
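It only needs to be headless and select the driver pod, so that the spark.driver.host value above resolves to the pod. Here is a sketch of what it can look like; the spark-shell-k8s.name helper and the selector labels are assumptions and have to match whatever spark-shell-k8s.labels renders on the pod:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: {{ include "spark-shell-k8s.fullname" . }}
  labels:
{{ include "spark-shell-k8s.labels" . | indent 4 }}
spec:
  clusterIP: None  # headless: the service DNS name resolves directly to the pod IP
  selector:
    # assumption: these must match the labels rendered on the driver pod
    app.kubernetes.io/name: {{ include "spark-shell-k8s.name" . }}
    app.kubernetes.io/instance: {{ .Release.Name }}
  ports:
    - name: driver
      protocol: TCP
      port: 7079
      targetPort: driver
    - name: ui
      protocol: TCP
      port: 4040
      targetPort: ui
```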
Cool. We can definitely add support for this by creating a driver pod running the spark-shell, and a headless service for the driver pod.
Any progress on this? It would be good to have something like:
sparkctl shell --<pyspark|spark> <driver_name> -n <namespace>
Any progress on it?
Any progress on the issue?
@liyinan926 any progress on this issue?
Any progress on this issue?
Is there any progress now?
Any progress on this issue?