Spark Thrift Server (STS) CRD

Open nicholas-fwang opened this issue 4 years ago • 12 comments

Spark Thrift Server is a daemon that executes Spark SQL through a JDBC/ODBC connector. It is useful as Hive's execution engine and with BI tools that support JDBC/ODBC.

I have deployed thrift server on Kubernetes as below.

apiVersion: v1
kind: Pod
metadata:
  name: spark-thrift-server
  labels:
    foo: bar
spec:
  containers:
  - name: spark-thrift-server
    image: gcr.io/spark-operator/spark:v3.0.0
    args:
      - /opt/spark/bin/spark-submit
      - --master
      - k8s://https://xx.xx.x.x:443
      - --class
      - org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
      - --deploy-mode
      - client
      - --name
      - spark-sql
      - --hiveconf
      - hive.server2.thrift.port=10000
      - --conf
      - spark.executor.instances=1
      - --conf
      - spark.executor.memory=1G
      - --conf
      - spark.driver.memory=1G
      - --conf
      - spark.executor.cores=1
      - --conf
      - spark.kubernetes.namespace=spark-operator
      - --conf
      - spark.kubernetes.container.image=gcr.io/spark-operator/spark:v3.0.0
      - --conf
      - spark.kubernetes.authenticate.driver.serviceAccountName=spark-operator
      - --conf
      - spark.kubernetes.driver.pod.name=spark-thrift-server
    ports:
    - containerPort: 4040
      name: spark-ui
      protocol: TCP
    - containerPort: 10000
      name: spark-thrift
      protocol: TCP
  serviceAccountName: spark-operator
---
apiVersion: v1
kind: Service
metadata:
  name: spark-thrift-server
spec:
  clusterIP: None
  ports:
  - name: spark-ui
    port: 4040
    protocol: TCP
    targetPort: 4040
  - name: spark-thrift
    port: 10000
    protocol: TCP
    targetPort: 10000
  selector:
    foo: bar
  sessionAffinity: None
  type: ClusterIP

Since STS must be deployed in client mode, the Spark Operator cannot be used for it directly. Therefore, I propose creating a new CRD so that STS can be deployed through the Spark Operator. Deployment through the operator would have the following advantages (an illustrative sketch of such a CRD follows the list):

  • Failover of the driver (thrift server)
  • Batch scheduler integration
  • Integrated management, including monitoring, alongside SparkApplications
  • Attaching additional Kubernetes resources through the mutating/validating webhook
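
For illustration only, a minimal sketch of what such a CRD could look like, modeled on the existing SparkApplication conventions. The SparkThriftServer kind and every field below are hypothetical, not an implemented API:

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkThriftServer            # hypothetical kind, not implemented
metadata:
  name: spark-thrift-server
  namespace: spark-operator
spec:
  image: gcr.io/spark-operator/spark:v3.0.0
  thriftPort: 10000                # would be exposed via an operator-managed Service
  driver:
    cores: 1
    memory: 1G
    serviceAccount: spark-operator
  executor:
    instances: 1
    cores: 1
    memory: 1G
  restartPolicy:
    type: Always                   # operator restarts the driver on failure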

nicholas-fwang avatar Dec 15 '20 07:12 nicholas-fwang

We had some trouble rewriting this pod into a Deployment, but solved it by adding "spark.kubernetes.driver.pod.name" and "spark.driver.host" to the conf through environment variables. The problem seems to be that otherwise spark.kubernetes.driver.pod.name needs to be the name of both the created pod AND the service (which is a problem for Deployments).

apiVersion: apps/v1
kind: Deployment
metadata:
  name: spark-thrift-server
  labels:
    foo: bar
spec:
  replicas: 1
  selector:
    matchLabels:
      foo: bar
  template:
    metadata:
      labels:
        foo: bar
    spec:
      containers:
      - name: spark-thrift-server
        image: gcr.io/spark-operator/spark:v3.0.0
        args:
          - /opt/spark/bin/spark-submit
          - --master
          - k8s://https://xx.xx.x.x:443
          - --class
          - org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
          - --deploy-mode
          - client
          - --name
          - spark-sql
          - --hiveconf
          - hive.server2.thrift.port=10000
          - --conf
          - spark.executor.instances=1
          - --conf
          - spark.executor.memory=1G
          - --conf
          - spark.driver.memory=1G
          - --conf
          - spark.executor.cores=1
          - --conf
          - spark.kubernetes.namespace=spark-operator
          - --conf
          - spark.kubernetes.container.image=gcr.io/spark-operator/spark:v3.0.0
          - --conf
          - spark.kubernetes.authenticate.driver.serviceAccountName=spark-operator
          - --conf
          - spark.kubernetes.driver.pod.name=$(THRIFT_POD_NAME)
          - --conf
          - spark.driver.bindAddress=$(THRIFT_POD_IP)
          - --conf
          - spark.driver.host=spark-thrift-server
        env:
        - name: THRIFT_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: THRIFT_POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        ports:
        - containerPort: 4040
          name: spark-ui
          protocol: TCP    
        - containerPort: 10000
          name: spark-thrift
          protocol: TCP
      serviceAccountName: spark-operator
---
apiVersion: v1
kind: Service
metadata:
  name: spark-thrift-server
spec:
  clusterIP: None
  ports:
  - name: spark-ui
    port: 4040
    protocol: TCP
    targetPort: 4040
  - name: spark-thrift
    port: 10000
    protocol: TCP
    targetPort: 10000
  selector:
    foo: bar
  sessionAffinity: None
  type: ClusterIP
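
To sanity-check a Deployment like this, one can exec into the driver pod and connect with the beeline client bundled with the Spark distribution. A minimal sketch; the namespace and the SHOW DATABASES query are only illustrative:

# resolve a pod behind the Deployment and open a JDBC session against it
kubectl exec -n spark-operator -it deploy/spark-thrift-server -- \
  /opt/spark/bin/beeline -u jdbc:hive2://localhost:10000 -e 'SHOW DATABASES;'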

manuelneubergerwn avatar Mar 18 '21 15:03 manuelneubergerwn

Hi @fisache and @ManuelNeubergerWN!

I use start-thriftserver.sh to run STS and cannot find the right way to stop it. This is the part of my config showing how I start and stop STS:

command:
- /bin/sh
- -c
- >
  /opt/spark/sbin/start-thriftserver.sh \
    --name sts \
    --hiveconf hive.server2.thrift.port=10000 \
    --conf spark.driver.host=$(hostname -I) \
    --conf spark.kubernetes.driver.pod.name=$(hostname);
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "/opt/spark/sbin/stop-thriftserver.sh > /proc/1/fd/1"]

On termination preStop hook writes to log the following:

Reading package lists...
Building dependency tree...
Reading state information...
curl is already the newest version (7.64.0-4+deb10u2).
0 upgraded, 0 newly installed, 0 to remove and 1 not upgraded.
no org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 to stop

Do you have any suggestions on how this might be solved?
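
For context: Spark's sbin scripts go through sbin/spark-daemon.sh, which records the daemon's PID in a file under $SPARK_PID_DIR (default /tmp) and prints "no ... to stop" when it cannot find that file or the recorded process is not running. A diagnostic sketch, assuming the default spark-<user>-<class>-<instance>.pid naming convention; the /opt/spark/pids path is just an example:

# check whether the pid file start-thriftserver.sh should have written exists
ls /tmp/spark-*-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-*.pid
# pinning the pid directory for both the start and stop invocations may help
export SPARK_PID_DIR=/opt/spark/pids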

dnskr avatar Jul 14 '21 21:07 dnskr

We are only using this on Kubernetes, where no .sh script is required. (Or am I missing your context here?) In our setup the server is always running, combined with some spot instances that it can scale up and down as required.

manuelneubergerwn avatar Jul 15 '21 06:07 manuelneubergerwn

Sorry, I think my previous message was not clear enough :) I use Kubernetes too. I have implemented a Deployment similar to yours, but instead of spark-submit and a long args list I use start-thriftserver.sh and a ConfigMap. Here is my Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sts
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sts
  template:
    metadata:
      labels:
        app: sts
    spec:
      containers:
        - name: spark-thrift-server
          image: gcr.io/spark-operator/spark:v3.1.1
          command:
          - /bin/sh
          - -c
          - >
            /opt/spark/sbin/start-thriftserver.sh \
              --name {{ .Release.Name }} \
              --conf spark.driver.host=$(hostname -I) \
              --conf spark.kubernetes.driver.pod.name=$(hostname);
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "/opt/spark/sbin/stop-thriftserver.sh > /proc/1/fd/1"]
          env:
            - name: SPARK_CONF_DIR
              value: "/opt/spark/conf"
          ports:
            - name: ui
              containerPort: 4040
            - name: thrift
              containerPort: 10000
          volumeMounts:
            - name: config-volume
              mountPath: /opt/spark/conf
      terminationGracePeriodSeconds: 60
      volumes:
        - name: config-volume
          configMap:
            name: sts

So the question is how to stop it the right way? The stop-thriftserver.sh script in the preStop hook writes no org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 to stop
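
One possible direction (an untested sketch): Spark's sbin scripts honor SPARK_NO_DAEMONIZE, which keeps start-thriftserver.sh in the foreground instead of daemonizing it. The server is then the container's long-running process, so Kubernetes' SIGTERM on pod shutdown should reach the JVM directly and the stop script / preStop hook should no longer be needed:

# added alongside SPARK_CONF_DIR in the container env of the Deployment above
- name: SPARK_NO_DAEMONIZE       # spark-daemon.sh runs the class in the foreground
  value: "true"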

dnskr avatar Jul 15 '21 08:07 dnskr

Sorry I am afraid I do not know the answer :/

manuelneubergerwn avatar Jul 15 '21 08:07 manuelneubergerwn

What confuses me is that I cannot find any open source Helm chart or other good way to run STS on Kubernetes. Spark Operator is promising, but it no longer looks like an actively developed project, so client mode (which is needed to run STS) will not be implemented in the near future.

dnskr avatar Jul 15 '21 10:07 dnskr

Any news on this? Would be interesting to have native spark k8s operator support...

mbode-asys avatar Nov 26 '21 16:11 mbode-asys

@dnskr, but the above implementation creates the master and the executors inside a single pod, right? That is not a scalable approach, right?

sangeethsasidharan avatar Aug 16 '22 03:08 sangeethsasidharan

Wow, this issue has been open since 2020. Is there any existing approach to expose a service like this? Thanks in advance!

ruslanguns avatar Jan 28 '24 10:01 ruslanguns

Hi @ruslanguns! I would recommend looking at the Apache Kyuubi project. It allows running Spark SQL engines on demand with different share levels (connection, user, group, and server) and has other features:

  • Apache Kyuubi on GitHub https://github.com/apache/kyuubi
  • The Share Level Of Kyuubi Engines https://kyuubi.readthedocs.io/en/master/deployment/engine_share_level.html
  • Kyuubi vs Spark Thrift Server comparison https://kyuubi.readthedocs.io/en/master/overview/kyuubi_vs_thriftserver.html
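
For reference, Kyuubi speaks the same HiveServer2 wire protocol, so existing JDBC/ODBC clients keep working. A minimal connection sketch; the hostname is illustrative, and 10009 is Kyuubi's default frontend port:

# connect to a running Kyuubi server with the beeline that ships with Spark
/opt/spark/bin/beeline -u 'jdbc:hive2://kyuubi.example.com:10009' -n myuser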

dnskr avatar Jan 28 '24 18:01 dnskr

It looks great. Thanks for sharing it! I'll take a look

ruslanguns avatar Jan 28 '24 19:01 ruslanguns