
[LIVY-702]: Submit Spark apps to Kubernetes

Open jahstreet opened this issue 5 years ago • 75 comments

What changes were proposed in this pull request?

Jira

This PR is part of a series that splits the base PR https://github.com/apache/incubator-livy/pull/167 into multiple PRs to ease and speed up review and merging.

This PR proposes a way to submit Spark apps to a Kubernetes cluster. Points covered:

  • Submit batch sessions
  • Submit interactive sessions
  • Monitor sessions, collect logs and diagnostics information
  • Restore sessions monitoring after restarts
  • GC created Kubernetes resources
  • Restrict the set of allowed Kubernetes namespaces
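
To illustrate the last point, here is a minimal Python sketch of an allow-list check for namespaces. It is not the code from this PR: the function names are hypothetical, the comma-separated input format is assumed, and treating an empty list as "no restriction" is an assumption rather than confirmed behavior.

```python
def parse_allowed_namespaces(raw):
    """Split a comma-separated setting such as "default,test" into a set."""
    return {ns.strip() for ns in raw.split(",") if ns.strip()}


def is_namespace_allowed(requested, allowed):
    """Return True if the requested namespace may be used.

    Assumption: an empty allow-list means that no restriction is configured.
    """
    return not allowed or requested in allowed


allowed = parse_allowed_namespaces("default, test")
print(is_namespace_allowed("default", allowed))
print(is_namespace_allowed("prod", allowed))
```

A check like this would run before submission, against the `spark.kubernetes.namespace` value supplied in the session's `conf`.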

How was this patch tested?

Unit tests.

Manual testing with Kubernetes on Docker Desktop for Mac v2.1.0.1. Environment - Helm charts:

nginx-ingress:
  controller:
    service:
      loadBalancerIP: 127.0.0.1 # my-cluster.example.com IP address (from /etc/hosts)
      loadBalancerSourceRanges: []
cluster-autoscaler:
  enabled: false
oauth2-proxy:
  enabled: false
livy:
  image:
    pullPolicy: Never
    tag: 0.7.0-incubating-spark_2.4.3_2.11-hadoop_3.2.0-dev
  ingress:
    enabled: true
    annotations:
      kubernetes.io/ingress.class: nginx
      kubernetes.io/tls-acme: "true"
      nginx.ingress.kubernetes.io/rewrite-target: /$1
    path: /livy/?(.*)
    hosts:
    - my-cluster.example.com
    tls:
    - secretName: spark-cluster-tls
      hosts:
      - my-cluster.example.com
  persistence:
    enabled: true
  env:
    LIVY_LIVY_UI_BASE1PATH: {value: "/livy"}
    LIVY_SPARK_KUBERNETES_CONTAINER_IMAGE_PULL1POLICY: {value: "Never"}
    LIVY_SPARK_KUBERNETES_CONTAINER_IMAGE: {value: "sasnouskikh/livy-spark:0.7.0-incubating-spark_2.4.3_2.11-hadoop_3.2.0-dev"}
    LIVY_LIVY_SERVER_SESSION_STATE0RETAIN_SEC: {value: "300s"}
    LIVY_LIVY_SERVER_KUBERNETES_ALLOWED1NAMESPACES: {value: "default,test"}
historyserver:
  enabled: false
jupyterhub:
  enabled: true
  ingress:
    enabled: true
    annotations:
      kubernetes.io/ingress.class: nginx
      kubernetes.io/tls-acme: "true"
    hosts:
    - my-cluster.example.com
    pathSuffix: ''
    tls:
    - secretName: spark-cluster-tls
      hosts:
      - my-cluster.example.com
  hub:
    baseUrl: /jupyterhub
    publicURL: "https://my-cluster.example.com"
    activeServerLimit: 10
    # $> openssl rand -hex 32
    cookieSecret: 41b85e5f50222b1542cc3b38a51f4d744864acca5e94eeb78c6e8c19d89eb433
    pdb:
      enabled: true
      minAvailable: 0
  proxy:
    # $> openssl rand -hex 32
    secretToken: cc52356e9a19a50861b22e08c92c40b8ebe617192f77edb355b9bf4b74b055de
    pdb:
      enabled: true
      minAvailable: 0
  cull:
    enabled: false
    timeout: 300
    every: 60
  • Interactive sessions - Jupyter notebook on JupyterHub with Sparkmagic
  • Batch sessions - SparkPi:
curl -k -H 'Content-Type: application/json' -X POST \
  -d '{
        "name": "SparkPi-01",
        "className": "org.apache.spark.examples.SparkPi",
        "numExecutors": 2,
        "file": "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.3.jar",
        "args": ["10000"],
        "conf": {
            "spark.kubernetes.namespace": "<namespace>"
        }
      }' "https://my-cluster.example.com/livy/batches"
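
The same submission can be sketched in Python using only the standard library. This is a hedged illustration, not part of the PR: the helper names are hypothetical, the set of terminal states is an assumption based on Livy's documented session states, and TLS verification is left on (unlike `curl -k`), so a self-signed certificate would need extra handling.

```python
import json
import time
import urllib.request

# Assumed terminal batch states; adjust if your Livy version differs.
TERMINAL_STATES = {"success", "dead", "killed", "error"}


def build_batch_payload(name, namespace, num_executors=2):
    """Build the JSON body used in the curl example above."""
    return {
        "name": name,
        "className": "org.apache.spark.examples.SparkPi",
        "numExecutors": num_executors,
        "file": "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.3.jar",
        "args": ["10000"],
        "conf": {"spark.kubernetes.namespace": namespace},
    }


def request_json(url, payload=None):
    """GET the URL, or POST when a payload is given, and decode the JSON reply."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def submit_and_wait(livy_url, payload, poll_seconds=5.0):
    """POST to /batches, then poll /batches/{id}/state until a terminal state."""
    batch = request_json(f"{livy_url}/batches", payload)
    state = batch["state"]
    while state not in TERMINAL_STATES:
        time.sleep(poll_seconds)
        state = request_json(f"{livy_url}/batches/{batch['id']}/state")["state"]
    return state
```

For example, `submit_and_wait("https://my-cluster.example.com/livy", build_batch_payload("SparkPi-01", "default"))` would block until the batch finishes and return its final state.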

jahstreet avatar Oct 27 '19 17:10 jahstreet

Codecov Report

Merging #249 into master will decrease coverage by 1.52%. The diff coverage is 34.53%.


@@             Coverage Diff              @@
##             master     #249      +/-   ##
============================================
- Coverage     68.19%   66.66%   -1.53%     
- Complexity      964      982      +18     
============================================
  Files           104      105       +1     
  Lines          5952     6252     +300     
  Branches        900      955      +55     
============================================
+ Hits           4059     4168     +109     
- Misses         1314     1483     +169     
- Partials        579      601      +22     
Impacted Files                                         Coverage Δ                 Complexity Δ
...ain/java/org/apache/livy/rsc/driver/RSCDriver.java  79.33% <0.00%> (-0.67%)    45.00 <0.00> (ø)
...e/livy/server/interactive/InteractiveSession.scala  69.76% <0.00%> (-0.41%)    51.00 <0.00> (ø)
...rc/main/scala/org/apache/livy/utils/SparkApp.scala  45.23% <5.55%> (-30.77%)   1.00 <0.00> (ø)
...main/scala/org/apache/livy/server/LivyServer.scala  33.03% <20.00%> (+<0.01%)  11.00 <1.00> (ø)
...ala/org/apache/livy/utils/SparkKubernetesApp.scala  32.42% <32.42%> (ø)        14.00 <14.00> (?)
rsc/src/main/java/org/apache/livy/rsc/RSCConf.java     88.18% <100.00%> (+0.33%)  9.00 <1.00> (+1.00)
...rver/src/main/scala/org/apache/livy/LivyConf.scala  96.42% <100.00%> (+0.29%)  23.00 <2.00> (+2.00)
.../scala/org/apache/livy/sessions/SessionState.scala  61.11% <0.00%> (ø)         2.00% <0.00%> (ø%)
... and 3 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update ee7fdfc...e087b39.

codecov-io avatar Oct 27 '19 19:10 codecov-io

@mgaido91 Could you take a look?

jahstreet avatar Nov 02 '19 11:11 jahstreet

@mgaido91 ping.

jahstreet avatar Nov 10 '19 18:11 jahstreet

@jahstreet Honestly, I am not the best person to review this. I will review this PR in a few hours, but it would be great to also have feedback from other people who are more familiar with this part of Livy. cc @vanzin @jerryshao

mgaido91 avatar Nov 11 '19 16:11 mgaido91

@jahstreet Honestly, I am not the best person to review this. I will review this PR in a few hours, but it would be great to also have feedback from other people who are more familiar with this part of Livy. cc @vanzin @jerryshao

Ah, I see. Will try to ping them. Thanks anyway.

jahstreet avatar Nov 11 '19 16:11 jahstreet

Build failure due to Travis:

No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#build-times-out-because-no-output-was-received

@jerryshao @vanzin could you take a look, please? Your review would really help move this PR forward.

jahstreet avatar Nov 14 '19 09:11 jahstreet

@yiheng @arunmahadevan @mgaido91 Anything else from your side?

jahstreet avatar Dec 05 '19 09:12 jahstreet

Rebased

jahstreet avatar Dec 05 '19 09:12 jahstreet

Rebased to master.

jahstreet avatar Jan 14 '20 11:01 jahstreet

can we merge this? :)

ghost avatar Feb 10 '20 16:02 ghost

can we merge this? :)

I would also love to! I'm open to suggestions on how to get closer to it.

jahstreet avatar Feb 10 '20 16:02 jahstreet

Is there a timeline when this will get integrated with Livy? This would help us run Jupyter on Spark on Kubernetes. Any ETA will be very helpful! Thanks!

SarnathK avatar Mar 28 '20 06:03 SarnathK

Hi @SarnathK, I've tried to contact the community multiple times via the mailing lists to push this forward, with no luck. I'm tracking the activity around this work and have a list of patches on top of it in the backlog. I'm also always ready to provide full support for setting up Livy on Kubernetes. If you don't mind, I could add you to the thread so you can share your use cases with the community and draw more attention to this patch. Would that be OK?

jahstreet avatar Mar 28 '20 14:03 jahstreet

@jerryshao do you have bandwidth to review this? I've done a partial review above, but it needs another pair of eyes.

ajbozarth avatar Mar 30 '20 02:03 ajbozarth

I can take a chance to review this, but I'm not an expert in k8s, so I may not fully understand the pros and cons of the implementation.

jerryshao avatar Mar 30 '20 02:03 jerryshao

@jerryshao I check all the notifications on this PR. Feel free to comment on anything that is unclear; I will be glad to explain the idea.

jahstreet avatar Mar 30 '20 07:03 jahstreet

@jahstreet sure. Please feel free to add me to any relevant thread. I am fully convinced that this is a great solution for many organizations that deal with high-volume data. Many are looking for something like this but don't know what to do or how to do it. I really hope this one gets integrated with Livy. Many thanks again @jahstreet for the contribution!

SarnathK avatar Mar 30 '20 11:03 SarnathK

Rebased to master and upgraded kubernetes client version.

jahstreet avatar Apr 04 '20 13:04 jahstreet

In our case, we use Sparkmagic in Jupyter to connect to Livy and start the Spark cluster in Kubernetes. We have started to use this patch in our work. As more and more applications migrate to the cloud, this patch is definitely very valuable. Hopefully this feature can be merged ASAP, so that more people can see and use it.

cyliu0204 avatar Apr 22 '20 09:04 cyliu0204

I can take a chance to review this, but I'm not an expert in k8s, so I may not fully understand the pros and cons of the implementation.

@jerryshao , have you given it a try already?

jahstreet avatar May 10 '20 13:05 jahstreet

I had some issues with using ZooKeeper as a recovery method. When one of the ZooKeeper instances is not up in Kubernetes, its address is not published (i.e. there is none), and thus Livy cannot resolve that ZooKeeper host. With the current version of ZooKeeper, Livy crashes because of a bug in the ZooKeeper client. This has been fixed in more recent versions (3.5.2+), though: ZOOKEEPER-1576.

In short, Livy cannot start when only 2 out of 3 ZooKeepers are up, even though it should be able to do so. Bumping the ZooKeeper client version should resolve the issue.

@jahstreet Can we update ZooKeeper version in this PR as well?
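
The "2 out of 3" expectation follows from ZooKeeper's majority quorum rule. A minimal Python sketch of that arithmetic (the function names are hypothetical and serve only as illustration):

```python
def quorum_size(ensemble_size):
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def has_quorum(ensemble_size, servers_up):
    """True when enough servers are up for the ensemble to serve requests."""
    return servers_up >= quorum_size(ensemble_size)


print(quorum_size(3))
print(has_quorum(3, 2))
```

With three servers the quorum is two, so the ensemble itself is healthy in the scenario described above; the failure comes from the client being unable to handle the one unresolvable host.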

ghost avatar Jun 16 '20 07:06 ghost

Hi @lukatera, upgrading the ZooKeeper client is out of scope for this PR and is a good candidate for a separate one. Feel free to open it. Once it is merged, we can rebase this one onto master, and you can cherry-pick it locally (or to your fork) to add it to this PR's code. Will that work for you?

jahstreet avatar Jun 16 '20 08:06 jahstreet

Hi @lukatera, upgrading the ZooKeeper client is out of scope for this PR and is a good candidate for a separate one. Feel free to open it. Once it is merged, we can rebase this one onto master, and you can cherry-pick it locally (or to your fork) to add it to this PR's code. Will that work for you?

Sure, makes sense. Thanks :) Is there a single branch somewhere where all the K8S functionality is present so I can try to build it locally?

ghost avatar Jun 16 '20 09:06 ghost

@lukatera , you can check that PR: https://github.com/apache/incubator-livy/pull/167 . Basically this PR is the clean backport from the former one.

jahstreet avatar Jun 16 '20 17:06 jahstreet

Thanks @jahstreet for your effort. Tested with:

  • kubernetes 1.15.11
  • Spark 2.4.5

What is holding up this PR and https://github.com/apache/incubator-livy/pull/252?

Also, @jahstreet, if you could add me to the thread, I can explain our use case, where we use Jupyter to schedule Spark jobs on Kubernetes.

gmcoringa avatar Jul 14 '20 18:07 gmcoringa

@gmcoringa , thank you for the feedback. Can I have your e-mail to add you to the thread?

What is holding up this PR and #252?

It needs to be reviewed and approved by the maintainers. So far I have had no luck attracting enough attention, so I continue to maintain the fork.

jahstreet avatar Aug 01 '20 09:08 jahstreet

@gmcoringa , thank you for the feedback. Can I have your e-mail to add you to the thread?

Sure @jahstreet [email protected]

gmcoringa avatar Aug 02 '20 16:08 gmcoringa

is there any hope to see this merged anytime soonish?

jesinity avatar Sep 14 '20 12:09 jesinity

Good news: yesterday I upgraded the Helm charts to Spark 3.0.1, which unlocks K8s API 1.18.x usage. Feel free to try it out with this guide. In the meantime I'm going to backport the required changes to this PR.

jahstreet avatar Oct 06 '20 11:10 jahstreet

FYI I received:

Error: chart requires kubeVersion: 1.11.0 - 1.18.0 which is incompatible with Kubernetes v1.18.9

Did you mean 1.18.x or 1.18.0?
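
The error is consistent with reading `1.11.0 - 1.18.0` as a range that is inclusive at both ends, which makes 1.18.0 the highest accepted patch release. A small Python sketch of that comparison, assuming plain `major.minor.patch` versions with no pre-release suffixes (the helper names are hypothetical, and this is not Helm's own implementation):

```python
def parse_version(text):
    """Turn "v1.18.9" or "1.18.9" into a comparable tuple of integers."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def in_hyphen_range(version, low, high):
    """Check low <= version <= high, with both bounds inclusive."""
    return parse_version(low) <= parse_version(version) <= parse_version(high)


print(in_hyphen_range("v1.18.9", "1.11.0", "1.18.0"))
print(in_hyphen_range("v1.18.0", "1.11.0", "1.18.0"))
```

Under this reading, accepting every 1.18.x release would require a wider upper bound in the chart's `kubeVersion` constraint.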

kyprifog avatar Oct 06 '20 16:10 kyprifog