incubator-livy
[LIVY-702]: Submit Spark apps to Kubernetes
What changes were proposed in this pull request?
This PR is part of a series splitting the base PR https://github.com/apache/incubator-livy/pull/167 into multiple smaller PRs to ease and speed up the review and merge process.
This PR proposes a way to submit Spark apps to a Kubernetes cluster; a sketch of the client-side request is included after the list. Points covered:
- Submit batch sessions
- Submit interactive sessions
- Monitor sessions, collect logs and diagnostics information
- Restore sessions monitoring after restarts
- GC created Kubernetes resources
- Restrict the set of allowed Kubernetes namespaces
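As an illustration of the submission flow above, an interactive session can be created through the standard Livy REST API. This is a minimal sketch: the host matches the ingress used in the test setup below, and the namespace is a placeholder.

```bash
# Create an interactive (Scala/Spark) session via Livy's REST API.
# Host and namespace are placeholders for the test environment described below.
curl -k -H 'Content-Type: application/json' -X POST \
  -d '{
        "kind": "spark",
        "name": "interactive-01",
        "conf": {
          "spark.kubernetes.namespace": "<namespace>"
        }
      }' "https://my-cluster.example.com/livy/sessions"
```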
How was this patch tested?
Unit tests.
Manual testing with Kubernetes on Docker Desktop for Mac v2.1.0.1. Environment (Helm charts):
- cluster-base with custom-values.yaml:
```yaml
nginx-ingress:
  controller:
    service:
      loadBalancerIP: 127.0.0.1  # my-cluster.example.com IP address (from /etc/hosts)
      loadBalancerSourceRanges: []
cluster-autoscaler:
  enabled: false
oauth2-proxy:
  enabled: false
```
- spark-cluster with custom-values.yaml:
```yaml
livy:
  image:
    pullPolicy: Never
    tag: 0.7.0-incubating-spark_2.4.3_2.11-hadoop_3.2.0-dev
  ingress:
    enabled: true
    annotations:
      kubernetes.io/ingress.class: nginx
      kubernetes.io/tls-acme: "true"
      nginx.ingress.kubernetes.io/rewrite-target: /$1
    path: /livy/?(.*)
    hosts:
      - my-cluster.example.com
    tls:
      - secretName: spark-cluster-tls
        hosts:
          - my-cluster.example.com
  persistence:
    enabled: true
  env:
    LIVY_LIVY_UI_BASE1PATH: {value: "/livy"}
    LIVY_SPARK_KUBERNETES_CONTAINER_IMAGE_PULL1POLICY: {value: "Never"}
    LIVY_SPARK_KUBERNETES_CONTAINER_IMAGE: {value: "sasnouskikh/livy-spark:0.7.0-incubating-spark_2.4.3_2.11-hadoop_3.2.0-dev"}
    LIVY_LIVY_SERVER_SESSION_STATE0RETAIN_SEC: {value: "300s"}
    LIVY_LIVY_SERVER_KUBERNETES_ALLOWED1NAMESPACES: {value: "default,test"}
historyserver:
  enabled: false
jupyterhub:
  enabled: true
  ingress:
    enabled: true
    annotations:
      kubernetes.io/ingress.class: nginx
      kubernetes.io/tls-acme: "true"
    hosts:
      - my-cluster.example.com
    pathSuffix: ''
    tls:
      - secretName: spark-cluster-tls
        hosts:
          - my-cluster.example.com
  hub:
    baseUrl: /jupyterhub
    publicURL: "https://my-cluster.example.com"
    activeServerLimit: 10
    # $> openssl rand -hex 32
    cookieSecret: 41b85e5f50222b1542cc3b38a51f4d744864acca5e94eeb78c6e8c19d89eb433
    pdb:
      enabled: true
      minAvailable: 0
  proxy:
    # $> openssl rand -hex 32
    secretToken: cc52356e9a19a50861b22e08c92c40b8ebe617192f77edb355b9bf4b74b055de
    pdb:
      enabled: true
      minAvailable: 0
  cull:
    enabled: false
    timeout: 300
    every: 60
```
- Interactive sessions - Jupyter notebook on JupyterHub with Sparkmagic
- Batch sessions - SparkPi:
```bash
curl -k -H 'Content-Type: application/json' -X POST \
  -d '{
        "name": "SparkPi-01",
        "className": "org.apache.spark.examples.SparkPi",
        "numExecutors": 2,
        "file": "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.3.jar",
        "args": ["10000"],
        "conf": {
          "spark.kubernetes.namespace": "<namespace>"
        }
      }' "https://my-cluster.example.com/livy/batches"
```
Codecov Report
Merging #249 into master will decrease coverage by 1.52%. The diff coverage is 34.53%.

```diff
@@            Coverage Diff             @@
##           master     #249      +/-   ##
============================================
- Coverage     68.19%   66.66%    -1.53%
- Complexity      964      982       +18
============================================
  Files           104      105        +1
  Lines          5952     6252      +300
  Branches        900      955       +55
============================================
+ Hits           4059     4168      +109
- Misses         1314     1483      +169
- Partials        579      601       +22
```

| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| ...ain/java/org/apache/livy/rsc/driver/RSCDriver.java | 79.33% <0.00%> (-0.67%) | 45.00 <0.00> (ø) |
| ...e/livy/server/interactive/InteractiveSession.scala | 69.76% <0.00%> (-0.41%) | 51.00 <0.00> (ø) |
| ...rc/main/scala/org/apache/livy/utils/SparkApp.scala | 45.23% <5.55%> (-30.77%) | 1.00 <0.00> (ø) |
| ...main/scala/org/apache/livy/server/LivyServer.scala | 33.03% <20.00%> (+<0.01%) | 11.00 <1.00> (ø) |
| ...ala/org/apache/livy/utils/SparkKubernetesApp.scala | 32.42% <32.42%> (ø) | 14.00 <14.00> (?) |
| rsc/src/main/java/org/apache/livy/rsc/RSCConf.java | 88.18% <100.00%> (+0.33%) | 9.00 <1.00> (+1.00) |
| ...rver/src/main/scala/org/apache/livy/LivyConf.scala | 96.42% <100.00%> (+0.29%) | 23.00 <2.00> (+2.00) |
| .../scala/org/apache/livy/sessions/SessionState.scala | 61.11% <0.00%> (ø) | 2.00% <0.00%> (ø%) |
| ... and 3 more | | |

Continue to review the full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ee7fdfc...e087b39.
@mgaido91 Could you take a look?
@mgaido91 ping.
@jahstreet I am not the best guy to take a look at this, honestly. I am reviewing this PR in a few hours, but it would be great to have feedback also from other people who are more familiar with this part of Livy. cc @vanzin @jerryshao
> @jahstreet I am not the best guy to take a look at this, honestly. I am reviewing this PR in a few hours, but it would be great to have feedback also from other people who are more familiar with this part of Livy. cc @vanzin @jerryshao
Ah, I see. Will try to ping them. Thanks anyway.
Build failure due to Travis:
No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#build-times-out-because-no-output-was-received
@jerryshao @vanzin could you take a look please? Your review would be really helpful to move this PR forward.
@yiheng @arunmahadevan @mgaido91 Anything else from your side?
Rebased
Rebased to master
.
can we merge this? :)
> can we merge this? :)
I would also love to! Open to suggestions on how to get closer to that.
Is there a timeline for when this will get integrated into Livy? This would help us run Jupyter on Spark on Kubernetes. Any ETA would be very helpful! Thanks!
Hi @SarnathK , I've tried to contact the community multiple times via the mailing lists, with no luck in pushing this forward. I'm tracking the activity around this work and have a list of patches on top of it in the backlog. Also, I'm always ready to provide full support around running Livy on Kubernetes. If you don't mind, I could add you to the thread so you could share your use cases with the community and draw more attention to this patch. Would that work for you?
@jerryshao do you have bandwidth to review this? I've done a partial review above, but it needs another pair of eyes.
I can take a chance to review this, but I'm not an expert in K8s, so I may not fully understand the pros and cons of the implementation.
@jerryshao I check all the notifications on this PR; feel free to add comments on any uncertainties, and I will be glad to describe the idea.
@jahstreet sure. Please feel free to add me to any relevant thread. I am fully convinced that this is a great solution for many organizations that deal with high-volume data. Many are looking for such a solution but don't know what to do or how. I really hope this one gets integrated with Livy. Many thanks again @jahstreet for the contribution!
Rebased to master and upgraded the Kubernetes client version.
In our case, we use Sparkmagic in Jupyter to connect to Livy and start the Spark cluster in Kubernetes. We have started to use this patch in our work. As more and more applications migrate towards the cloud, this patch is definitely very valuable. Hopefully this feature can be merged ASAP, so that more people can see and use it.
> I can take a chance to review this, but I'm not an expert in K8s, so I may not fully understand the pros and cons of the implementation.
@jerryshao , have you given it a try already?
I had some issues with using ZooKeeper as a recovery method. When one of the ZooKeeper nodes is not up in Kubernetes, its address is not published (i.e. there is none), and thus Livy cannot resolve the ZooKeeper host. With the current version of ZooKeeper, Livy crashes because of a bug in the ZooKeeper client. This has been fixed in more recent versions (3.5.2+): ZOOKEEPER-1576.
In short, Livy cannot start if only 2 out of 3 ZooKeeper nodes are up at the moment, even though it should be able to. Bumping the ZooKeeper client version should resolve the issue.
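For context, the recovery setup in question is Livy's standard ZooKeeper state store. A minimal sketch of the relevant settings, with placeholder ZooKeeper service addresses, looks roughly like this:

```bash
# Minimal sketch of ZooKeeper-backed session recovery in livy.conf
# (the ZooKeeper addresses are placeholders for the in-cluster ensemble).
cat >> conf/livy.conf <<'EOF'
livy.server.recovery.mode = recovery
livy.server.recovery.state-store = zookeeper
livy.server.recovery.state-store.url = zk-0.zk-hs:2181,zk-1.zk-hs:2181,zk-2.zk-hs:2181
EOF
```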
@jahstreet Can we update ZooKeeper version in this PR as well?
Hi @lukatera , upgrading the ZooKeeper client is out of scope for this PR and is a good candidate for a separate one. Feel free to open it. Later we can rebase this one to master once it is merged, and you can cherry-pick it locally (or to your fork) to add it to this PR's code. Will that work for you?
> Hi @lukatera , upgrading the ZooKeeper client is out of scope for this PR and is a good candidate for a separate one. Feel free to open it. Later we can rebase this one to master once it is merged, and you can cherry-pick it locally (or to your fork) to add it to this PR's code. Will that work for you?
Sure, makes sense. Thanks :) Is there a single branch somewhere where all the K8S functionality is present so I can try to build it locally?
@lukatera , you can check this PR: https://github.com/apache/incubator-livy/pull/167 . Basically, this PR is a clean backport from that one.
Thanks @jahstreet for your effort. Tested with:
- kubernetes 1.15.11
- Spark 2.4.5
What is holding up this PR and https://github.com/apache/incubator-livy/pull/252?
Also, @jahstreet, if you could add me to the thread, I can explain our use case, where we use Jupyter to schedule Spark jobs on Kubernetes.
@gmcoringa , thank you for the feedback. Can I have your e-mail to add you to the thread?
> What is holding up this PR and #252?

It should be reviewed and approved by the maintainers. No luck attracting enough attention so far, so I'm continuing to maintain the fork.

> @gmcoringa , thank you for the feedback. Can I have your e-mail to add you to the thread?

Sure @jahstreet: [email protected]
Is there any hope to see this merged anytime soon?
Good news: yesterday I upgraded the Helm charts to Spark 3.0.1, which unlocks K8s API 1.18.x usage. Feel free to try it out with this guide. In the meantime, I'm going to backport the required changes to this PR.
FYI I received:
Error: chart requires kubeVersion: 1.11.0 - 1.18.0 which is incompatible with Kubernetes v1.18.9
Did you mean 1.18.x or 1.18.0?