incubator-livy
[LIVY-588]: Full support for Spark on Kubernetes
NOTE: this PR is deprecated and kept for discussion history only. Please refer to #249 for the latest state of the work.
What changes were proposed in this pull request?
This PR is a new feature proposal: full support for Spark on Kubernetes (inspired by the SparkYarnApp implementation).
Spark on Kubernetes has been available for quite a while now, so it makes sense for Livy to support it as well. It solves many of the problems of working with Spark on Kubernetes and can fully replace YARN when running on top of a Kubernetes cluster:
- Livy UI has a cached logs/diagnostics page
- Livy UI shows links to the Spark UI and the Spark History Server
- With a Kubernetes Ingress resource, Livy can be configured to serve as an orchestrator of Spark apps on Kubernetes (the PR includes an Nginx Ingress support option to create routes to the Spark UI)
- Nginx Ingress solves `basePath` support for the Spark UI and History Server, and has lots of auth integrations available: https://github.com/kubernetes/ingress-nginx
- Livy UI can be integrated with Grafana Loki logs (the PR provides a solution for that)
Dockerfiles repo: https://github.com/jahstreet/spark-on-kubernetes-docker
Helm charts: https://github.com/jahstreet/spark-on-kubernetes-helm
Associated JIRA: https://issues.apache.org/jira/browse/LIVY-588
Design concept: https://github.com/jahstreet/spark-on-kubernetes-helm/blob/develop/README.md
How was this patch tested?
Was tested manually on an AKS cluster (Azure Kubernetes Service), Kubernetes v1.11.8:
- Image: Spark 2.4.3 with Hadoop 3.2.0 (https://github.com/jahstreet/spark-on-kubernetes-docker)
- History Server: https://github.com/helm/charts/tree/master/stable/spark-history-server
- Jupyter Notebook with Sparkmagic: https://github.com/jahstreet/spark-on-kubernetes-helm/tree/master/charts/jupyter
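For context, a minimal sketch of reproducing a similar setup with the linked Helm charts. The chart paths, Helm 2 CLI syntax of that era, the `spark` namespace, and the release names are assumptions for illustration; adjust `values.yaml` to your cluster:

```bash
# Clone the charts repo and install Livy and Jupyter (illustrative release/namespace names).
git clone https://github.com/jahstreet/spark-on-kubernetes-helm
cd spark-on-kubernetes-helm
helm install ./charts/livy --name livy --namespace spark
helm install ./charts/jupyter --name jupyter --namespace spark
```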
What do you think about this?
@vanzin please take a look.
Just to set expectations, it's very unlikely I'll be able to look at this PR (or any other really) any time soon.
Well, then I'll try to prepare as much as I can until you become available. I hope someone from the community will be able to share feedback on the work done.
Codecov Report
Merging #167 into master will decrease coverage by 3.47%. The diff coverage is 26.71%.
@@ Coverage Diff @@
## master #167 +/- ##
============================================
- Coverage 68.6% 65.12% -3.48%
- Complexity 904 940 +36
============================================
Files 100 102 +2
Lines 5666 6291 +625
Branches 850 946 +96
============================================
+ Hits 3887 4097 +210
- Misses 1225 1614 +389
- Partials 554 580 +26
Impacted Files | Coverage Δ | Complexity Δ |
---|---|---|
...e/livy/server/interactive/InteractiveSession.scala | 68.75% <0%> (-0.37%) | 46 <0> (+2) |
...rver/src/main/scala/org/apache/livy/LivyConf.scala | 96.46% <100%> (+0.6%) | 22 <1> (+1) |
...ala/org/apache/livy/utils/SparkKubernetesApp.scala | 20.36% <20.36%> (ø) | 0 <0> (?) |
...main/scala/org/apache/livy/server/LivyServer.scala | 32.43% <33.33%> (-3.53%) | 11 <0> (ø) |
...ain/java/org/apache/livy/rsc/driver/RSCDriver.java | 79.25% <50%> (+1.28%) | 45 <0> (+4) |
...rc/main/scala/org/apache/livy/utils/SparkApp.scala | 67.5% <55.55%> (-8.5%) | 1 <0> (ø) |
...in/scala/org/apache/livy/repl/SQLInterpreter.scala | 62.5% <0%> (-7.88%) | 9% <0%> (+2%) |
...ain/scala/org/apache/livy/utils/SparkYarnApp.scala | 66.01% <0%> (-7.23%) | 40% <0%> (+7%) |
...n/scala/org/apache/livy/server/AccessManager.scala | 75.47% <0%> (-5.38%) | 46% <0%> (+2%) |
...cala/org/apache/livy/scalaapi/ScalaJobHandle.scala | 52.94% <0%> (-2.95%) | 7% <0%> (ø) |
... and 20 more |
Continue to review full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7dee3cc...7f6ef8a. Read the comment docs.
I'm going to experiment with this a bit. We're running Spark on Kubernetes widely and we're also looking to migrate our notebook usage to Kubernetes. The benefits we see from Kubernetes are the elasticity with the associated cost savings, and the ability to track and analyse the resource usage of individual jobs closely.
From my quick glance at the source I will probably miss more extensive support for customizing the created drivers (I assume that Livy creates the drivers as pods in the cluster, which then create the executors). In our current usage of Spark on Kubernetes we supply about 20 different --conf options to the driver, some of which carry job-specific information such as name and owner.
Sounds cool, I'll be glad to assist you during the experiments. Maybe you can share the cases you are looking for a solution to; I'm sure this would be helpful for designing the requirements for the features to implement within this work.
By the way, in the near future I'll prepare guidelines for the deployment, customization, and usage options of Livy on Kubernetes. I will share the progress on that.
I built Livy on my own machine based on your branch and the Dockerfile in your repository. I got it running so that it created the driver pod, but I was unable to fully start the driver because I use my own Spark image, which requires some configuration parameters to be passed in.
Here's some feedback:
- The Helm chart doesn't allow specifying a "serviceAccount" property for Livy.
- Couldn't find a way to set the namespace which Livy must use. It seems to try to search all pods in all namespaces. Also need to set the namespace where the pods are created (seems to be fixed to "default").
- Could you provide a way to fully customise the driver pod specification? I would want to set custom volumes and volume mounts, environment variables, labels, sidecar containers, and possibly even customise the command line arguments for the driver.
- A way to provide custom Spark configuration settings for the driver pod would also be required.
- Support for macros for both customising the driver pod and the extra Spark configuration options. I would at least need the id of the Livy session (e.g. "livy-session-2-9SZP8Ijv") to be inserted into both the pod template and the Spark configuration options.
Unfortunately I don't know Scala very well, so I couldn't easily dig into the code to determine how this works, and I'm not able to provide you with more detailed recommendations.
@garo Thanks for the review.
Here are some explanations on your questions:
- The first version of the chart was done without RBAC support. I've just finished an RBAC support solution for the Livy chart but haven't merged it yet; you can refer to the feature branch https://github.com/jahstreet/spark-on-kubernetes-helm/blob/charts/livy/rbac-support/charts/livy/values.yaml:

      serviceAccount:
        # Specifies whether a service account should be created
        create: true
        # The name of the service account to use.
        # If not set and create is true, a name is generated using the fullname template
        name:
- Livy searches for a Driver Pod in all namespaces the first time (theoretically the user may want to submit the job to any namespace) in order to initialize the KubernetesApplication object; then it uses that object (which contains the namespace field) to get the Spark pods' states and logs, and it looks for that information only within the single target namespace (I've added comments to the lines where this logic is implemented).
- By default Livy should submit the Spark app to the `default` namespace (if it does not, then I need to make a fix ;) ). You can change that behavior by adding `spark.kubernetes.namespace=<desired_namespace>` to /opt/spark/conf/spark-defaults.conf in the Livy container. The Livy entrypoint is written so that it can set spark-defaults configs from env variables, so you can set the Livy container env `LIVY_SPARK_KUBERNETES_NAMESPACE=<desired_namespace>` to change the Spark apps' default namespace. In the new version of the Livy chart I set it to `.Release.Namespace`. And of course you can pass it as an additional conf on app submission within the POST request to Livy, `{ ... "conf": { "spark.kubernetes.namespace": "<desired_namespace>" }, ... }`, to overwrite the defaults (see the sketch after this list).
- Please refer to the customization explanations for Livy: https://github.com/jahstreet/spark-on-kubernetes-helm/tree/master/charts/livy#customizing-livy-server Following that approach you can set any config defaults for both Livy and Spark. If you need to overwrite some, do that on job submission in the POST request body.
- To customize the Driver Pod spec we would need a custom build of Spark installed in the Livy image (Livy just runs spark-submit). I refer to the official releases of Apache Spark and do not see available options for that at present (including adding sidecars). But Spark has configs to set volumes and volume mounts, environment variables, and labels (https://spark.apache.org/docs/latest/running-on-kubernetes.html#spark-properties), which we can set as default values as I described before.
- Customize the command line arguments for the driver: do you mean application args? You can pass them on job submission in the POST request body: https://livy.incubator.apache.org/docs/latest/rest-api.html Or do you need custom spark-submit script options?
- Why do you need the id of the Livy session, and what kind of macros for both customizing the driver pod and the extra Spark configuration options do you mean? Could you provide an example of a job you want to run? I hope I will be able to show you the available solutions using that example.
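To make the namespace and driver-customization points above concrete, here is a sketch of a batch POST that combines them through standard spark.kubernetes.* properties of Spark 2.4. The namespace "spark-jobs", the label and env values, and the PVC name "spark-data" are illustrative assumptions, not defaults of this PR:

```bash
# Sketch: per-job overrides sent to Livy's /batches endpoint (values are illustrative).
curl -H 'Content-Type: application/json' -X POST "http://localhost:8998/batches" -d '{
  "name": "spark-pi",
  "file": "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.1.jar",
  "className": "org.apache.spark.examples.SparkPi",
  "conf": {
    "spark.kubernetes.namespace": "spark-jobs",
    "spark.kubernetes.driver.label.owner": "garo",
    "spark.kubernetes.driverEnv.RUN_ID": "run-42",
    "spark.kubernetes.driver.volumes.persistentVolumeClaim.data.mount.path": "/data",
    "spark.kubernetes.driver.volumes.persistentVolumeClaim.data.options.claimName": "spark-data"
  }
}'
# The same keys can instead be set cluster-wide in /opt/spark/conf/spark-defaults.conf of
# the Livy container (e.g. via the LIVY_SPARK_* env variables handled by the entrypoint).
```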
Thank you very much for the detailed response! I'm just leaving for my Easter holiday, so I won't be able to try again until after that.
I did, however, create this gist showing how we create the Spark drivers in our current workflow: we run Azkaban (like a glorified cron service) which runs our Spark applications. Each application (i.e. a scheduled cron execution) starts a Spark driver pod in Kubernetes. If you look at this gist https://gist.github.com/garo/90c6e69d2430ef7d93ca9f564ba86059 there is first a build of the spark-submit configuration parameters, followed by the YAML for the driver pod.
So I naturally tried to think about how I can use Livy to launch the same image with the same kind of settings. I think that with your explanations I can implement most if not all of these settings except the run_id.
Let's continue this discussion after Easter. Have a great week!
Just to clarify, to be on the same page... When you send a request to Livy, e.g.:
kubectl exec livy-pod -- curl -H 'Content-Type: application/json' -X POST \
  -d '{ "name": "spark-pi", "proxyUser": "livy_user", "numExecutors": 2, "conf": { "spark.kubernetes.container.image": "sasnouskikh/spark:2.4.1-hadoop_3.2.0", "spark.kubernetes.container.image.pullPolicy": "Always", "spark.kubernetes.namespace": "default" }, "file": "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.1.jar", "className": "org.apache.spark.examples.SparkPi", "args": [ "1000000" ] }' \
  "http://localhost:8998/batches"
Under the hood Livy just runs spark-submit for you:
spark-submit \
  --master k8s://https://<k8s_api_server>:443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=sasnouskikh/spark:2.4.1-hadoop_3.2.0 \
  --conf spark.kubernetes.container.image.pullPolicy=Always \
  --conf spark.kubernetes.namespace=default \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.4.1.jar 1000000
Starting from Spark 2.4.0, spark-submit in cluster mode creates the Driver Pod, whose entrypoint runs spark-submit in client mode, just like you try to do in the gist.
So I do not see why you would want to deploy a customized Driver Pod in that particular case.
Most of the --conf options may be moved to defaults, and you will have a pretty compact JSON body.
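For illustration, a sketch of what that trimming might look like, assuming the cluster-wide settings are baked into /opt/spark/conf/spark-defaults.conf of the Livy image (e.g. via the LIVY_SPARK_* env variables handled by the entrypoint):

```bash
# spark-defaults.conf in the Livy container (illustrative):
#   spark.kubernetes.container.image            sasnouskikh/spark:2.4.1-hadoop_3.2.0
#   spark.kubernetes.container.image.pullPolicy Always
#   spark.kubernetes.namespace                  default
# With those defaults in place, the POST body shrinks to:
curl -H 'Content-Type: application/json' -X POST "http://localhost:8998/batches" -d '{
  "name": "spark-pi",
  "numExecutors": 2,
  "file": "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.1.jar",
  "className": "org.apache.spark.examples.SparkPi",
  "args": ["1000000"]
}'
```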
A Pushgateway sidecar may be deployed as a separate Pod; just configure the Prometheus sink with the right pushgateway address. All other configs for Driver Pod customization are already covered by the docs for Spark on Kubernetes.
Have a good week!
I'm getting the following error:
19/04/23 16:26:04 INFO LineBufferedStream: 19/04/23 16:26:04 INFO Client: Deployed Spark application livy-session-0 into Kubernetes.
19/04/23 16:26:04 INFO LineBufferedStream: 19/04/23 16:26:04 INFO ShutdownHookManager: Shutdown hook called
19/04/23 16:26:04 INFO LineBufferedStream: 19/04/23 16:26:04 INFO ShutdownHookManager: Deleting directory /tmp/spark-62b7810e-667d-47e7-9940-72f8cd5f91e9
19/04/23 16:26:04 DEBUG InteractiveSession: InteractiveSession 0 app state changed from RUNNING to FINISHED
19/04/23 16:26:04 DEBUG InteractiveSession: InteractiveSession 0 session state change from starting to dead
19/04/23 16:26:10 DEBUG AbstractByteBuf: -Dio.netty.buffer.bytebuf.checkAccessible: true
19/04/23 16:26:10 DEBUG ResourceLeakDetector: -Dio.netty.leakDetection.level: simple
19/04/23 16:26:10 DEBUG ResourceLeakDetector: -Dio.netty.leakDetection.maxRecords: 4
19/04/23 16:26:10 DEBUG Recycler: -Dio.netty.recycler.maxCapacity.default: 262144
19/04/23 16:26:10 DEBUG Recycler: -Dio.netty.recycler.linkCapacity: 16
19/04/23 16:26:10 DEBUG KryoMessageCodec: Decoded message of type org.apache.livy.rsc.rpc.Rpc$SaslMessage (41 bytes)
19/04/23 16:26:10 DEBUG RpcServer$SaslServerHandler: Handling SASL challenge message...
19/04/23 16:26:10 DEBUG RpcServer$SaslServerHandler: Sending SASL challenge response...
19/04/23 16:26:10 DEBUG KryoMessageCodec: Encoded message of type org.apache.livy.rsc.rpc.Rpc$SaslMessage (98 bytes)
19/04/23 16:26:10 DEBUG KryoMessageCodec: Decoded message of type org.apache.livy.rsc.rpc.Rpc$SaslMessage (275 bytes)
19/04/23 16:26:10 DEBUG RpcServer$SaslServerHandler: Handling SASL challenge message...
19/04/23 16:26:10 DEBUG RpcServer$SaslServerHandler: Sending SASL challenge response...
19/04/23 16:26:10 DEBUG KryoMessageCodec: Encoded message of type org.apache.livy.rsc.rpc.Rpc$SaslMessage (45 bytes)
19/04/23 16:26:10 DEBUG RpcServer$SaslServerHandler: SASL negotiation finished with QOP auth.
19/04/23 16:26:10 DEBUG ContextLauncher: New RPC client connected from [id: 0x2ae2b51a, L:/10.233.94.163:10000 - R:/10.233.94.164:39008].
19/04/23 16:26:10 DEBUG KryoMessageCodec: Decoded message of type org.apache.livy.rsc.rpc.Rpc$MessageHeader (5 bytes)
19/04/23 16:26:10 DEBUG KryoMessageCodec: Decoded message of type org.apache.livy.rsc.BaseProtocol$RemoteDriverAddress (94 bytes)
19/04/23 16:26:10 DEBUG RpcDispatcher: [RegistrationHandler] Received RPC message: type=CALL id=0 payload=org.apache.livy.rsc.BaseProtocol$RemoteDriverAddress
19/04/23 16:26:10 DEBUG ContextLauncher: Received driver info for client [id: 0x2ae2b51a, L:/10.233.94.163:10000 - R:/10.233.94.164:39008]: livy-session-0-1556036763266-driver/10000.
19/04/23 16:26:10 DEBUG KryoMessageCodec: Encoded message of type org.apache.livy.rsc.rpc.Rpc$MessageHeader (5 bytes)
19/04/23 16:26:10 DEBUG KryoMessageCodec: Encoded message of type org.apache.livy.rsc.rpc.Rpc$NullMessage (2 bytes)
19/04/23 16:26:10 DEBUG RpcDispatcher: Channel [id: 0x2ae2b51a, L:/10.233.94.163:10000 ! R:/10.233.94.164:39008] became inactive.
19/04/23 16:26:10 ERROR RSCClient: Failed to connect to context.
java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:101)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
at io.netty.channel.socket.nio.NioSocketChannel.doConnect(NioSocketChannel.java:209)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:207)
at io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1206)
at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:525)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:510)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:492)
at io.netty.channel.DefaultChannelPipeline.connect(DefaultChannelPipeline.java:949)
at io.netty.channel.AbstractChannel.connect(AbstractChannel.java:208)
at io.netty.bootstrap.Bootstrap$2.run(Bootstrap.java:167)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:358)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:394)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112)
at java.lang.Thread.run(Thread.java:748)
19/04/23 16:26:10 ERROR RSCClient: RPC error.
java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:101)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
at io.netty.channel.socket.nio.NioSocketChannel.doConnect(NioSocketChannel.java:209)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:207)
at io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1206)
at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:525)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:510)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:492)
at io.netty.channel.DefaultChannelPipeline.connect(DefaultChannelPipeline.java:949)
at io.netty.channel.AbstractChannel.connect(AbstractChannel.java:208)
at io.netty.bootstrap.Bootstrap$2.run(Bootstrap.java:167)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:358)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:394)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112)
at java.lang.Thread.run(Thread.java:748)
19/04/23 16:26:10 INFO RSCClient: Failing pending job d509417c-c894-416d-8218-625b278da8b7 due to shutdown.
Spark is running in a different namespace than Livy. The Service is also created just before this message appears, so it does not seem to be an ordering error. Am I doing something wrong?
@lukatera Good day,
At first glance I see that you are either using a Livy build that is not from this PR (I've fixed a similar issue in that commit), or your Livy and/or Spark is not configured appropriately.
I need to know more about your environment to move forward. Could you please additionally provide some of the following:
- What is your Kubernetes installation and version?
- What Docker images do you use? What version of Spark is running (this Livy was tested with Spark 2.4.0+, 2.3.* wasn't good enough and had some unpleasant bugs)? From what commit have you built Livy (if you did so)?
- What is the content of livy.conf and livy-client.conf (/opt/livy/conf/...)?
- What are the Spark job configs: `kubectl describe configmap <spark-driver-pod-conf-map> -n <spark-job-namespace>`? What is the JSON body you post to create a session?
- What are the Spark Driver Pod logs?
- If you use Helm charts - what are the versions and what are the custom values you provide on install?
- Any other debugging info you feel may be related?
Currently I run the Livy build from this PR's branch with the provided Helm charts and Docker images, both on Minikube for Windows and on Azure AKS, without issues.
I will be happy to help; thanks for the feedback.
Thanks for the help! I was checking out the master branch of your repo instead of this specific one. All good now!
@lukatera Cool, nice to know. Do not hesitate to ask if you face any problems with it.
Great PR! One suggestion: maybe add the authenticated Livy user to both the driver and executor pods' labels. It should be simple enough, since Spark already supports arbitrary labels through the submit command's `spark.kubernetes.driver.label.[LabelName]`.
@igorcalabria Thanks for the feedback; what mechanism for getting the Livy user value do you propose? I see an option of setting those labels with the proxyUser value on Spark job submission from the POST request to Livy. Did you mean that?
@jahstreet It could be that, but I was thinking about the authenticated user (via Kerberos) making the requests. To give you more context, this could be great for resource usage tracking, especially if Livy has more info available about the principal, like groups or even teams.
I'm not familiar with Livy's codebase, but I'm guessing that the param we want is `owner` on the Session classes:
- https://github.com/apache/incubator-livy/blob/master/server/src/main/scala/org/apache/livy/server/interactive/InteractiveSession.scala#L71
- https://github.com/apache/incubator-livy/blob/master/server/src/main/scala/org/apache/livy/server/batch/BatchSession.scala#L61
@igorcalabria Oh, I see, will try that, thanks.
Shouldn't the Livy Docker/Helm charts also be part of the Livy repository, since it's most likely that users would want to run Livy in a K8s container while launching Spark on K8s? Maybe it can be added as a follow-up task.
Well, that's a good idea. Once this patch is accepted and merged, I would love to take care of that. Thanks for your feedback.
@jahstreet There's a minor issue when an interactive session is recovered from the filesystem. After a restart, Livy correctly recovers the session, but it stops displaying the Spark master's URL on the "Sessions" tab. The config used was pretty standard:
livy.server.recovery.mode = recovery
livy.server.recovery.state-store = filesystem
livy.server.recovery.state-store.url = ...
Livy impersonation seems to not be working. I'm trying to use it with Jupyter and sparkmagic with no luck.
%%configure -f
{
"proxyUser": "customUser"
}
However, I'm not familiar enough with Livy to say how this should work and whether it requires a kerberized HDFS cluster.
If I set the `HADOOP_USER_NAME` env variable on the driver and the executor, it runs stuff on top of Hadoop as that user.
I saw this in the driver logs, however:
19/06/06 14:53:11 DEBUG UserGroupInformation: PrivilegedAction as:customUser (auth:PROXY) via root (auth:SIMPLE) from:org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:150)
Actually I'm not familiar with Livy impersonation and do not know how it should behave. Maybe someone can clarify that?
@jahstreet thanks a lot for your contribution. I'm wondering, do you have a design doc about K8s support in Livy?
@jerryshao thx. I'm finalizing it. Will add to the PR next week.
@jahstreet ping
Here is my view of the design concept I planned to implement: https://github.com/jahstreet/spark-on-kubernetes-helm/blob/develop/README.md
Thanks @jahstreet, I will take a look at it. BTW, it would be better to attach the design doc to the JIRA.
@jahstreet, it looks like if the Livy pod restarts, the application link (that points to the driver UI) for the interactive session no longer appears. The log link also doesn't seem to work. Won't Livy poll K8s to update the driver UI URL, or is something broken in the Livy session recovery? This works for the batch session, though.
Added the fix: https://github.com/apache/incubator-livy/pull/167/files#diff-7649a51ad4bddc91b6f1038e06479d41R404-R409
I'm now trying to get this PR to work and I'm facing an issue where the started driver pod fails to connect back to the Livy server on port 10000.
There's a relevant log line from the driver:
RSCDriver:160 - Connecting to: livy.spark.svc:10000
livy.spark.svc is a valid DNS hostname pointing to the Service that fronts Livy, but the Service only maps port 80. If the driver instead had the Livy server pod IP, it would work. I'm not sure how this is supposed to work.
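For context, a minimal sketch of one way such a callback could be made to work. Assumptions: the Service is named livy in namespace spark, and Livy's RPC launcher uses the default livy.rsc.launcher.port.range; this is not part of the PR, just an illustration:

```bash
# Expose the RSC callback port on the existing Livy Service (port name/number are assumptions).
kubectl -n spark patch service livy --type=json -p='[
  {"op": "add", "path": "/spec/ports/-",
   "value": {"name": "rsc-rpc", "port": 10000, "targetPort": 10000}}
]'
# And make sure livy-client.conf advertises an address the driver pod can resolve, e.g.:
#   livy.rsc.launcher.address    = livy.spark.svc
#   livy.rsc.launcher.port.range = 10000~10010
```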