incubator-livy
[WIP]: [Livy-336]: Livy should not spawn one thread per job to track the job on Yarn (Use YARN REST API)
What changes were proposed in this pull request?
Currently, Livy spawns one thread for each Spark application to track the status of the application on YARN. These threads are called yarnAppMonitorThread-xxxx
and make multiple calls to YARN using YARN's Java API to get the status of running applications or to kill applications. Spawning so many threads is a bottleneck and limits the number of Spark applications that can be submitted in a short span of time.
This pull request addresses the above issue in two ways:
- It creates only one thread to poll YARN and get the status of all Spark applications.
- It replaces the Java API with the YARN REST API to eliminate multiple calls to YARN (see the sketch below).
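To illustrate the first point, here is a minimal sketch (not the actual patch) of a single scheduled task that polls the ResourceManager's Cluster Applications REST endpoint for every tracked application at once. The endpoint URL, polling interval, and the `updateTrackedApps` hook are hypothetical placeholders:

```scala
import java.net.{HttpURLConnection, URL}
import java.util.concurrent.{Executors, TimeUnit}
import scala.io.Source

object YarnAppPoller {
  // Hypothetical ResourceManager REST endpoint; the PR derives it from yarn-site.xml.
  private val rmAppsUrl =
    "http://resourcemanager:8088/ws/v1/cluster/apps?applicationTypes=SPARK"

  private val poller = Executors.newSingleThreadScheduledExecutor()

  private def pollOnce(): Unit = {
    val conn = new URL(rmAppsUrl).openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestProperty("Accept", "application/json")
    try {
      // One response covers every application; parse it and update all tracked apps.
      val json = Source.fromInputStream(conn.getInputStream).mkString
      // updateTrackedApps(json)  // hypothetical hook that fans states out to SparkYarnApp instances
    } finally {
      conn.disconnect()
    }
  }

  // A single scheduled task replaces the per-application yarnAppMonitorThread-xxxx threads.
  def start(): Unit = {
    poller.scheduleWithFixedDelay(new Runnable {
      override def run(): Unit = pollOnce()
    }, 0, 1, TimeUnit.SECONDS)
  }
}
```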
How was this patch tested?
- The existing unit tests and integration tests are updated to reflect the changes in the code.
- Every iteration of this PR was tested internally at PayPal in development and production environments. Some iterations have been running in production for months.
What is not in the scope of this pull request?
Based on the discussion and feedback on this pull request:
- Calls to `yarnClient.kill` won't be replaced with the REST API to kill YARN apps, because the current kill REST API is experimental.
- Maven dependencies to YARN won't be removed because:
- There is minimal benefit in removing all YARN dependencies. Livy still depends on Hadoop libraries.
- If we remove the YARN dependencies, we would need a different way to get the Name Node URL. One way to do that is to add new parameters in `livy.conf`, but that adds complexity and possibly confusion.
Task-url: https://issues.apache.org/jira/browse/LIVY-336
we can drop all maven dependencies to YARN
Not sure that brings a lot of gains; unless something changed, Livy uses a bunch of HDFS APIs already, so removing YARN would probably not remove a whole lot of dependencies.
On top of that, if we remove all dependencies on YARN, we have to put the YARN configs, such as the resource manager REST URL, in `livy.conf`. If we keep the dependencies, we can read the configs from yarn-site.xml and build the REST URL.
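As a rough sketch of that last point (assuming the Hadoop/YARN client jars stay on the classpath), the ResourceManager REST endpoint can be derived from yarn-site.xml via `YarnConfiguration` rather than a new `livy.conf` entry; HTTPS and RM HA are ignored here for brevity:

```scala
import org.apache.hadoop.yarn.conf.YarnConfiguration

// Picks up yarn-site.xml from the classpath, so no extra livy.conf entry is needed.
val yarnConf = new YarnConfiguration()

// yarn.resourcemanager.webapp.address, falling back to the Hadoop default (0.0.0.0:8088).
val webappAddress = yarnConf.get(
  YarnConfiguration.RM_WEBAPP_ADDRESS,
  YarnConfiguration.DEFAULT_RM_WEBAPP_ADDRESS)

// Base URL for the Cluster Applications REST API.
val rmRestUrl = s"http://$webappAddress/ws/v1/cluster"
```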
Codecov Report
Merging #44 into master will decrease coverage by 4.2%. The diff coverage is 47.54%.
@@ Coverage Diff @@
## master #44 +/- ##
============================================
- Coverage 70.73% 66.52% -4.21%
+ Complexity 788 787 -1
============================================
Files 97 99 +2
Lines 5388 5556 +168
Branches 802 850 +48
============================================
- Hits 3811 3696 -115
- Misses 1037 1323 +286
+ Partials 540 537 -3
Impacted Files | Coverage Δ | Complexity Δ
---|---|---
...main/scala/org/apache/livy/server/LivyServer.scala | 1.75% <0%> (-35.31%) | 2 <0> (-8)
...rver/src/main/scala/org/apache/livy/LivyConf.scala | 92.96% <100%> (-2.97%) | 12 <0> (-3)
...rc/main/scala/org/apache/livy/utils/SparkApp.scala | 53.12% <37.5%> (-27.65%) | 1 <0> (ø)
...in/scala/org/apache/livy/utils/YarnInterface.scala | 42.23% <42.23%> (ø) | 20 <20> (?)
...ain/scala/org/apache/livy/utils/SparkYarnApp.scala | 30.26% <59.25%> (-31.71%) | 33 <22> (+4)
...cala/org/apache/livy/utils/ApplicationReport.scala | 93.75% <93.75%> (ø) | 1 <1> (?)
...r/src/main/scala/org/apache/livy/utils/Clock.scala | 42.85% <0%> (-57.15%) | 0% <0%> (ø)
core/src/main/scala/org/apache/livy/Logging.scala | 54.54% <0%> (-18.19%) | 0% <0%> (ø)
... and 25 more | |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
What's the benefit of using REST API?
The main benefit is that with one call we can get all the information we need for the app, that is appId, state, finalStatus, diagnostics, trackingUrl, SparkUiUrl, etc. With YARN's Java API, we can get the appId with one call, but to get the trackingUrl, SparkUiUrl, etc., we have to make extra calls: first we need to get the attemptId, then the containerId, and only then can we get the remaining information.
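As a rough illustration of that call chain (the method names are from `YarnClient`, but the surrounding wrapper is hypothetical and not the exact code the PR touches):

```scala
import org.apache.hadoop.yarn.api.records.ApplicationId
import org.apache.hadoop.yarn.client.api.YarnClient

// With the Java API, several RPCs are needed per application to collect everything.
def fetchViaJavaApi(yarnClient: YarnClient, appId: ApplicationId): String = {
  val report = yarnClient.getApplicationReport(appId)          // RPC 1: state, final status, diagnostics
  val attempt = yarnClient.getApplicationAttemptReport(
    report.getCurrentApplicationAttemptId)                      // RPC 2: current attempt
  val amContainer = yarnClient.getContainerReport(
    attempt.getAMContainerId)                                   // RPC 3: AM container details
  amContainer.getLogUrl                                         // e.g. the driver log URL
}

// With the REST API, a single GET of /ws/v1/cluster/apps returns id, state, finalStatus,
// diagnostics, trackingUrl, etc. for every application in one response.
```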
There is one other potential benefit: we could drop the dependencies on YARN. But, as Marcelo described, the gains from dropping the YARN dependencies are minimal.
I am alright with either YARN's Java API or REST API. I just added this PR because the proposal on the mailing list recommended REST.
Thanks @ajbozarth. This was helpful feedback.
I don't have a strong preference on changing to the REST API, except for the one advantage you mentioned above (getting all the info in one request). I'm not sure how many applications are usually launched from one Livy server; if it is only tens or hundreds of applications, I think the communication overhead should not be big.
Also, avoiding the YARN dependency is not a big problem; we still need to depend on the Hadoop jars anyway.
One difference I can think of is accessing secure Hadoop: because we change to the REST API, we would have to provide the SPNEGO principal/keytab rather than the Livy principal/keytab.
I don't have a strong preference on changing to the REST API, except for the one advantage you mentioned above (getting all the info in one request).
The main gain from this PR is LIVY-336: Livy should not spawn one thread per job to track the job on Yarn.
One difference I can think of is accessing secure Hadoop: because we change to the REST API, we would have to provide the SPNEGO principal/keytab rather than the Livy principal/keytab.
YARN's Cluster Applications API does not need authentication with one exception: deleting a YARN app needs authentication. But the delete API is experimental and I am not sure if it is a good idea to use it.
Since dropping the YARN dependencies has minimal benefits, it makes sense to keep killing YARN apps with the Java API.
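For completeness, a minimal sketch of the kill path that stays on the Java API (the REST kill endpoint being experimental); the wrapper name is just for illustration:

```scala
import org.apache.hadoop.yarn.api.records.ApplicationId
import org.apache.hadoop.yarn.client.api.YarnClient

// Killing still goes through YarnClient; only status polling moves to the REST API.
def killApp(yarnClient: YarnClient, appId: ApplicationId): Unit = {
  yarnClient.killApplication(appId)
}
```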