incubator-livy
[LIVY-124] Allow jar/pyfile/file to be uploaded when creating a batch session
What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/LIVY-124
Allow uploading jar/Python/other files for a batch session purely over the Livy HTTP connection.
- Create batch session with delayed flag
- Upload files with set-file, add-file, add-jar and/or add-pyfile
- Start session with start (i.e. submit to spark)
How was this patch tested?
- New unit tests for the batch session
- New unit tests for the batch servlet
- Tested manually against my Spark cluster
@pmsgd would you please give an example of how to use this new feature?
@jerryshao We have separate platform and customer apps. They are connected only through an HTTP proxy with authentication and authorization. There is no direct HDFS access, so we need to upload apps to the Spark cluster through Livy. Submission example:
- POST /batches with delayed = true
- Upload files with one or more of POST /batches/{batchId}/set-file, /batches/{batchId}/add-jar, /batches/{batchId}/add-pyfile and /batches/{batchId}/add-file - these set the file, jars, pyFiles and files variables in the batch request
- GET /batches/{batchId}/start - finish the upload and actually submit the request to Spark
- the rest works like a standard batch submission
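The three steps above can be sketched as (method, path, body) tuples. The endpoint paths come from this PR's description; the payload field names besides `delayed` follow the existing batch API and may differ in the actual implementation - this is an illustration, not tested client code.

```python
import json

BASE = "/batches"

def create_delayed_batch(file, class_name):
    # Step 1: create the batch with delayed=true; nothing is submitted yet.
    body = json.dumps({"file": file, "className": class_name, "delayed": True})
    return ("POST", BASE, body)

def add_jar(batch_id, local_path):
    # Step 2: upload one dependency per call (add-pyfile, add-file and
    # set-file work the same way for other file types).
    return ("POST", f"{BASE}/{batch_id}/add-jar", local_path)

def start(batch_id):
    # Step 3: all files are in place -- really submit the job to Spark.
    return ("GET", f"{BASE}/{batch_id}/start", None)

steps = [
    create_delayed_batch("app.jar", "com.example.Main"),
    add_jar(7, "lib/dep.jar"),
    start(7),
]
for method, path, _ in steps:
    print(method, path)
```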
Codecov Report
Merging #91 into master will decrease coverage by 0.41%. The diff coverage is 53.01%.
```diff
@@             Coverage Diff             @@
##             master      #91     +/-   ##
============================================
- Coverage     71.49%   71.08%   -0.42%
- Complexity      793      800       +7
============================================
  Files            97       97
  Lines          5402     5474      +72
  Branches        801      821      +20
============================================
+ Hits           3862     3891      +29
- Misses         1019     1049      +30
- Partials        521      534      +13
```
| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| .../apache/livy/server/batch/CreateBatchRequest.scala | 67.64% <100%> (+2.02%) :arrow_up: | 20 <1> (+2) |
| ...apache/livy/server/batch/BatchSessionServlet.scala | 51.85% <28.12%> (-33.87%) | 3 <0> (ø) |
| ...la/org/apache/livy/server/batch/BatchSession.scala | 78.57% <66.66%> (-7.95%) | 18 <7> (+5) |
| ...cala/org/apache/livy/scalaapi/ScalaJobHandle.scala | 52.94% <0%> (-2.95%) | 0% <0%> (ø) |
| ...main/java/org/apache/livy/rsc/ContextLauncher.java | 81.86% <0%> (-2.46%) | 18% <0%> (ø) |
| ...ain/java/org/apache/livy/rsc/driver/RSCDriver.java | 79.41% <0%> (-0.85%) | 42% <0%> (-1%) |
| ...in/java/org/apache/livy/rsc/rpc/RpcDispatcher.java | 67% <0%> (+3%) :arrow_up: | 20% <0%> (+1%) |
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e3f45a0...5c343de.
Thanks for the info. The API usage doesn't seem so straightforward from my point of view. Let me think a bit about this.
My goal was a small addition with minimal changes to the current API. So far, only the batch request is enhanced with a delayed flag, plus 4 new methods for file uploads. The first call (batch create) creates an HDFS temp area for uploads and ensures that all files will be deleted after the job is finished. Uploads should definitely be separate calls for each file. The API could be a single upload method with a type parameter, but then it would be completely different from the interactive session. I think a similar API for batches and interactive sessions is better. The last call is "start" - do you know a better way to let Livy know that all files are uploaded and the job can be submitted?
I was thinking if we can:
- associate file uploading with the session creation POST request, so that we can handle this in one request; not sure if the HTTP protocol supports this.
- don't change the semantics of the session creation request: upload files to the Livy server before session creation, and use these files for session creation.
I understand that so far your approach might be the easiest way to achieve this feature, but it also makes the request a little strange. I'm just wondering if there's any better solution for this.
- yes, it is possible (see https://tools.ietf.org/html/rfc7578#section-4.3) but not very common. For better reliability it is better to split the upload into separate calls (there is no reconnect support). I don't know how Scalatra/Livy handles file uploads, but a very common practice is to load the complete file(s) into memory before processing. We could run into memory problems with larger apps.
- if a file is uploaded before the session, it can sit on HDFS indefinitely without a session. Some timeout is possible, but such a maintenance service starts to look like a basic REST HDFS API. Even more, the interactive session upload would then also have to be refactored and handled this way (at least as an optional upload possibility).
For the memory issue, I think we can stream the file input to disk; that will handle the memory issue.
Can you explain more about "if a file is uploaded before the session, it can hang on HDFS indefinitely without a session"? I'm not sure why it would hang on HDFS indefinitely.
Imagine this scenario:
- client uploads some file
- client crashes
- the file is uploaded to HDFS, but the session is never started and the file is never deleted
After some time HDFS will be filled with these "orphan" files. I think a periodic maintenance task for deleting old unused files is worse than one REST call.
> After some time HDFS will be filled with these "orphan" files. I think a periodic maintenance task for deleting old unused files is worse than one REST call.
This should not be a big problem. We don't have to upload files to HDFS; we can cache the files locally on the LivyServer. When the LivyServer starts the Spark application, SparkSubmit will upload the dependencies automatically. We can also add a retention mechanism to clean up orphaned files.
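The "stream the file input into disk" idea can be sketched as below. This is an illustration only, not Livy's actual servlet code: the incoming upload is copied to a local temp file in small chunks, so even a large jar never needs to be held fully in memory.

```python
import io
import os
import tempfile

CHUNK_SIZE = 64 * 1024  # copy in 64 KiB pieces; never buffer the whole file

def stream_upload_to_disk(src, dest_dir):
    """Copy an incoming upload stream chunk-by-chunk into a temp file
    under dest_dir and return the path of the saved file."""
    fd, path = tempfile.mkstemp(dir=dest_dir)
    with os.fdopen(fd, "wb") as out:
        while True:
            chunk = src.read(CHUNK_SIZE)
            if not chunk:
                break
            out.write(chunk)
    return path

# Simulated upload: a 1 MiB payload arrives as a stream and lands on disk.
with tempfile.TemporaryDirectory() as d:
    payload = b"x" * (1024 * 1024)
    saved = stream_upload_to_disk(io.BytesIO(payload), d)
    with open(saved, "rb") as f:
        assert f.read() == payload
```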
Can you please check how other Spark services handle such file upload issues? Like Toree, Spark JobServer, or maybe others.
It looks like Toree is only for interactive mode, without uploads at all. JobServer has something like a simple filesystem with upload and delete support.
Keeping old unused files is a bad idea everywhere. A new retention mechanism is possible, but why not use something that is currently in use and working? And if we go with uploads independent of batches, the next step is a similar filesystem like the one in JobServer.
@zjffdu @ajbozarth can you please comment on this feature, especially on the API design side. Thanks!
How about we design new APIs for such a usage scenario?
- Request a new session/batch id from the Livy server.
- Start preparing the new resources (jars, py files, archives).
- Submit JSON data to start this session/batch.
This is a similar process to submitting an application on YARN: 1) request a new application id; 2) prepare all the resources to launch the application; 3) start the application.
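The YARN-like flow above could look something like this. None of these paths exist in Livy; they are assumptions made up for the discussion, shown only to contrast the ordering with the PR's current design.

```python
import json

def request_batch_id():
    # 1) ask the server for a fresh batch id; no job details are sent yet
    return ("POST", "/batches/new", None)

def upload_resource(batch_id, kind, local_path):
    # 2) upload jars / py files / archives, one call per resource
    return ("POST", f"/batches/{batch_id}/resources/{kind}", local_path)

def submit_batch(batch_id, spec):
    # 3) send the full JSON spec to actually launch the application
    return ("POST", f"/batches/{batch_id}", json.dumps(spec))

flow = [
    request_batch_id(),
    upload_resource(12, "jars", "lib/dep.jar"),
    submit_batch(12, {"file": "app.jar", "className": "com.example.Main"}),
]
```

Note that, unlike the PR, the full job specification arrives in the last call rather than the first, so the id returned in step 1 carries no semantics until submission.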
What do you think @pmsgd ?
You almost exactly described this pull request. The only difference is between steps 1 and 3: here, most of the batch information is sent in step 1, not step 3; step 3 is only the command to start the Spark job.
Though they're similar, the semantics are different. In your proposal you're doing POST/POST/GET, but here it is GET/POST/POST.
I was thinking that:
- We should add new API endpoints to support this feature, not just one flag in the request.
- We should support both interactive and batch session.
- We should manage the dependencies locally in LivyServer, not in HDFS.
Are you really talking about a complete API rework for batches and interactive sessions? My goal was to keep the current API as-is, with only minimal backward compatible changes. I definitely don't have enough time or Scala skills for such a big task.
Yes, I'm thinking of adding a new API set for this usage, since it is quite common in environments where the user cannot submit dependencies to HDFS before launching the session. We can add a version flag to support this new API set. There shouldn't be many changes in the background.
FYI I'm reading through both the discussion and the code on this trying to solidify my position before giving a longer review/opinion
OK, so I've been thinking about it for a while, and though I have quite a few issues with the details of this implementation (which I'll submit as a detailed review if we decide to move forward with this strategy), I like this strategy better. I'll put my more detailed thoughts into the new JIRA: https://issues.apache.org/jira/browse/LIVY-471
Any progress on this functionality? For me, supporting uploads of jars via the REST API is very desirable.
Hello Martin, no further progress from me - we decided to use another solution.
What is the "another solution"? Can you point me to a PR?
I am not authorized to post this information, but it is a commercial solution, not open source.
It's a useful feature; with this, users can submit local jars to a remote Spark cluster.