incubator-livy
[LIVY-124] Allow jar/pyfile/file to be uploaded when creating a batch session
What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/LIVY-124
Allow uploading jar/Python/other files for a batch session purely over the Livy HTTP connection.
- Create batch session with delayed flag
- Upload files with set-file, add-file, add-jar and/or add-pyfile
- Start session with start (i.e. submit to spark)
How was this patch tested?
- New unit tests for the batch session
- New unit tests for the batch servlet
- Tested manually against my Spark cluster
@pmsgd would you please give an example of how to use this new feature?
@jerryshao We have separate platform and customer apps. They are connected only through an HTTP proxy with authentication and authorization. There is no direct HDFS access, so we need to upload apps to the Spark cluster through Livy. Submission example:
- POST /batches with delayed = true
- Upload files with one or more of POST /batches/{batchId}/set-file, /batches/{batchId}/add-jar, /batches/{batchId}/add-pyfile and /batches/{batchId}/add-file - these set the file, jars, pyFiles and files variables in the batch request
- GET /batches/{batchId}/start - finish the upload and actually submit the request to Spark
- the rest works like a standard batch submission
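The three steps above can be sketched as (method, path, body) tuples. The endpoint paths come from this PR's description; the payload field names besides `delayed` follow the existing batch API and may differ in the actual implementation - this is an illustration, not tested client code.

```python
import json

BASE = "/batches"

def create_delayed_batch(file, class_name):
    # Step 1: create the batch with delayed=true; nothing is submitted yet.
    body = json.dumps({"file": file, "className": class_name, "delayed": True})
    return ("POST", BASE, body)

def add_jar(batch_id, local_path):
    # Step 2: upload one dependency per call (add-pyfile, add-file and
    # set-file work the same way for other file types).
    return ("POST", f"{BASE}/{batch_id}/add-jar", local_path)

def start(batch_id):
    # Step 3: all files are in place -- really submit the job to Spark.
    return ("GET", f"{BASE}/{batch_id}/start", None)

steps = [
    create_delayed_batch("app.jar", "com.example.Main"),
    add_jar(7, "lib/dep.jar"),
    start(7),
]
for method, path, _ in steps:
    print(method, path)
```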
Codecov Report
Merging #91 into master will decrease coverage by 0.41%. The diff coverage is 53.01%.
```diff
@@             Coverage Diff             @@
##             master      #91     +/-   ##
============================================
- Coverage     71.49%   71.08%   -0.42%
- Complexity      793      800       +7
============================================
  Files            97       97
  Lines          5402     5474      +72
  Branches        801      821      +20
============================================
+ Hits           3862     3891      +29
- Misses         1019     1049      +30
- Partials        521      534      +13
```
| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| .../apache/livy/server/batch/CreateBatchRequest.scala | 67.64% <100%> (+2.02%) :arrow_up: | 20 <1> (+2) |
| ...apache/livy/server/batch/BatchSessionServlet.scala | 51.85% <28.12%> (-33.87%) | 3 <0> (ø) |
| ...la/org/apache/livy/server/batch/BatchSession.scala | 78.57% <66.66%> (-7.95%) | 18 <7> (+5) |
| ...cala/org/apache/livy/scalaapi/ScalaJobHandle.scala | 52.94% <0%> (-2.95%) | 0% <0%> (ø) |
| ...main/java/org/apache/livy/rsc/ContextLauncher.java | 81.86% <0%> (-2.46%) | 18% <0%> (ø) |
| ...ain/java/org/apache/livy/rsc/driver/RSCDriver.java | 79.41% <0%> (-0.85%) | 42% <0%> (-1%) |
| ...in/java/org/apache/livy/rsc/rpc/RpcDispatcher.java | 67% <0%> (+3%) :arrow_up: | 20% <0%> (+1%) |
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e3f45a0...5c343de.
Thanks for the info. The API usage doesn't seem so straightforward from my point of view. Let me think a bit about this.
My goal was a small addition with minimal changes to the current API. So far, only the batch request is enhanced with a delayed flag, plus 4 new methods for file uploads. The first call (batch create) creates an HDFS temp area for uploads and ensures that all files will be deleted after the job is finished. Uploads should definitely be separate calls for each file. The API could be a single upload method with a type parameter, but then it would be completely different from the interactive session. I think a similar API for batches and interactive sessions is better. The last call is "start" - do you know a better way to let Livy know that all files are uploaded and the job can be submitted?
I was thinking if we can:
- associate file uploading with the session creation POST request, so that we can handle this in one request; not sure if the HTTP protocol supports this.
- don't change the semantics of the session creation request: upload files to the Livy server before session creation, and use these files for session creation.
I understand that so far your approach might be the easiest way to achieve this feature, but it also makes the request a little strange. I'm just wondering if there's any better solution for this.
- yes, it is possible (see https://tools.ietf.org/html/rfc7578#section-4.3) but not very common. For better reliability it is better to split the upload into separate calls (there is no reconnect support). I don't know how Scalatra/Livy handles file uploads, but a very common practice is to load the complete file(s) into memory before processing. We could run into memory problems with larger apps.
- if a file is uploaded before the session, it can sit on HDFS indefinitely without a session. Some timeout is possible, but such a maintenance service starts to look like a basic REST HDFS API. Even more, the interactive session upload would then also have to be refactored and handled this way (at least as an optional upload possibility).
For the memory issue, I think we can stream the file input to disk; that will handle the memory issue.
Can you explain more about "if a file is uploaded before the session, it can hang on HDFS indefinitely without a session"? I'm not sure why it would hang on HDFS indefinitely.
Imagine this scenario:
- client uploads some file
- client crashes
- the file is uploaded to HDFS, but the session is never started and the file is never deleted
After some time HDFS will be filled with these "orphan" files. I think a periodic maintenance task for deleting old unused files is worse than one REST call.
> After some time HDFS will be filled with these "orphan" files. I think a periodic maintenance task for deleting old unused files is worse than one REST call.
This should not be a big problem. We don't have to upload files to HDFS; we can cache the files locally on the LivyServer. When the LivyServer starts the Spark application, SparkSubmit will upload the dependencies automatically. We can also add a retention mechanism to clean up orphaned files.
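The "stream the file input into disk" idea can be sketched as below. This is an illustration only, not Livy's actual servlet code: the incoming upload is copied to a local temp file in small chunks, so even a large jar never needs to be held fully in memory.

```python
import io
import os
import tempfile

CHUNK_SIZE = 64 * 1024  # copy in 64 KiB pieces; never buffer the whole file

def stream_upload_to_disk(src, dest_dir):
    """Copy an incoming upload stream chunk-by-chunk into a temp file
    under dest_dir and return the path of the saved file."""
    fd, path = tempfile.mkstemp(dir=dest_dir)
    with os.fdopen(fd, "wb") as out:
        while True:
            chunk = src.read(CHUNK_SIZE)
            if not chunk:
                break
            out.write(chunk)
    return path

# Simulated upload: a 1 MiB payload arrives as a stream and lands on disk.
with tempfile.TemporaryDirectory() as d:
    payload = b"x" * (1024 * 1024)
    saved = stream_upload_to_disk(io.BytesIO(payload), d)
    with open(saved, "rb") as f:
        assert f.read() == payload
```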
Can you please check how other Spark services handle such file upload issues? Like Toree, Spark JobServer, or maybe others.
It looks like Toree is only for interactive mode, without uploads at all. JobServer has something like a simple filesystem with upload and delete support.
Keeping old unused files is a bad idea everywhere. A new retention mechanism is possible, but why not use something that is currently in use and working? And if we go with uploads independent of batches, the next step is a similar filesystem like the one in JobServer.
@zjffdu @ajbozarth can you please comment on this feature, especially on the API design side. Thanks!
How about we design new APIs for such a usage scenario?
- Request a new session/batch id from the Livy server.
- Start preparing the new resources (jars, py files, archives).
- Submit JSON data to start this session/batch.
This is a similar process to submitting an application on YARN: 1) request a new application id; 2) prepare all the resources to launch the application; 3) start the application.
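The YARN-like flow above could look something like this. None of these paths exist in Livy; they are assumptions made up for the discussion, shown only to contrast the ordering with the PR's current design.

```python
import json

def request_batch_id():
    # 1) ask the server for a fresh batch id; no job details are sent yet
    return ("POST", "/batches/new", None)

def upload_resource(batch_id, kind, local_path):
    # 2) upload jars / py files / archives, one call per resource
    return ("POST", f"/batches/{batch_id}/resources/{kind}", local_path)

def submit_batch(batch_id, spec):
    # 3) send the full JSON spec to actually launch the application
    return ("POST", f"/batches/{batch_id}", json.dumps(spec))

flow = [
    request_batch_id(),
    upload_resource(12, "jars", "lib/dep.jar"),
    submit_batch(12, {"file": "app.jar", "className": "com.example.Main"}),
]
```

Note that, unlike the PR, the full job specification arrives in the last call rather than the first, so the id returned in step 1 carries no semantics until submission.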
What do you think @pmsgd ?
You almost exactly described this pull request. The only difference is between steps 1 and 3: here, most of the batch information is sent in step 1, not step 3; step 3 is only the command to start the Spark job.
Though they're similar, the semantics are different. In your proposal you're doing POST/POST/GET, but here it is GET/POST/POST.
I was thinking that:
- We should add new API endpoints to support this feature, not just one flag in the request.
- We should support both interactive and batch session.
- We should manage the dependencies locally in LivyServer, not in HDFS.
Are you really talking about a complete API rework for batches and interactive sessions? My goal was to keep the current API as-is, with only minimal backward compatible changes. I definitely don't have enough time or Scala skills for such a big task.
Yes, I'm thinking of adding a new API set for this usage, since it is quite common in environments where the user cannot submit dependencies to HDFS before launching the session. We can add a version flag to support this new API set. There shouldn't be many changes in the background.
FYI I'm reading through both the discussion and the code on this trying to solidify my position before giving a longer review/opinion
OK, so I've been thinking about it for a while, and though I have quite a few issues with the details of this implementation (which I'll submit as a detailed review if we decide to move forward with this strategy), I like this strategy better. I'll put my more detailed thoughts into the new JIRA: https://issues.apache.org/jira/browse/LIVY-471
Any progress on this functionality? For me, supporting uploads of jars via the REST API is very desirable.
Hello Martin, no further progress from me - we decided to use another solution.
What is the "another solution"? Can you point me to a PR?
I am not authorized to post this information, but it is a commercial solution, not open source.
It's a useful feature; with this, users can submit local jars to a remote Spark cluster.