[LIVY-124] Allow jar/pyfile/file to be uploaded when creating a batch session

Open pmsgd opened this issue 6 years ago • 26 comments

What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/LIVY-124

Allow jar, Python, and other files to be uploaded for a batch session using only an HTTP connection to Livy.

  1. Create a batch session with the delayed flag set
  2. Upload files with set-file, add-file, add-jar and/or add-pyfile
  3. Start the session with start (i.e. submit it to Spark)

How was this patch tested?

  • New unittest for batch session
  • New unittest for batch servlet
  • Tested manually against my spark cluster

pmsgd avatar May 04 '18 13:05 pmsgd

@pmsgd would you please give an example of how to use this new feature?

jerryshao avatar May 07 '18 04:05 jerryshao

@jerryshao We have separate platform and customer apps. They are connected only through an HTTP proxy with authentication and authorization. There is no direct HDFS access, so we need to upload apps to the Spark cluster through Livy. Submit example:

  1. POST /batches with delayed = true
  2. Upload files with one or more of POST /batches/{batchId}/set-file, /batches/{batchId}/add-jar, /batches/{batchId}/add-pyfile and /batches/{batchId}/add-file - these set the file, jars, pyFiles and files variables in the batch request
  3. GET /batches/{batchId}/start - finishes the upload and actually submits the request to Spark
  4. the rest is like standard batch submission

pmsgd avatar May 09 '18 08:05 pmsgd
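
The submit sequence described above could be driven by a client roughly as follows. This is a minimal sketch, not code from the patch: the endpoint paths and the `delayed` flag come from the comment, while the `send` callable, the `submit_delayed_batch` name, the `id` field read from the create response, and passing a file name as the upload body are assumptions made for illustration.

```python
def submit_delayed_batch(send, main_file, jars=(), py_files=(), files=()):
    """Drive the create / upload / start sequence through a caller-supplied transport.

    `send(method, path, body)` is a hypothetical callable that performs one HTTP
    request and returns the decoded response as a dict.
    """
    # Step 1: create the batch but hold it back from Spark.
    created = send("POST", "/batches", {"delayed": True})
    base = f"/batches/{created['id']}"  # assumes the response carries the batch id

    # Step 2: one call per uploaded file.
    send("POST", f"{base}/set-file", main_file)
    for jar in jars:
        send("POST", f"{base}/add-jar", jar)
    for py_file in py_files:
        send("POST", f"{base}/add-pyfile", py_file)
    for extra in files:
        send("POST", f"{base}/add-file", extra)

    # Step 3: signal that all uploads are done and the job can be submitted.
    return send("GET", f"{base}/start", None)
```

Injecting `send` keeps the sketch independent of any particular HTTP library and makes the call order easy to check with a stub.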

Codecov Report

Merging #91 into master will decrease coverage by 0.41%. The diff coverage is 53.01%.


@@             Coverage Diff              @@
##             master      #91      +/-   ##
============================================
- Coverage     71.49%   71.08%   -0.42%     
- Complexity      793      800       +7     
============================================
  Files            97       97              
  Lines          5402     5474      +72     
  Branches        801      821      +20     
============================================
+ Hits           3862     3891      +29     
- Misses         1019     1049      +30     
- Partials        521      534      +13
Impacted Files Coverage Δ Complexity Δ
.../apache/livy/server/batch/CreateBatchRequest.scala 67.64% <100%> (+2.02%) 20 <1> (+2) :arrow_up:
...apache/livy/server/batch/BatchSessionServlet.scala 51.85% <28.12%> (-33.87%) 3 <0> (ø)
...la/org/apache/livy/server/batch/BatchSession.scala 78.57% <66.66%> (-7.95%) 18 <7> (+5)
...cala/org/apache/livy/scalaapi/ScalaJobHandle.scala 52.94% <0%> (-2.95%) 0% <0%> (ø)
...main/java/org/apache/livy/rsc/ContextLauncher.java 81.86% <0%> (-2.46%) 18% <0%> (ø)
...ain/java/org/apache/livy/rsc/driver/RSCDriver.java 79.41% <0%> (-0.85%) 42% <0%> (-1%)
...in/java/org/apache/livy/rsc/rpc/RpcDispatcher.java 67% <0%> (+3%) 20% <0%> (+1%) :arrow_up:

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update e3f45a0...5c343de.

codecov-io avatar May 09 '18 11:05 codecov-io

Thanks for the info. The API usage does not seem very straightforward to me. Let me think about this a bit.

jerryshao avatar May 09 '18 11:05 jerryshao

My goal was a small addition with minimal changes to the current API. So far, only the batch request is extended with a delayed flag, plus 4 new methods for file uploads. The first call (batch create) creates an HDFS temp area for uploads and ensures that all files are deleted after the job finishes. Uploads should definitely be separate calls for each file. The API could be a single upload method with a type parameter, but it would then be completely different from the interactive session API. I think a similar API for batches and interactive sessions is better. The last call is "start". Do you know a better way to let Livy know that all files are uploaded and the job can be submitted?

pmsgd avatar May 09 '18 12:05 pmsgd

I was wondering whether we can:

  1. Associate file uploading with the session creation POST request, so that we can handle this in one request. I'm not sure whether the HTTP protocol supports this.
  2. Leave the semantics of the session creation request unchanged, upload files to Livy Server before session creation, and use these files for session creation.

I understand that your approach might be the easiest way to achieve this feature so far, but it also makes the request a little strange. I'm just wondering whether there's a better solution for this.

jerryshao avatar May 09 '18 12:05 jerryshao

  1. Yes, it is possible (see https://tools.ietf.org/html/rfc7578#section-4.3), but it is not very common. For reliability it is better to split the upload into separate calls, since there is no reconnect support. I don't know how Scalatra/Livy handles file uploads, but a very common practice is to load the complete file(s) into memory before processing, so we could run into memory problems with larger apps.
  2. If a file is uploaded before the session exists, it can sit on HDFS indefinitely without a session. Some timeout is possible, but such a maintenance service starts to look like a basic REST HDFS API. Moreover, the interactive session upload would then also have to be refactored and handled this way (at least as an optional upload possibility).

pmsgd avatar May 09 '18 13:05 pmsgd

For the memory issue, I think we can stream the file input to disk, which would avoid the memory problem.

Can you explain more about this: "if file is uploaded before session it can hang on hdfs indefinitely without session"? I'm not sure why it would stay on HDFS indefinitely.

jerryshao avatar May 10 '18 00:05 jerryshao
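
Streaming an upload to disk, as suggested above, means copying it in bounded chunks instead of buffering the whole file. The following is a minimal Python sketch of that idea only; it is not Livy or Scalatra code, and the `stream_to_disk` name and the 64 KiB default chunk size are illustrative assumptions.

```python
def stream_to_disk(source, dest_path, chunk_size=64 * 1024):
    """Copy a binary file-like object to dest_path chunk by chunk.

    At most `chunk_size` bytes are held in memory at a time.
    Returns the number of bytes written.
    """
    written = 0
    with open(dest_path, "wb") as out:
        while True:
            chunk = source.read(chunk_size)
            if not chunk:  # an empty read means end of stream
                break
            out.write(chunk)
            written += len(chunk)
    return written
```

Memory use is then bounded by the chunk size rather than by the size of the uploaded application.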

Imagine this scenario:

  1. client uploads some file
  2. client crashes
  3. file is uploaded to hdfs but session is never started and file never deleted

After some time HDFS will be filled with these "orphan" files. I think a periodic maintenance task for deleting old unused files is worse than one REST call.

pmsgd avatar May 10 '18 07:05 pmsgd

> After some time HDFS will be filled with these "orphan" files. I think a periodic maintenance task for deleting old unused files is worse than one REST call.

This should not be a big problem. We don't have to upload files to HDFS; we can cache the files locally in LivyServer. When LivyServer starts the Spark application, SparkSubmit will upload the dependencies automatically. We can also add a retention mechanism to clean up orphaned files.

jerryshao avatar May 10 '18 07:05 jerryshao
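
A retention mechanism of the kind mentioned above could be as simple as a periodic sweep of a local staging directory. The sketch below is an assumption-laden illustration rather than anything proposed in code in this thread: the `sweep_orphans` name, the flat directory layout, and the age-only rule are all hypothetical, and a real implementation would also need to skip files that belong to a live session.

```python
import os
import time


def sweep_orphans(staging_dir, max_age_seconds, now=None):
    """Delete regular files in staging_dir older than max_age_seconds.

    Age is measured from the file's last modification time.
    Returns the sorted names of the files that were removed.
    """
    now = time.time() if now is None else now
    removed = []
    for name in sorted(os.listdir(staging_dir)):
        path = os.path.join(staging_dir, name)
        if os.path.isfile(path) and now - os.path.getmtime(path) > max_age_seconds:
            os.remove(path)
            removed.append(name)
    return removed
```

The `now` parameter exists only so the age check can be exercised deterministically.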

Can you please check how other Spark services handle this file upload issue, such as Toree, Spark JobServer, or others?

jerryshao avatar May 10 '18 07:05 jerryshao

Looks like Toree is only for interactive mode without uploads at all. JobServer has something like simple filesystem with upload and delete support.

pmsgd avatar May 10 '18 11:05 pmsgd

Keeping old unused files is a bad idea everywhere. A new retention mechanism is possible, but why not use something that is already in use and working? And if we go with uploads that are independent of batches, the next step is a filesystem similar to the one in JobServer.

pmsgd avatar May 10 '18 11:05 pmsgd

@zjffdu @ajbozarth can you please comment on this feature, especially on the API design side. Thanks!

jerryshao avatar May 14 '18 03:05 jerryshao

How about we design new APIs for such a usage scenario?

  1. Request a new session/batch id from Livy server.
  2. Start preparing new resources (jars, py files, archives).
  3. Submit JSON data to start this session/batch.

This is similar to the process of submitting an application on YARN: 1) request a new application id, 2) prepare all the resources needed to launch the application, 3) start the application.

What do you think @pmsgd ?

jerryshao avatar May 14 '18 06:05 jerryshao
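
The reserve / prepare / submit lifecycle proposed above can be modelled in a few lines. This is a toy in-memory sketch under stated assumptions, not a proposed Livy API: the `StagedSubmissions` class, its method names, and the shape of the returned dict are all hypothetical.

```python
import itertools


class StagedSubmissions:
    """Toy model of the three-step flow: reserve an id, stage resources, start."""

    def __init__(self):
        self._ids = itertools.count()
        self._pending = {}  # reserved id -> names of staged resources

    def reserve(self):
        """Step 1: hand out a new id before anything is submitted."""
        sid = next(self._ids)
        self._pending[sid] = []
        return sid

    def add_resource(self, sid, name):
        """Step 2: attach a resource; raises KeyError if sid is unknown or started."""
        self._pending[sid].append(name)

    def start(self, sid, request):
        """Step 3: combine the staged resources with the request and close staging."""
        resources = self._pending.pop(sid)
        return {"id": sid, "resources": resources, **request}
```

Removing the id from the pending map in `start` is what prevents resources from being attached after submission.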

You have described this pull request almost exactly. The only difference is in steps 1 and 3: most of the batch information is sent in step 1 rather than step 3, and step 3 is only the command to start the Spark job.

pmsgd avatar May 14 '18 07:05 pmsgd

Though they're similar, the semantics are different. In your proposal you're doing POST/POST/GET, but here it is GET/POST/POST.

I was thinking that:

  1. We should add the new API endpoints to support this feature, not just one flag in the request.
  2. We should support both interactive and batch sessions.
  3. We should manage the dependencies locally in LivyServer, not in HDFS.

jerryshao avatar May 14 '18 08:05 jerryshao

Are you really talking about a complete API rework for batches and interactive sessions? My goal was to keep the current API as is, with only minimal backward-compatible changes. I definitely do not have enough time or Scala skills for such a big task.

pmsgd avatar May 14 '18 08:05 pmsgd

Yes, I'm thinking of adding a new API set for such usage, since this is quite common in environments where users cannot submit dependencies to HDFS before launching the session. We can add a version flag to support this new API set. There shouldn't be a lot of changes in the background.

jerryshao avatar May 14 '18 08:05 jerryshao

FYI I'm reading through both the discussion and the code on this trying to solidify my position before giving a longer review/opinion

ajbozarth avatar May 14 '18 18:05 ajbozarth

Ok so I've been thinking about it for a while and though I have quite a few issues with the details of this implementation (which I'll submit as a detailed review if we decide to move forward with this strategy), I like this strategy better. I'll put my more detailed thoughts into the new JIRA: https://issues.apache.org/jira/browse/LIVY-471

ajbozarth avatar May 18 '18 18:05 ajbozarth

Any progress on this functionality? For me, supporting uploads of jars via the REST API is very desirable.

martinhartig avatar Feb 26 '19 15:02 martinhartig

Hello Martin, no further progress from me - we decided to use another solution.

pmsgd avatar Feb 27 '19 14:02 pmsgd

What is the "another solution"? Can you point me to a PR?

martinhartig avatar Mar 06 '19 16:03 martinhartig

I am not authorized to post this information, but it is a commercial solution, not open source.

pmsgd avatar Mar 07 '19 10:03 pmsgd

It's a useful feature; with it, users can submit local jars to a remote Spark cluster.

mashuai191 avatar Apr 20 '20 10:04 mashuai191