
Develop external tool infrastructure for running cluster computing jobs from Dataverse

Open · landreev opened this issue on Mar 18, 2024 · 1 comment

(We may come up with a better title for this issue; we may also move it to a different repo, since most of this will need to be developed outside of the main Dataverse source tree.)

This continues and builds on the proof-of-concept demo we put together at MOC. The working assumption is that we will be able to keep using the MOC facilities for this.

The next phase of the effort is to develop the one step that was skipped in the PoC configuration: an intermediate service sitting between Dataverse and the actual computing nodes. It is similar in function to the redirecting script we have in place for the Binder service, except we want it to do much more:

  • Rather than simply redirecting to another running node, the service will let a user specify the parameters for the computing resources they need and spin up an OpenShift pod (?) to run their computations (see the first sketch after this list);
  • The resources will be requested and allocated from the user's own budget using their cluster account;
  • We will need to develop better/cleaner library code for obtaining local storage access points for the files in a dataset on the Dataverse instance; "local" should probably include support both for S3, as in the demo setup, and for block access (see the second sketch after this list). There was some discussion of submitting this code for inclusion in pyDataverse.
  • The plan is to start with Python, for jobs similar to the notebook used in the demo presentation, with support for more languages (R is the next logical choice) to follow.
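
As a very rough illustration of the pod-launching step, here is a minimal sketch assuming the OpenShift cluster exposes the standard Kubernetes API and that credentials for the user's own cluster account are already configured; the image name, namespace, and environment variables are hypothetical placeholders, not decisions:

```python
# Minimal sketch of the pod-launching step. Assumes the OpenShift cluster
# accepts standard Kubernetes API calls and a kubeconfig for the user's
# own cluster account is available. Image, namespace, and env var names
# are hypothetical placeholders.
from kubernetes import client, config


def launch_compute_pod(dataset_pid: str, api_token: str,
                       cpu: str = "2", memory: str = "4Gi") -> str:
    """Spin up a single pod that runs a user's job against one dataset."""
    config.load_kube_config()  # or load_incluster_config() inside the cluster
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(generate_name="dv-compute-"),
        spec=client.V1PodSpec(
            restart_policy="Never",
            containers=[client.V1Container(
                name="job",
                image="example/dv-python-runner:latest",  # hypothetical image
                env=[
                    client.V1EnvVar(name="DATASET_PID", value=dataset_pid),
                    client.V1EnvVar(name="DATAVERSE_API_TOKEN", value=api_token),
                ],
                # The resource sizes come straight from the parameters the
                # user specified, so they are billed against the user's
                # own budget on their cluster account.
                resources=client.V1ResourceRequirements(
                    requests={"cpu": cpu, "memory": memory},
                    limits={"cpu": cpu, "memory": memory},
                ),
            )],
        ),
    )
    created = client.CoreV1Api().create_namespaced_pod(
        namespace="dv-compute", body=pod)  # hypothetical namespace
    return created.metadata.name
```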
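
And a rough sketch of the kind of file-access helper the third bullet describes, built on the Dataverse native API's dataset files listing. It assumes the compute pod has been granted read access to the S3 bucket backing the installation, and that storageIdentifier values follow the "s3://&lt;bucket&gt;:&lt;object-id&gt;" convention; the exact format varies per installation, and the block-storage branch is left as a stub:

```python
# Rough sketch of the file-access helper. Assumes the compute pod can
# read the S3 bucket backing the Dataverse installation and that
# storageIdentifier follows the "s3://<bucket>:<object-id>" convention
# (details vary per installation).
import requests


def local_access_points(base_url: str, dataset_pid: str, api_token: str):
    """Map each file in a dataset to a local storage access point."""
    resp = requests.get(
        f"{base_url}/api/datasets/:persistentId/versions/:latest/files",
        params={"persistentId": dataset_pid},
        headers={"X-Dataverse-key": api_token},
    )
    resp.raise_for_status()
    points = {}
    for entry in resp.json()["data"]:
        datafile = entry["dataFile"]
        storage_id = datafile.get("storageIdentifier", "")
        if storage_id.startswith("s3://"):
            bucket, _, key = storage_id[len("s3://"):].partition(":")
            points[datafile["filename"]] = ("s3", bucket, key)
        else:
            # Block/file storage would be resolved here; how depends on
            # the volume mounted into the pod.
            points[datafile["filename"]] = ("file", None, storage_id)
    return points
```

Something along these lines, cleaned up and generalized, is what could eventually be proposed for inclusion in pyDataverse.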

We will likely break this down into more granular sub-components and open child issues for the individual tasks.
