
Rserve integration

Open poikilotherm opened this issue 6 years ago • 15 comments

Some ingest functionality does not work without an Rserve server.

Looks like https://github.com/ubc/r-docker is a trustworthy image, coming from University of British Columbia.

Maybe open an issue over there asking what their plans are on supporting and pushing updated images to Docker Hub: https://hub.docker.com/r/ubcctlt/rserve

poikilotherm avatar Nov 04 '19 09:11 poikilotherm

We've integrated Rserve into the Dataverse Docker module. I don't know if you want to host a separate Docker image for that: https://github.com/IQSS/dataverse-docker/commit/973cc9c633952e7600715c56f550985814dcf69e

4tikhonov avatar Nov 04 '19 09:11 4tikhonov

IMHO this should be kept apart. I do believe in the UNIX philosophy "do one thing, do it well". This gives more flexibility to people who might want to run their own services, use special flavors, install a particular set of packages, ...

poikilotherm avatar Nov 04 '19 09:11 poikilotherm

Ok, you should contact people from Rserve then.

4tikhonov avatar Nov 04 '19 09:11 4tikhonov

If it helps, I've been happily using Rserve on Dataverse spun up by dataverse-ansible since @donsizemore implemented it over the summer: https://github.com/IQSS/dataverse-ansible/pull/87

Data Explorer didn't work properly without it. It takes time to compile all the R modules so I sometimes comment it out if I don't need the functionality.

pdurbin avatar Nov 04 '19 11:11 pdurbin

https://github.com/IQSS/dataverse-ansible/blob/e09ea347aed27a0e5253d94f3818e3381da8db1d/tasks/rserve.yml#L19-L23 definitely helps :smile:

poikilotherm avatar Nov 04 '19 11:11 poikilotherm

> It takes time to compile all the R modules so I sometimes comment it out if I don't need the functionality.

@pdurbin you may also set rserve.install to false =) the role will still place rserve.host et al. in domain.xml to talk to an external R service.

donsizemore avatar Nov 04 '19 11:11 donsizemore

@donsizemore, at the same time, it's not really sustainable for Dataverse to rely on an external R service for its data processing.

4tikhonov avatar Nov 04 '19 12:11 4tikhonov

On a related note, we've considered splitting the "ingest" service out of the Dataverse monolith and into its own microservice: https://github.com/IQSS/dataverse/issues/2331

Not all installations of Dataverse want ingest (I'm thinking of Pete's structural biology datasets) but I suspect most do. 😄

pdurbin avatar Nov 04 '19 12:11 pdurbin

@4tikhonov note that Akio's TRSA branch https://github.com/OdumInstitute/trsa-web/tree/jee8line carves ingest out of Dataverse proper and at present makes it optional to the end user. what would you prefer Dataverse use in addition to or instead of R?

donsizemore avatar Nov 04 '19 12:11 donsizemore

I'd really love to discuss this matter in more depth, but I'm pretty sure this is beyond the scope of this issue.

Maybe some of you guys can open an issue at IQSS/dataverse, so it reaches even more people interested in ingest?

poikilotherm avatar Nov 04 '19 13:11 poikilotherm

@pdurbin : Regarding the R script that runs on Rserve and produces metadata summaries:

  • We now have an updated version that is a Python library, which removes the R dependency.
  • @aaron-lebo who works with @vjdorazio has done a lot of work with it--including analyzing all of the tabular files in the Journal of Politics Dataverse.
  • It is available as a pypi package.
  • Documentation on the JSON output is here: https://tworavens.github.io/TwoRavens/Metadata/
  • We're happy to provide more info on it and invite input on adding useful documentation
    • We wrapped it in a web service a while ago (Django/celery), but for Dataverse purposes, this could be greatly simplified--a basic endpoint with Flask or something in a Docker container

  • Regarding using the output as a drop-in replacement for the current Dataverse R script: the JSON has additional data and a slightly different structure. If there's interest, we can include an output flag or function that emits the older format.
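
As a rough illustration of the "basic endpoint" idea in the list above, here is a minimal sketch using only the Python standard library. It is not the TwoRavens library or its API: `summarize_csv`, `app`, and the output keys (`variables`, `type`, `count`, `mean`, ...) are hypothetical names chosen for this example, and the real metadata JSON is documented at the link above.

```python
import csv
import io
import json
import statistics


def summarize_csv(text):
    """Return per-column summaries for CSV text (hypothetical output shape)."""
    reader = csv.DictReader(io.StringIO(text))
    columns = {name: [] for name in (reader.fieldnames or [])}
    for row in reader:
        for name in columns:
            columns[name].append(row[name])

    variables = {}
    for name, values in columns.items():
        try:
            numbers = [float(v) for v in values]
        except (TypeError, ValueError):
            numbers = None  # at least one value is missing or non-numeric
        if numbers:
            variables[name] = {
                "type": "numeric",
                "count": len(numbers),
                "mean": statistics.mean(numbers),
                "min": min(numbers),
                "max": max(numbers),
            }
        else:
            variables[name] = {
                "type": "character",
                "count": len(values),
                "unique": len(set(values)),
            }
    return {"variables": variables}


def app(environ, start_response):
    """WSGI endpoint: send CSV in the request body, receive a JSON summary."""
    length = int(environ.get("CONTENT_LENGTH") or 0)
    body = environ["wsgi.input"].read(length).decode("utf-8")
    payload = json.dumps(summarize_csv(body)).encode("utf-8")
    start_response(
        "200 OK",
        [("Content-Type", "application/json"),
         ("Content-Length", str(len(payload)))],
    )
    return [payload]
```

To try it locally, the standard library's `wsgiref.simple_server.make_server("", 8000, app).serve_forever()` will serve it (the port is arbitrary). A container would only need to wrap that call, which keeps the service separate from Dataverse itself.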

cc/ @tercer

raprasad avatar Nov 06 '19 12:11 raprasad

@raprasad, I really like this solution as a Python microservice. Not because we're "at home" with Python, but because it can be more sustainable in the long term.

4tikhonov avatar Nov 06 '19 12:11 4tikhonov

@raprasad wonderful news! Go @aaron-lebo go!

donsizemore avatar Nov 06 '19 13:11 donsizemore

> a slightly different structure

@raprasad is the JSON emitted from your new Python code backward compatible with the JSON emitted from the old/current R code? If not, would it be possible to make it backward compatible? I don't want Data Explorer (my main reason for wanting this JSON) to break if we switch to backward-incompatible JSON produced by new code.

Now that we (finally) have API tests running automatically on "develop" and pull requests (on https://jenkins.dataverse.org thanks to the absolutely heroic efforts of @donsizemore !!! 🎉 🎉 🎉 ), we could start to make assertions on the old/current JSON format so that any backward incompatibilities would be detected. Writing those assertions might be a good first small chunk. If someone wants to create an issue about this at https://github.com/IQSS/dataverse/issues please go ahead! 😄
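
A minimal sketch of what such a backward-compatibility assertion could look like, written in Python for illustration rather than in the project's own test framework. The helper `missing_paths` and the sample keys (`variables`, `mean`, `median`) are hypothetical; the real check would use a JSON sample captured from the current R output.

```python
def missing_paths(old, new, prefix=""):
    """List key paths that exist in `old` but are absent from `new`.

    Extra keys in `new` are allowed; only removals or container type
    changes count as backward-incompatible. Scalar values are not compared.
    """
    problems = []
    if isinstance(old, dict):
        if not isinstance(new, dict):
            return [prefix or "<root>"]
        for key, old_value in old.items():
            path = f"{prefix}.{key}" if prefix else key
            if key not in new:
                problems.append(path)
            else:
                problems.extend(missing_paths(old_value, new[key], path))
    elif isinstance(old, list):
        if not isinstance(new, list):
            return [prefix or "<root>"]
        # Assumption: list items share one shape, so compare the first of each.
        if old and new:
            problems.extend(missing_paths(old[0], new[0], f"{prefix}[0]"))
    return problems


# Hypothetical samples standing in for the old (R) and new (Python) output.
old_json = {"variables": {"age": {"mean": 35.0, "median": 35.0}}}
new_json = {"variables": {"age": {"mean": 35.0, "extra": 1}}, "self": {}}

print(missing_paths(old_json, new_json))
```

Here the new output dropped `median`, so the helper reports `variables.age.median`; an API test could assert that the returned list is empty.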

pdurbin avatar Nov 06 '19 14:11 pdurbin

@pdurbin We will add the backward compatibility to the library. Please add any comments that may be relevant: https://github.com/TwoRavens/raven-metadata-service/issues/205

raprasad avatar Nov 08 '19 17:11 raprasad