renku-python
Allow exporting of whole renku projects
Currently we only allow exporting datasets to external providers. It'd be really useful if a user could export a whole renku project, with datasets, code files etc., for instance for publication.
https://researchobject.github.io/ro-crate/ would be a nice format for exporting a snapshot of a project, but ideally we'd want to export the whole project including the lineage in some form. This would likely take the form of combining ro-crate with the renku-ontology in some way.
We probably wouldn't want to export the whole git history, so the lineage export would just be for provenance and explainability; it wouldn't allow an exported project to be imported and its workflows re-executed (though we could envision something like that in the future, once we're less dependent on git).
thanks Ralf!
I think, as a minimum starting point: in my mind, the only differences right now between exporting the whole project as a zip and turning it into some standard format like ro-crate are:
- providing metadata (e.g. title, authors, etc.) that is expected by the data store along with the zipped-up repo (a rough sketch of this follows after the list)
- providing a way to parse a renku project in a data store back into a renku project, e.g. take the zip as-is (it should be expected to be a zipped-up renku project) and put the metadata from the data store wherever it belongs
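Concretely, the zip-plus-metadata option could be as simple as the following sketch; the helper and the `export-metadata.json` sidecar name are illustrative, not an existing renku-python API:

```python
import json
import zipfile
from pathlib import Path

def export_project_zip(project_dir: str, out_path: str, metadata: dict) -> None:
    """Zip up a renku project and bundle the metadata a data store expects."""
    project = Path(project_dir)
    out = Path(out_path).resolve()
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in project.rglob("*"):
            # Skip the archive itself in case it is created inside the project.
            if path.is_file() and path.resolve() != out:
                zf.write(path, path.relative_to(project))
        # Sidecar carrying the fields the data store expects (title, authors, ...).
        zf.writestr("export-metadata.json", json.dumps(metadata, indent=2))

export_project_zip(
    ".", "project.zip",
    {"title": "My project", "authors": ["Jane Doe"],
     "renku_project_url": "https://renkulab.io/projects/jane/my-project"},
)
```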
More from the users: https://renku.discourse.group/t/is-it-possible-to-publish-code-through-renku/59/7
From the design meeting:
- first pass: export as a zip w/some of the renku metadata to populate the fields (title, authors, etc.) & include link to the renku project url so that interested parties can fork from there or be added to the project
- discussed ro-crate as a second pass
- tabled "importing code into project from DOI" discussion in favor of instructing people to visit the project on renkulab to make use of it
https://sdsc.atlassian.net/wiki/spaces/RENKU/pages/509018119/2020-06-11+Design+Meeting+-+Exporting+Publishing+Renku+Projects
An addition to above: exporting a tarball of the docker image that can be used later to recreate the runtime needed for the project. See here for a discussion.
To reiterate a side comment here about publishing/exporting. From Whole Tale's side, we spoke with @mbjones from DataONE about potential ways to publish docker images to DataONE. Unfortunately, docker images can be quite large and, as a whole, quite inefficient to store.
I think the closest we came was an idea where we take advantage of two things:
- Docker's ability to layer images
- DataONE's ability to reference a data object across multiple packages (i.e. a single data file can be referenced by both Package A and Package B).
The idea is to upload base image layers a single time. Let's say one for Ubuntu 18 and one for Ubuntu 20. Then, upload another layer that contains RStudio 4.0, another layer for RStudio 3.0, and another for Jupyter x.y.z. When users go to upload their docker image, they should only upload the layers that are unique to it. The base layers are then referenced to reconstruct the image. This removes redundant uploads of large base images.
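A sketch of that layer-deduplication idea, assuming the standard `docker save` tarball layout (the known-layer set stands in for whatever index the repository would keep):

```python
import hashlib
import json
import tarfile

# Digests of layers the repository already stores (e.g. shared Ubuntu base
# layers); in practice this would come from the data store's layer index.
KNOWN_BASE_LAYERS: set = set()

def unique_layers(image_tar: str):
    """Yield (layer_path, digest) for layers of a `docker save` tarball
    that the repository does not already have."""
    with tarfile.open(image_tar) as tar:
        # `docker save` writes a manifest.json listing each image's layer tars.
        manifest = json.load(tar.extractfile("manifest.json"))
        for layer_path in manifest[0]["Layers"]:
            # Read-and-hash is fine for a sketch; stream the hash in practice.
            data = tar.extractfile(layer_path).read()
            digest = "sha256:" + hashlib.sha256(data).hexdigest()
            if digest not in KNOWN_BASE_LAYERS:
                yield layer_path, digest

for path, digest in unique_layers("my-image.tar"):
    print(f"would upload {path} ({digest})")
```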
In Whole Tale, we ended up running our own registry and said that wherever the Whole Tale instance is deployed, it should persist until the end of time. For example, DataONE could run an instance of Whole Tale, and because DataONE will be around for a very long time, the images should be safe.
I think the ideal solution is one where data repositories/journals host their own registries and images are pushed there; however, this hasn't quite caught on yet :)
The Whole Tale project had to deal with the issue of exporting 'Tales' to disk, which was a big task that required looking at what kind of information is useful. You may find our approach useful for inspiration in your own. Note that you may want to have the ability to import exported projects; this of course affects what kind of information is exported with the project. Whole Tale stores 'Tale' information in an RDF knowledge graph and includes a run.sh file that spins up a docker image and mounts the local data inside. If you want to see more about how it works, we've summarized it here.
There are a number of specifications that could be used for inspiration. RO-Crate seems to be a popular RDF/JSON-LD serialization. schema.org is of course the latest and greatest ontology for describing things in RDF; science-on-schema.org is trying to adopt guidelines for using schema.org to describe science objects. Nailing every piece of metadata down to an ontology was definitely time-consuming, and there are still unknowns (for example, there's no agreed-upon way of describing dockerfiles with a controlled vocabulary). If you have any questions or are thinking about taking one of these routes for exporting, I'd be happy to help out with some of the work!
Thanks for the insight @ThomasThelen and for bringing the conversation to a more reasonable place! :)
The idea of only sending unique layers seems like you would just be re-implementing what a docker registry does very well already. Unless you work in a very controlled environment, ensuring that everyone is using exactly the same base layers of some kind is probably impossible.
I like the idea of bundling a small script that would basically make the project "self-inflate" after fetching it from an archive or some long-term repository like zenodo or dataverse. That could include turning the tarball of the docker image into an actual image and adding it locally to docker so it's ready to go. I think for Renku the use-case is definitely very much that we would want to be able to import/retrieve exported projects and make them functional.
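A bundled "self-inflate" helper along those lines could be as small as this sketch (file names and the image tag are illustrative; the tag is assumed to match what was saved in the tarball):

```python
#!/usr/bin/env python3
"""Self-inflate an exported project: restore the archived docker image
and start it with the project files mounted inside."""
import os
import subprocess

IMAGE_TARBALL = "environment/image.tar"  # created earlier with `docker save`
IMAGE_TAG = "exported-project:latest"    # tag stored in the tarball's RepoTags

# Load the archived image into the local docker daemon...
subprocess.run(["docker", "load", "--input", IMAGE_TARBALL], check=True)

# ...then start an interactive session with the project mounted at /work.
subprocess.run(
    ["docker", "run", "--rm", "-it",
     "-v", f"{os.getcwd()}:/work", "-w", "/work", IMAGE_TAG],
    check=True,
)
```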
We have indeed considered RO-Crate (#1328) because the vocabularies they use are very close to what we have in Renku already. The exception is workflow specifications - we ended up adding a small set of terms ourselves because none of the existing ontologies we found really suited our needs.
One question that comes to mind with RO-Crate is whether using that exact spec is necessary if we are already using JSON-LD as our main method of storing metadata anyway. The metadata they deem useful is almost certainly not going to be entirely sufficient and will need to be extended. It also makes certain assumptions about audience, for instance, using the bioschemas workflow spec. I could see the utility in conforming to RO-Crate if there was tooling around it, either user-facing or on the side of data repositories. We're planning on saving the whole KG for each project as flattened JSON-LD already anyway - so is that enough, or do we get some extra perks if we choose to conform to RO-Crate? We've gone through the trouble of figuring out a way to represent everything in the platform as RDF, and in principle any system built to consume semantic metadata should be able to work with it (since it's mostly schema.org + PROV-O + some extras, not unlike RO-Crate...). I'm wondering if you had similar considerations with WT?
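For reference, the boilerplate RO-Crate adds on top of plain flattened JSON-LD is fairly thin; a skeleton `ro-crate-metadata.json` per the RO-Crate 1.1 spec looks roughly like this (values are illustrative):

```python
import json

crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {
            # The metadata file describes the crate rooted at "./".
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {
            # The root dataset; project-level fields go here.
            "@id": "./",
            "@type": "Dataset",
            "name": "My renku project",
            "author": {"@id": "https://orcid.org/0000-0000-0000-0000"},
        },
    ],
}

with open("ro-crate-metadata.json", "w") as f:
    json.dump(crate, f, indent=2)
```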
Dockerfile descriptions are also problematic, as you say - I've seen some chatter on the schema.org github issues about this and some efforts in the OCI world to standardize this but I'm not sure how far it's gone. In Renku we would ideally be able to include computational environments as top-level entities that one could search for and (re)use just like data and code.
> The idea of only sending unique layers seems like you would just be re-implementing what a docker registry does very well already
Completely valid point.
I think at the end of the day we were trying to conform to a specification that already existed. We weren't able to find a perfect solution. BagIt-RO seemed like the best fit for us because we wanted something that tied the metadata+filesystem format together, and we decided early on that we'd be serving bags. This ended up working nicely when we brought BDBag in, which gave us the ability to efficiently transfer large Tales while keeping compatibility. We considered what was then called RO-Lite (rebranded to RO-Crate later) but I don't think we saw much utility in refactoring. It also looks like they've deprecated much of the RO stack between the time we implemented it and now, along with all of the tooling built around it. I think one of the biggest hurdles is the way they represent provenance. Right now I don't see any repositories supporting provenance in schema.org, which RO-Crate requires. DataONE supports ProvONE and will render a graphical representation. I believe Dataverse also requests that the provenance be in PROV/ProvONE. In summary, I would say that using RO-Crate isn't necessary and might be a pain point given how new it is, its focus on schema.org, and the lack of support from data repositories. If I were to re-do Tale exporting in a vacuum, it would most likely be a BagIt format with a JSON-LD file using schema.org + reasonable extensions.
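For concreteness, that BagIt-plus-JSON-LD approach could look roughly like this with the bagit-python library (directory names and metadata values are illustrative):

```python
import json
import shutil

import bagit  # pip install bagit

# Stage the project payload and drop a schema.org JSON-LD description
# next to it before bagging.
shutil.copytree("my-project", "export")
with open("export/metadata.jsonld", "w") as f:
    json.dump(
        {"@context": "https://schema.org/", "@type": "Dataset",
         "name": "My project"},
        f, indent=2,
    )

# make_bag() converts the directory in place: the payload moves to data/
# and checksum manifests plus bag-info.txt are written.
bag = bagit.make_bag("export", {"Source-Organization": "Renku"})
print(bag.is_valid())
```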
There's an RO-Crate issue centered around describing Dockerfiles; you may find some inspiration there. Whole Tale is definitely open to describing containers using some sort of standardized format; however, it's fairly low priority at this point and we'd additionally have to convert our repo2docker config files to RDF.
tabled "importing code into project from DOI" discussion in favor of instructing people to visit the project on renkulab to make use of it
I'd like to point out that another way of doing this (when it's un-tabled) is third-party integration with Zenodo. For example, here is a Tale published to Zenodo, which has the text "Run this Tale on Whole Tale by clicking here." Clicking will bring the user to Whole Tale and import the Zenodo artifact as a Tale. The same integration is done with DataONE (see the "Analyze" dropdown menu here). I believe Zenodo has this capability built in; I had to issue a PR to DataONE with hardcoded Whole Tale addresses, so Zenodo is a bit easier to integrate with. This would support the use case where someone views a published Renku project and can easily re-open it in the system to run it.
I completely agree with you on provenance being the weak link of any of these specs. We've struggled a lot with that, because there is always something slightly off with the vocabularies. I really like ProvONE - it would fit our current model very well and is basically most of the way there, since it's just an extension of PROV-O, which is what we started with. But what we want to say in our provenance description is simply "here is an execution step, and it had these inputs and these outputs, and it ran this code with these parameters". ProvONE instead gives me "ports" and "channels" and "controllers". We could of course bend the language to fit, but then we'd have to explain to everyone what our implicit mapping is all the time. We went through the same with wf4ever and have mostly dropped it as well now, for similar reasons and because it's just too complicated for what we need. Now we use a simple custom vocabulary for things that we couldn't quite get to fit (see here if you're interested).
Thanks for linking the example WT zenodo integration, that looks really nice and simple from the user's PoV. For us, the additional complication here is that each Renku project has a full git history so we have to make a choice whether it's "exporting/archiving" or just a snapshot of the state (one commit). Presently our tools for dealing with the provenance are pretty tied to the git history so the snapshot would have limited usability - but we're working on decoupling the KG from git commits.
I'd love to discuss your work with provenance and the Renku ontology (I don't want to derail this issue) at some point; Whole Tale is about to start work on a provenance-centric feature and things like this ontology could come in handy (I think we were planning on creating an ontology with parallel goals). If you're interested, I can ping you when we have something more concrete; I'd like to see if there's any way we can leverage the ontology or at least see if there's some sort of isomorphic representation (it would be great to be able to download a Renku project and Tale from dataverse and query them similarly). It could also provide inspiration/guidelines for future projects.
> each Renku project has a full git history so we have to make a choice whether it's "exporting/archiving" or just a snapshot of the state (one commit)
We actually just had to address this issue a few weeks ago (there are a few open pull requests for allowing support for cloning a git repo into a Tale; an important distinction is that Tales aren't git repositories and we don't support relations between commits). I'm really looking forward to seeing how Renku handles this!!
We'd definitely be interested in chatting with you about this because our thinking is still evolving so having additional input is always welcome! We would be very happy if someone else found the ontology useful - our impression is that there is a really big gap here currently if you want to use RDF for practical (computational) provenance tracking, but it could be we are making the wrong assumptions or not looking in the right places. @Panaetius (who started this issue) has been doing most of the heavy lifting on that front.
The issue with all provenance ontologies that I have seen is pretty well illustrated with the PROV-O Plan entity, which has the description: "A plan is an entity that represents a set of actions or steps intended by one or more agents to achieve some goals."
All the ontologies are (intentionally) vague about what this plan actually is, how/where it is stored and how it could be reproduced. Usually the intention is that this vague node gets subclassed into something less vague, but there aren't any less vague ontologies that actually do this. So most solutions just end up pointing to a script or executable and leave it at that.
The problem with this is that you now have the provenance of overall inputs and outputs that were used/generated by some vague plan, but you really don't know (in the graph) what happened, what intermediary or non-file parameters were used, nor whether the current state of the plan is what was originally run. This makes the plan almost useless for provenance tracking: you end up with "these files were somehow used to create these files at this point in time" instead of "these files and parameters were used by this sequence of actions at this point in time, with these intermediary results, to generate these outputs". And you'd want to know that without having to leave the graph.
Of course this is on a spectrum: on the one end you have the vague Plan concept; on the other end you'd have something like a full python ontology that can represent a python script with all variables and functions in graph form, duplicating the script as an RDF graph. The former is too vague, the latter too fine-grained, in my opinion.
Hence we've tried to come up with a subclass of Plan that can be composed of multiple steps with input/output files and parameters, as well as the command that was executed, with very limited support for also representing the internals of the command that was run. This should strike a middle ground that gives enough context and information for provenance purposes without littering the graph with info about every single computation done by executables.
Our design goal is to track everything that a user does at the shell level, and to give the user the tools to track additional information about internals as they see fit, as the user is most suited to decide what internals are important for provenance.
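A small rdflib sketch of that level of granularity; the `EX` terms are stand-ins for illustration (not the actual renku ontology), and the PROV plan linkage is collapsed for brevity:

```python
from rdflib import RDF, Graph, Literal, Namespace

PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("https://example.org/renku-like#")  # stand-in vocabulary

g = Graph()
g.bind("prov", PROV)

plan, run = EX["plans/1"], EX["runs/1"]

# The plan: one step with its command, files, and parameters; nothing finer.
g.add((plan, RDF.type, PROV.Plan))
g.add((plan, EX.command, Literal("python train.py --epochs 10")))
g.add((plan, EX.hasInput, EX["files/data.csv"]))
g.add((plan, EX.hasOutput, EX["files/model.pkl"]))
g.add((plan, EX.hasParameter, Literal("--epochs=10")))

# The execution that followed the plan (PROV proper links a plan through a
# qualified prov:Association node, elided here).
g.add((run, RDF.type, PROV.Activity))
g.add((run, EX.followedPlan, plan))

print(g.serialize(format="turtle"))
```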
@Panaetius You very nicely described the state of affairs with provenance that led us to create ProvONE: wanting to track inputs and outputs for specific execution events, while enabling a much more detailed exposition of the internals of execution workflows if needed. In reality, in our production systems, we generally only label the inputs, outputs, and execution of scripts/executables, and the internal structure mostly goes undocumented (i.e., we rarely use Ports, Channels, etc.). This is partially because it is really hard not to immediately drop into a very low-level representation, which means you are essentially just creating a programming-language abstraction all over again. Nevertheless, there has been a ton of work on representing internals in provenance for workflows, and much of what is currently being discussed recapitulates the discussions of 10 years ago around the Open Provenance Model (OPM) and scientific workflow systems. PROV-O keeps many of those concepts, but is a bit more general. So we are sticking with the higher-level description from ProvONE in DataONE for now, hoping to at least be able to trace the data-flow provenance.
As an aside, if you're interested in the internals, @ludaesch and @tmcphillips put together the YesWorkflow system for documenting internal provenance structure (https://github.com/yesworkflow-org/yw-prototypes). The paper they wrote on it has a nice overview of the issues in retrospective and prospective provenance, and provides good refs to the prior work on this in the scientific workflow community:
T. McPhillips, T. Song, T. Kolisnik, S. Aulenbach, K. Belhajjame, R.K. Bocinsky, Y. Cao, J. Cheney, F. Chirigati, S. Dey, J. Freire, C. Jones, J. Hanken, K.W. Kintigh, T.A. Kohler, D. Koop, J.A. Macklin, P. Missier, M. Schildhauer, C. Schwalm, Y. Wei, M. Bieda, B. Ludäscher (2015). YesWorkflow: A User-Oriented, Language-Independent Tool for Recovering Workflow Information from Scripts. International Journal of Digital Curation 10, 298-313. DOI: https://doi.org/10.2218/ijdc.v10i1.370
We've worked at making ProvONE and YesWorkflow work together, but it's still a hard adoption curve. I think @ludaesch and @tmcphillips are pursuing the idea of effective provenance for researchers further still in new projects.
I'll also mention the integration of C2Metadata SDTL with ProvONE (again, I don't want to pollute this issue if there's a better place for this discussion). The C2Metadata project takes source code and provides detailed metadata about what happened in the script. For example, here is SDTL describing an R script that loads a csv file and then saves it. Note that it has the ability to track intermediate variables, such as data frames (terms and conditions apply here). I'm currently working on meshing this SDTL representation with ProvONE, so that you not only get a high-level overview of which scripts were used/produced, but also the ability to query the SDTL for more information about the actual commands that were used to generate said files. It's still in beta, and we're currently hashing out a few issues, but here's a sneak peek of what it looks like so far.
Whole Tale will be using ReproZip to produce the higher-level metadata. In terms of a "Run" of a Tale, ReproZip provides high-level information that can be represented as a provenance graph over the files involved.
As you've mentioned, there's also information between the files that's missing there. ReproZip provides only a basic level of information about the interaction between files. But what about the commands that `wt:file/2` used to generate `wt:file/3`? SDTL allows us to go one step into each file, tracking the provenance of variables so that you can ask things like "which lines of code contributed to the writing of file X?". This is unfortunately one of the simplest beta examples of combining SDTL and ProvONE together.
To me, the two/three programs (ReproZip & C2Metadata/YesWorkflow) create a bridge to "these files and parameters were used by this sequence of actions at this point in time, with these intermediary results, to generate these outputs". I think I see some similarities between the Renku ontology and the goals of C2Metadata; if I understand it correctly, both are describing command-level executions.
It's important to keep in mind that the problem of tools capturing too much (i.e. recording every syscall) or too little is still an issue, and that there isn't a 100% suitable tool out there, although that probably depends on your application (maybe someone wants to track every syscall when an R script is run). We were able to work this model into Whole Tale to fit our use cases; to be clear, I'm not suggesting adopting any of this, just that you may find it useful food for thought.
Hi @mbjones thanks for chiming in! The YesWorkflow project seems like a really neat idea - we've been discussing doing something like that with renku for the cases where users wanted to push more information about the code into the knowledge graph.
Just to be clear - we are not, at the moment, capturing "internals" in the sense of what happens inside an individual piece of code (nothing as detailed as what @ThomasThelen describes). Renku is concerned with higher-order provenance at the level of file inputs/outputs and code executions, but by default we treat the code executions as black boxes. Still, we do the provenance capture automatically, not just for the purpose of documenting the data flow but to enable reproducibility. For that, we need to capture enough additional information on top of just "input" and "output" (and potentially other parameters) to reconstruct the command-line invocation. It wasn't clear to me how the ProvONE ontology could be used in this case - do you maybe have some examples you could point me to?
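To illustrate that black-box level of capture, here is a hypothetical sketch (not the renku-python implementation) of recording just enough to reconstruct and verify a single invocation:

```python
import hashlib
import json
import subprocess
import sys
from datetime import datetime, timezone
from pathlib import Path

def sha256(path: str) -> str:
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def tracked_run(cmd: list, inputs: list, outputs: list) -> dict:
    """Run a command as a black box, recording the exact invocation plus
    input/output checksums so the run can be replayed and checked."""
    record = {
        "command": cmd,
        "started_at": datetime.now(timezone.utc).isoformat(),
        "inputs": {p: sha256(p) for p in inputs},
    }
    subprocess.run(cmd, check=True)
    record["outputs"] = {p: sha256(p) for p in outputs}
    return record

record = tracked_run(
    [sys.executable, "train.py", "--epochs", "10"],
    inputs=["data.csv"], outputs=["model.pkl"],
)
print(json.dumps(record, indent=2))
```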
That said, we do envision plenty of situations where having more fine-grained metadata about what happens inside the black boxes would be useful. For this, we allow plugins to expose more metadata about a given process in the form of annotations - we are currently building up such a plugin for ML-specific tasks (https://github.com/ratschlab/renku-mls). One interesting option might be to use something like YesWorkflow as such a plugin.
I just came across this again, as I was quite unhappy to see how zenodo displays the contents of entire renku projects, e.g. https://zenodo.org/records/8213124. The 'Files' section displays all the meaningless hidden and renku files before even getting to the README.md. I don't know if we need to solve all the issues discussed above before creating a way to publish an entire renku project on zenodo in a more meaningful way. All it would take is to populate the 'Description' field with the contents of README.md and add additional text providing the link to the original renkulab.io project and a link to the instructions for how to recreate a renkulab project locally. As a start, we could include the whole history, so that we ensure that the project is usable as intended. Why not, except for size issues?
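Populating the description at deposit time looks straightforward with Zenodo's deposit REST API; a rough sketch (token, URLs, and metadata values are placeholders, and the README markdown would need rendering to HTML first):

```python
import requests

readme = open("README.md").read()
project_url = "https://renkulab.io/projects/user/my-project"  # placeholder
howto_url = "https://renku.readthedocs.io/"                   # placeholder

description = (
    f"{readme}\n\n"
    f"Original renku project: {project_url}\n"
    f"How to recreate this project locally: {howto_url}"
)

# Create a deposit with the description pre-populated from the README.
resp = requests.post(
    "https://zenodo.org/api/deposit/depositions",
    params={"access_token": "YOUR_TOKEN"},  # placeholder
    json={"metadata": {
        "title": "My renku project",
        "upload_type": "software",
        "description": description,
        "creators": [{"name": "Doe, Jane"}],
    }},
)
resp.raise_for_status()
print(resp.json()["links"]["html"])
```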