
Add IPFS content provider (InterPlanetary File System)

Open d70-t opened this pull request 3 years ago • 5 comments

This PR adds an IPFS content provider (see #1096).

The following builds the requirements.txt example via IPFS:

jupyter-repo2docker QmPjPUTcXeiEdNUMEPusP4rnJNz2YPw1XrYQkp43C96DyS 

Still open: one likely wants an option to configure the list of IPFS gateways to try, e.g. via an environment variable?

d70-t avatar Nov 26 '21 17:11 d70-t

Thanks for submitting your first pull request! You are awesome! :hugs:
If you haven't done so already, check out Jupyter's Code of Conduct. Also, please make sure you followed the pull request template, as this will help us review your contribution more quickly. You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! :wave:
Welcome to the Jupyter community! :tada:

welcome[bot] avatar Nov 26 '21 17:11 welcome[bot]

Tests are now passing, including a test which actually creates a Docker image from IPFS.

There's still the open question of how the list of IPFS gateways to try should be customizable. In ipfsspec, the environment variable IPFSSPEC_GATEWAYS can be used to change the list of gateways. I'd imagine that one likely wants to specify a local gateway if the datacenter running repo2docker has one (or several) running there. What would be a preferable way to handle this in repo2docker? (A sketch of one option follows the list below.)

  • Would we hijack / reuse this variable?
  • Would we prefer a new one (e.g. REPO2DOCKER_IPFS_GATEWAY)?
  • Something else?
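To make the options concrete, here is a minimal sketch of the REPO2DOCKER_IPFS_GATEWAY variant, falling back to ipfsspec's variable and then to built-in defaults. The variable name, precedence order, and whitespace-separated format are all assumptions up for discussion, not an agreed interface:

import os

# Public gateways used when nothing is configured (placeholder list).
DEFAULT_GATEWAYS = ["https://ipfs.io", "https://dweb.link"]

def get_gateways():
    # Prefer a repo2docker-specific variable, then reuse ipfsspec's,
    # then fall back to the defaults; gateways are assumed to be
    # whitespace-separated within the variable.
    for var in ("REPO2DOCKER_IPFS_GATEWAY", "IPFSSPEC_GATEWAYS"):
        value = os.environ.get(var)
        if value:
            return value.split()
    return DEFAULT_GATEWAYS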

@yuvipanda what do you think?

d70-t avatar Nov 26 '21 20:11 d70-t

A couple of technical issues:

  • repo2docker needs to be usable by novice users, so expecting users to configure or even understand gateways is probably too complicated
  • Just using a bare alphanumeric CID string is too generic: it could coincide with the name of a local directory, or some future provider that also uses a hash could be invented and cause confusion. Does IPFS have a standard URL for referencing objects?

A more general issue, which ties in with @betatim's comment on https://github.com/jupyterhub/mybinder.org-deploy/issues/2082#issuecomment-980051998: how widespread is IPFS usage for publishing or referencing code and data repositories? It's clear that IPFS can support reproducibility, but so can many other tools, and repo2docker can't support all of them. Do you have any evidence for how many researchers use it, and what its unique advantages are over other content providers?

manics avatar Nov 29 '21 22:11 manics

Thanks @manics for the comments!

For the first point: yes, that's also my concern. I believe that IPFS (or any other content-addressable storage system) can be a very useful tool in science, but it is incredibly hard to get people on board initially, as there are several new concepts involved, creating a large initial burden. My take on the gateway issue would be to offer a fairly large list of public gateways by default, which should just work for anyone, and to offer a customization point for people who want to improve performance. (The use of public gateways is easier in the case of this PR than for data, as code tends to be smaller and the entire thing can be downloaded with a single request, strongly reducing the pressure put on the gateways.)
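As an illustration of the "just works by default" idea, a rough sketch of trying public gateways in order until one responds. The https://<gateway>/ipfs/<cid>/<path> URL layout is the standard gateway convention; the function itself is only an assumption about how the provider could behave:

import requests

def fetch_via_gateways(cid, path, gateways):
    # Content addressing makes gateway fallback safe: every gateway
    # must serve byte-identical content for the same CID.
    for gateway in gateways:
        url = f"{gateway}/ipfs/{cid}/{path}"
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            return response.content
        except requests.RequestException:
            continue  # this gateway failed, try the next one
    raise IOError(f"no gateway could serve {cid}/{path}")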

The second point also came to my mind after writing this. There's the ipfs:// protocol (also listed at IANA), which would be suited for this (and which is also used by ipfsspec). I can modify the PR to require this protocol prefix.
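Requiring the prefix would also make detection unambiguous. A minimal sketch of the idea, not the actual ContentProvider code in this PR:

def detect_ipfs(spec):
    # Only specs with an explicit ipfs:// prefix are accepted, so a
    # bare hash can no longer be confused with a local directory name.
    prefix = "ipfs://"
    if not spec.startswith(prefix):
        return None
    return {"cid": spec[len(prefix):].rstrip("/")}

With this, ipfs://QmPjPUTcXeiEdNUMEPusP4rnJNz2YPw1XrYQkp43C96DyS would be handled by the IPFS provider, while a bare hash would fall through to the other providers.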

The third point is of course the hardest one, and maybe also a bit of a chicken-and-egg problem (see the gateway issue, for example). Upfront: I believe that as of now there are not many researchers working with data on IPFS, but I hope this will change soon, and repo2docker / binder would be great accelerators. The reason I started investing time into IPFS is that we've got a lot of data from a recent field campaign (https://eurec4a.eu/), which unfortunately is still distributed across the world, partly in very inaccessible places, and which should be made accessible for collaborative investigation. To do so, we figured that a basic requirement would be the availability of a simple-looking function which works for all the data:

ds = get_dataset("some id")

The main goal of this function is that one can write an analysis script and the script "just works" on any coworker's computer. Thus, the function should work on any computer at any time, even when the primary server is offline, and without changing the identifier (over a long timespan). It should be fast at least if the data is close by (possibly on a local disk). And the result of the function must not change over time, because otherwise my analysis wouldn't be reproducible at all. Another point which we have to (and want to) deal with is the possibility of having copies of the data at our (and our collaborators') datacenters. This is partly in order to have better performance and redundancy, and partly for political reasons (some data must be held in certain countries or at certain institutions). For the larger datasets, we also need the possibility of efficient subsetting without downloading (in particular in the context of demonstration scripts), which is often not available at scientific data repositories.
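For reference, a sketch of how get_dataset could look on top of IPFS, assuming zarr-formatted datasets and ipfsspec (which plugs the ipfs protocol into fsspec); this illustrates the requirement, it is not code we ship:

import fsspec
import xarray as xr
import ipfsspec  # noqa: F401  (registers the "ipfs" protocol with fsspec)

def get_dataset(cid):
    # The CID names the dataset contents rather than a location, so the
    # same identifier works on any machine that can reach any copy, and
    # the returned data can never change silently.
    return xr.open_zarr(fsspec.get_mapper(f"ipfs://{cid}"))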

For at least those reasons, I figure that a global content-addressable storage system would be very helpful (it provides verifiable data integrity, trivial caching, and consequently a simple-to-implement system of globally distributed copies). IPFS is of course only one possibility, but the implementation of a single global namespace without the immediate need to name the data provider's location in the dataset identifier (thanks to a lookup in a distributed hash table) is particularly helpful in the setting outlined above.
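To make "verifiable data integrity" concrete: because the identifier is derived from the content, bytes fetched from any untrusted mirror can be checked against it. The sketch below uses a plain sha256 digest as a simplified stand-in for IPFS's multihash-based CIDs (real CID verification additionally involves chunking and multiformats encoding):

import hashlib

def verify(content, expected_digest):
    # The identifier doubles as a checksum: data served by any gateway,
    # mirror, or cache can be validated without trusting the server.
    return hashlib.sha256(content).hexdigest() == expected_digest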

Based on this reasoning, I am primarily interested in having datasets on top of IPFS, and that's still true. Having code on IPFS would be a neat addition (a single CID would suffice to reference the whole analysis as well as all the data which went into it), but it's probably also fine to have that as a second step. I made the PR mainly because @yuvipanda suggested it and it seemed relatively simple to implement. Also, as the use of IPFS here is rather controlled (basically only a single HTTP request to a gateway), I assumed the implementation would be relatively unproblematic.


Although I'm by now quite convinced that IPFS will provide a lot of benefits for scientific data storage, I'm also interested in other good and practical solutions. Thus, if there are public data (and code) repositories which cover all of the requirements above, I'd be glad to learn more about them, independent of this PR.

d70-t avatar Nov 29 '21 23:11 d70-t

Definitely a chicken-and-egg problem :smiley: . Personally I don't think repo2docker should take the lead in promoting a particular new/upcoming technology; instead I see its role as supporting reproducible research using existing, well-known tools that are already in use by the community.

For example, a very quick Google search brought up dat; should we also add support for that?

Perhaps a long-term solution to this problem is to make all content providers into plugins, so it's easy to extend r2d and experiment with new optional providers. A similar idea has already been suggested for buildpacks.
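For what that might look like, a minimal sketch using Python's entry-point mechanism; the group name repo2docker.content_providers is hypothetical, nothing like it exists in r2d today:

from importlib.metadata import entry_points

def discover_content_providers():
    # Third-party packages would declare providers under a (hypothetical)
    # entry-point group; r2d loads whatever is installed without having
    # to know about the providers in advance. The group= keyword
    # requires Python 3.10+.
    return [ep.load() for ep in entry_points(group="repo2docker.content_providers")]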

manics avatar Nov 30 '21 23:11 manics