
Populate registry via data repository API

Open dokempf opened this issue 2 years ago • 3 comments

Description of the desired feature:

When implementing #318, I realized that the DOI support in pooch currently does not allow automatically populating the pooch's registry through the repository API, although the information (including checksums) is readily available (I checked DataVerse and Figshare; Zenodo should have the same). I would like to add this at least for DataVerse, since only with this feature is the built-in pooch solution superior to my prior implementation. However, it would make sense to do this in a unified way across different data repositories.

Proposal: Add a function registry_from_doi that dispatches on the data repository type to implementation functions, each of which maps a DOI to a registry dictionary. Sample code:

MYPOOCH = pooch.create(base_url="doi:..", ...)
pooch.registry_from_doi(MYPOOCH)

There are many ways to formulate this interface. I would prefer one that does not need redundant specification of the DOI (which is why it takes the pooch object). Alternatively, a special marker object could be passed to pooch.create's registry= argument.
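To make the dispatch idea concrete, here is a minimal sketch of how it might look. It is not pooch code: the helper names, the repository keys, and the assumed shapes of the API responses (field names such as "key", "checksum", "computed_md5", and "dataFile") are illustrative assumptions that would need to be checked against each repository's API. Resolving the DOI and fetching the metadata is left out, so the sketch works on already-parsed JSON.

```python
def _zenodo_registry(metadata):
    # Assumed response shape: {"files": [{"key": name, "checksum": "md5:..."}]}
    return {f["key"]: f["checksum"] for f in metadata["files"]}


def _figshare_registry(metadata):
    # Assumed response shape: {"files": [{"name": name, "computed_md5": "..."}]}
    return {f["name"]: "md5:" + f["computed_md5"] for f in metadata["files"]}


def _dataverse_registry(metadata):
    # Assumed response shape:
    # {"data": {"latestVersion": {"files": [{"dataFile": {"filename", "md5"}}]}}}
    files = metadata["data"]["latestVersion"]["files"]
    return {
        f["dataFile"]["filename"]: "md5:" + f["dataFile"]["md5"] for f in files
    }


# Hypothetical dispatch table: repository type -> implementation function.
_REGISTRY_BUILDERS = {
    "zenodo": _zenodo_registry,
    "figshare": _figshare_registry,
    "dataverse": _dataverse_registry,
}


def registry_from_metadata(repository, metadata):
    """Map parsed repository metadata to a {file name: hash} dictionary."""
    try:
        builder = _REGISTRY_BUILDERS[repository]
    except KeyError:
        raise ValueError(f"Unsupported data repository: {repository!r}") from None
    return builder(metadata)
```

A registry_from_doi function along the lines proposed above would then resolve the DOI, detect the repository type, download the metadata, and hand it to this dispatcher.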

The above proposal could be combined with a refactoring that abstracts data repositories into objects with a well-defined interface, instead of juggling around with functions. This is a matter of personal taste though.

Are you willing to help implement and maintain this feature?

Yes, I can implement it and will be around for helping with maintenance.

dokempf avatar Jul 25 '22 10:07 dokempf

@dokempf that's a good idea. And it would make downloading from DOIs a bit easier since getting the file name and hash by hand is a bit tedious.

How about:

  • Keep the doi specification in the base_url
  • Add the method Pooch.load_registry_from_doi() that takes no arguments and assumes that the base_url is a DOI (should raise an error if it's not).

This way it's consistent with the current registry population from a file.
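A rough sketch of that behaviour, using a stand-in class rather than the real Pooch class: the class name DoiRegistrySketch and the injected fetch_registry callable are hypothetical and only there to keep the example free of network access. The point is the argument-free method and the error for non-DOI base URLs.

```python
class DoiRegistrySketch:
    """Stand-in for a Pooch-like object; not the actual pooch.Pooch class."""

    def __init__(self, base_url, fetch_registry):
        self.base_url = base_url
        self.registry = {}
        # Hypothetical hook: a callable that maps a DOI string to a
        # {file name: hash} dictionary (e.g. by querying the repository API).
        self._fetch_registry = fetch_registry

    def load_registry_from_doi(self):
        """Populate the registry from the DOI stored in base_url."""
        prefix = "doi:"
        if not self.base_url.startswith(prefix):
            raise ValueError(
                f"base_url must be a DOI ('doi:...'), got {self.base_url!r}"
            )
        # Drop the 'doi:' prefix and any trailing slash before the lookup.
        doi = self.base_url[len(prefix):].strip("/")
        self.registry.update(self._fetch_registry(doi))


# Usage with a fake lookup standing in for the repository API call.
def fake_fetch(doi):
    return {"data.csv": "md5:0123456789abcdef"}


sketch = DoiRegistrySketch("doi:10.0000/example/", fake_fetch)
sketch.load_registry_from_doi()
```

Because the method reads the DOI from base_url, the caller never has to specify it twice.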

> The above proposal could be combined with a refactoring that abstracts data repositories into objects with a well-defined interface, instead of juggling around with functions. This is a matter of personal taste though.

It would be best to do the refactoring in a follow-up to avoid making this PR too large and hard to review. But the DOI interface could probably be improved, as you've pointed out. One thing is that I would be a bit hesitant to have something coming close to a full-blown API wrapper for each repository (which should probably be a separate package that we then use).

Thanks for all the work you've been doing on Pooch!

leouieda avatar Aug 08 '22 15:08 leouieda

Thanks for your comments @leouieda - I agree with everything. I am on parental leave until September 5th, so I will only be able to work on this after that date. I guess we can move on with #318 in the meantime and do the refactoring and feature implementation from this issue after that has been merged.

dokempf avatar Aug 08 '22 17:08 dokempf

👍🏾 sounds good to me. And congrats @dokempf!

leouieda avatar Aug 09 '22 07:08 leouieda