
Redistribute datasets and models on Hugging Face

Open adamjstewart opened this issue 2 years ago • 13 comments

Summary

We should consider redistributing as many datasets and pre-trained models as we can on Hugging Face.

Rationale

Hugging Face provides a more reliable centralized repository for storing large binary files. It's a large company, so we don't have to worry about expired SSL certificates or servers going offline. We have full control over the files we upload, so we can make modifications (license permitting) to fix inconsistencies between model architectures.

It also provides significantly faster download speeds compared to similar sites. For example, for our ResNet-50 pre-trained weights (~100 MB):

  • Zenodo: 2 min, 45 sec
  • Hugging Face: 8 sec

For the EuroSAT dataset (~2 GB):

  • DFKI: 4 min, 11 sec
  • Hugging Face: 3 min, 7 sec
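To put the timings above on a common scale, here is a rough back-of-the-envelope sketch that converts them to throughput and speedup. It assumes the approximate sizes quoted above (~100 MB, and ~2 GB taken as 2000 MB); the helper names are illustrative and not part of TorchGeo.

```python
def seconds(minutes: int, secs: int) -> int:
    """Convert a 'min, sec' timing into total seconds."""
    return minutes * 60 + secs


def throughput_mb_per_s(size_mb: float, elapsed_s: float) -> float:
    """Average download throughput in MB/s."""
    return size_mb / elapsed_s


# Sizes are the approximate figures from the issue, not measured values.
timings = {
    "ResNet-50 weights (~100 MB)": (
        100,
        {"Zenodo": seconds(2, 45), "Hugging Face": seconds(0, 8)},
    ),
    "EuroSAT (~2 GB)": (
        2000,
        {"DFKI": seconds(4, 11), "Hugging Face": seconds(3, 7)},
    ),
}

for name, (size_mb, hosts) in timings.items():
    for host, elapsed in hosts.items():
        rate = throughput_mb_per_s(size_mb, elapsed)
        print(f"{name} from {host}: {rate:.1f} MB/s")

# Speedup = time on the original host / time on Hugging Face.
resnet_speedup = seconds(2, 45) / seconds(0, 8)
eurosat_speedup = seconds(4, 11) / seconds(3, 7)
```

Under these assumptions, Hugging Face is roughly 20 times faster for the model weights and about 1.3 times faster for EuroSAT, so the gain depends heavily on which host is being replaced.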

Implementation

First, we need to ensure that the dataset or model we are redistributing has a license that permits redistribution. If a license is missing or does not permit redistribution, we should reach out to the authors to see if a permissive license can be granted.

Once licensing is settled, we just need to upload the dataset or model to Hugging Face. The license chosen should match the original license. Any modifications from the original should be clearly documented, and a link should be added to the original source. Many licenses require this, and it is good practice to document it in any case.

Finally, the URL (and possibly MD5) in TorchGeo should be updated to point to the new download location.

Alternatives

We previously used Zenodo for this but download speeds were abysmal. A quick survey of UIUC AI PhD students found that everyone uses Hugging Face 🤗

Additional information

We already have quite a lot of datasets, and dataset authors are often unresponsive to these kinds of inquiries. It's likely unrealistic to expect that we'll be able to redistribute every dataset and model, so I won't start a checklist just yet. High-priority datasets and models include those with the following problems:

  • Unable to automatically download
  • Unreliable SSL certificates
  • Very large datasets that flake out during download
  • Slow download speeds

Again, we have to check the license first. Many datasets cannot be downloaded automatically precisely for legal reasons.

adamjstewart avatar Jan 31 '23 17:01 adamjstewart

Starting a work-in-progress list so that multiple people don't contact the same person.

Datasets

In-progress

| Source | License | Reason |
|---|---|---|
| USAVars | Not sure yet | Slow and failing download |

Completed

| Source | License | Reason |
|---|---|---|
| EuroSAT | EU Law | Expired SSL certificate |
| UC Merced | public domain | HTTP-only |

Models

In-progress

| Source | License | Reason |
|---|---|---|

Completed

| Source | License | Reason |
|---|---|---|
| Zhu Lab | CC-BY-4.0 | Required modifications |
| ServiceNow | Apache-2.0 | Required modifications |

adamjstewart avatar Jan 31 '23 18:01 adamjstewart

I think DynamicEarthNet is re-distributable (based on a conversation with @lukaskondmann)

calebrob6 avatar Feb 01 '23 06:02 calebrob6

This is correct. DynamicEarthNet is available under this license so redistribution is possible as long as attribution is given

lukaskondmann avatar Feb 01 '23 09:02 lukaskondmann

So2Sat is okay to be mirrored based on https://github.com/microsoft/torchgeo/issues/388.

calebrob6 avatar Feb 03 '23 20:02 calebrob6

From email conversations, OSCD and HRSCD both have CCA licenses which freely allow redistribution.

ReforesTree may require permission from the authors. They have a shared data agreement with WWF. They were able to redistribute on Zenodo, but we should check back with them to see if we can redistribute on Hugging Face.

adamjstewart avatar Feb 20 '23 22:02 adamjstewart

@calebrob6 I would like to redistribute the USAVars dataset if possible because the download is super slow and has failed several times. However, I am not sure what the actual source of this dataset is since it is only a reproduction. I saw that you had a repo about the paper, so I'm wondering if you know something about the source and license of the torchgeo USAVars dataset?

nilsleh avatar Feb 21 '23 21:02 nilsleh

Hey @nilsleh, yes, I helped create that dataset. We should definitely move it to Hugging Face. @estherrolf is soon going to make changes to the dataset so perhaps we can do that all together.

calebrob6 avatar Feb 21 '23 21:02 calebrob6

I would also like to +1 this. I've been having a ton of issues accessing model weights and files from Radiant Earth and I suspect they are no longer actively maintaining their endpoints.

yeelauren avatar Mar 23 '23 13:03 yeelauren

Hugging Face has a maximum individual file size of 50 GB 😢

adamjstewart avatar Mar 24 '23 14:03 adamjstewart

> I would also like to +1 this. I've been having a ton of issues accessing model weights and files from Radiant Earth and I suspect they are no longer actively maintaining their endpoints.

We're aware of these issues. They stem from a combination of causes, ranging from architectural limitations to problems with Azure blob storage that haven't been resolved yet. We're working on an updated version of MLHub that resolves them and will be available in the near future.

kbgg avatar Mar 24 '23 19:03 kbgg

With #1240 merged, can we move the USAVars dataset to HF? Because at the moment the download keeps failing through torchgeo. I still have the dataset locally, so I could upload it to HF and open a PR to change the download links :) @calebrob6, @estherrolf

nilsleh avatar Jun 21 '23 10:06 nilsleh

USAVars is CC-BY-4.0, so yes, we can redistribute on HF if you want.

adamjstewart avatar Feb 29 '24 12:02 adamjstewart