mirdata

Download for very large datasets should be by instruction

Open rabitt opened this issue 3 years ago • 2 comments

For very large datasets (e.g. > 20 GB), command line download does not make sense. In this case, we should instead give download instructions (or possibly a standalone download script?), like we do for unavailable datasets.

TODO - check our existing datasets, and update contributing instructions to add a non-standard case.

rabitt, Apr 12 '21 14:04
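
The proposal above could be sketched roughly as follows: a guard that prints manual instructions instead of downloading when a dataset's total size exceeds a cutoff. This is a minimal illustration, not mirdata's actual API. `Remote`, `download_or_instruct`, and the `SIZE_LIMIT_BYTES` constant are hypothetical names, and the 20 GB cutoff is taken only from the example in the issue text.

```python
from dataclasses import dataclass

GB = 1024 ** 3
SIZE_LIMIT_BYTES = 20 * GB  # assumed cutoff, from the "> 20 GB" example


@dataclass
class Remote:
    """Hypothetical description of one remote file (not a mirdata class)."""
    filename: str
    url: str
    size_bytes: int


def download_or_instruct(remotes, instructions, limit=SIZE_LIMIT_BYTES):
    """Return the remotes that should be fetched automatically.

    If the combined size exceeds `limit`, print the manual download
    instructions and return an empty list instead.
    """
    total = sum(r.size_bytes for r in remotes)
    if total > limit:
        print(instructions)
        return []
    return list(remotes)
```

A caller would then only start fetching when the returned list is non-empty, so oversized datasets fall back to the same instruction message used for unavailable datasets.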

I agree 100% for datasets that have very large files, but I don't see a problem if the dataset is composed of multiple small files (~2 GB)? In that case, if something fails, the user can download only the remaining files with partial_download, right?

magdalenafuentes, Apr 18 '21 01:04

I don't see a problem if the dataset is composed of multiple small files (~2 GB)? In that case, if something fails, the user can download only the remaining files with partial_download, right?

Yep, agree!

rabitt, Apr 18 '21 11:04
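
The recovery path discussed above, re-downloading only the parts that are still missing, might look something like the sketch below. `remaining_parts` and the remote keys and filenames are hypothetical; the sketch only checks whether a file exists, so a truncated file would still need a checksum check to be detected.

```python
import os
import tempfile


def remaining_parts(remotes, save_dir):
    """Return the keys of remotes whose file is not yet present in save_dir.

    `remotes` maps a remote key to its filename (hypothetical placeholders).
    """
    return [
        key
        for key, filename in remotes.items()
        if not os.path.exists(os.path.join(save_dir, filename))
    ]


# Usage: simulate a download that was interrupted after the first part.
with tempfile.TemporaryDirectory() as save_dir:
    open(os.path.join(save_dir, "part1.zip"), "w").close()
    missing = remaining_parts(
        {"part1": "part1.zip", "part2": "part2.zip"}, save_dir
    )
    print(missing)  # ['part2']
```

Assuming the dataset's download call accepts a list of remote keys through partial_download, the returned list could be passed to it to retry only the missing parts.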

OK, so I guess some existing datasets should have downloading disabled, and we should print the instructions when download is called?

  • datacos
  • mtg-jamendo
  • AcousticBrainz

Which other ones are problematic?

nkundiushuti, Jan 31 '23 19:01

After looking into the code:

  • datacos has multiple parts
  • mtg-jamendo does not have download enabled
  • AcousticBrainz has multiple parts

nkundiushuti, Feb 01 '23 08:02
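
An audit like the one above could be partly automated. The sketch below flags a dataset when it has nothing to download or when any single file exceeds a per-file limit, which follows the thread's conclusion that multi-part datasets with small files are fine. The function name, the 20 GB per-file limit, the dataset names, and all sizes are placeholders for illustration and do not describe any real dataset.

```python
def needs_instructions(file_sizes_gb, max_file_gb=20):
    """Return True if a dataset should ship instructions instead of a download.

    A dataset is flagged when it has no downloadable files, or when any
    single file is larger than `max_file_gb` (an assumed limit).
    """
    return not file_sizes_gb or max(file_sizes_gb) > max_file_gb


# Placeholder catalog: names and sizes are invented for this example.
catalog = {
    "multi_part_dataset": [2, 2, 2],   # several small parts: fine
    "single_archive_dataset": [60],    # one very large file: flag
    "no_remotes_dataset": [],          # download not enabled: flag
}

flagged = [name for name, sizes in catalog.items() if needs_instructions(sizes)]
print(flagged)  # ['single_archive_dataset', 'no_remotes_dataset']
```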