mirdata Download for very large datasets should be by instruction

Open rabitt opened this issue 3 years ago • 2 comments

For very large datasets (e.g. > 20 GB), command line download does not make sense. In this case, we should instead give download instructions (or possibly a standalone download script?), like we do for unavailable datasets.

TODO - check our existing datasets, and update contributing instructions to add a non-standard case.

Apr 12 '21 14:04 rabitt

I agree 100% for datasets that have super large files, but I don't see a problem if the dataset is composed of multiple small files (~2 GB). In that case, if something fails, the user can download only the remaining files with partial_download, right?

Apr 18 '21 01:04 magdalenafuentes

> I don't see a problem if the dataset is composed of multiple small files (~2GB) ? In that case if something fails the user can download only the remaining files with partial_download right?

Yep, agree!

Apr 18 '21 11:04 rabitt
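
The "download only the remaining files" idea above could be sketched roughly as follows. This is not mirdata's code: `remaining_parts` and the `remotes` layout (part name mapped to an expected filename) are hypothetical, and the sketch only checks that a file exists, whereas a real implementation would also want to verify checksums.

```python
from pathlib import Path


def remaining_parts(remotes, save_dir):
    """Return the part names whose files are not yet present in save_dir.

    `remotes` is a hypothetical mapping of part name -> expected filename.
    Only existence is checked; a partially written file would be counted
    as done, so a checksum comparison is advisable in practice.
    """
    save_dir = Path(save_dir)
    return [
        key
        for key, filename in remotes.items()
        if not (save_dir / filename).is_file()
    ]


# Hypothetical dataset split into three archives:
# remotes = {"part_1": "data_1.zip", "part_2": "data_2.zip", "part_3": "data_3.zip"}
# todo = remaining_parts(remotes, "/path/to/dataset")
# The resulting list could then be passed on, e.g. as the partial_download
# argument mentioned in the thread (assumed to accept a list of part names).
```

The list comprehension preserves the order of `remotes`, so parts are retried in the order they were declared.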

OK, so I guess some existing datasets should have downloading disabled, and we should print the instructions when download is called?

datacos
mtg-jamendo
AcousticBrainz

Which other ones are problematic?

Jan 31 '23 19:01 nkundiushuti
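
A minimal sketch of the "print the instructions when download is called" behaviour, under stated assumptions: `plan_download`, the size-in-bytes `remotes` mapping, and the per-file threshold are all hypothetical and not part of mirdata. The 20 GB figure comes from the example in the issue, applied here per file because the thread agrees that multi-part datasets with small files are fine.

```python
GB = 1024 ** 3
MAX_FILE_BYTES = 20 * GB  # assumed limit, borrowed from the "> 20 GB" example


def plan_download(remotes, instructions, max_file_bytes=MAX_FILE_BYTES):
    """Decide between automatic download and manual instructions.

    `remotes` is a hypothetical mapping of part name -> size in bytes.
    If there is nothing to fetch automatically, or any single file is
    larger than the limit, the instructions are printed and no parts are
    returned. Otherwise all part names are returned for download.
    """
    if not remotes or any(size > max_file_bytes for size in remotes.values()):
        print(instructions)
        return "instructions", []
    return "download", list(remotes)


# A dataset made of several ~2 GB parts stays downloadable:
# plan_download({"part_1": 2 * GB, "part_2": 2 * GB}, "See the dataset website.")
# A dataset with one 50 GB archive falls back to printed instructions:
# plan_download({"full": 50 * GB}, "See the dataset website.")
```

Checking the largest file rather than the total size is a design choice that follows the discussion above: a failed multi-part download can be resumed part by part, while a failed single huge file cannot.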

after looking into the code:

datacos has multiple parts
mtg-jamendo does not have the download enabled
AcousticBrainz has multiple parts

Feb 01 '23 08:02 nkundiushuti
