Reproducibility of dataset downloads

Open faroit opened this issue 4 years ago • 2 comments

When downloading corpora from versioned data stores, I would expect audiomate to pin a tag or a specific hash of the dataset. That way, users can be sure that a specific version of audiomate always yields an identical corpus, which fosters reproducibility.

For example, take the ESC-50 corpus: its root URL downloads directly from the master branch, see https://github.com/ynop/audiomate/blob/28696c0e46ab1d7f3d3f53f8a9086c724b7b947a/audiomate/corpus/io/esc.py#L11

To improve reproducibility, I suggest that audiomate use tags where possible (GitHub, Zenodo, ...) and also provide a checksum mechanism that verifies that a download completed successfully.
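A minimal sketch of what pinning plus checksum verification could look like. The function names below are hypothetical and not part of audiomate; the only assumption about GitHub is its archive URL pattern, which accepts a tag or commit hash in place of a branch name.

```python
import hashlib
from pathlib import Path


def pinned_github_archive_url(owner: str, repo: str, ref: str) -> str:
    """Build an archive URL for a fixed tag or commit hash instead of a branch.

    Hypothetical helper: `ref` should be a tag or a full commit hash,
    not a moving branch name such as "master".
    """
    return f"https://github.com/{owner}/{repo}/archive/{ref}.zip"


def sha256_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha256: str) -> None:
    """Raise ValueError if the downloaded file does not match the expected digest."""
    actual = sha256_of_file(path)
    if actual != expected_sha256.lower():
        raise ValueError(
            f"Checksum mismatch for {path}: expected {expected_sha256}, got {actual}"
        )
```

A downloader would store the expected digest next to the pinned URL and call `verify_download` right after fetching the archive, so a changed or truncated file fails loudly instead of silently producing a different corpus.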

This issue is part of a JOSS review https://github.com/openjournals/joss-reviews/issues/2135

faroit, May 18 '20 15:05

Yes, that is a good idea. I also had something similar in mind. Due to some datasets changing "frequently", I wanted to introduce versions. So you could actually select which version you want to use. Of course this should/could be combined with your approach with tags and checksums.

ynop, May 18 '20 19:05

> Due to some datasets changing "frequently", I wanted to introduce versions. So you could actually select which version you want to use.

Yes, that's also a good idea, and it could be added on top of tags. But I think you might end up with a less confusing API if each version of audiomate only supports a fixed number of dataset versions. Of course, this would come with the drawback that users would be unable to load an older dataset version when they use a newer version of audiomate. But I don't think this would happen much in practice, as most users would use audiomate for a single dataset per project.
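As a rough illustration of "one audiomate release supports a fixed number of dataset versions", here is a sketch of a registry that ships with the library. Everything in it is hypothetical: the corpus name, the URLs, the placeholder digests and the `resolve_release` function are made up for this example and are not audiomate's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatasetRelease:
    url: str      # pinned download location (tag or hash, never "latest")
    sha256: str   # expected digest of the downloaded archive


# Hypothetical registry frozen into one library release.
# The digests are placeholders, not real checksums.
SUPPORTED_RELEASES = {
    "example-corpus": {
        "1.0": DatasetRelease("https://example.org/example-corpus/1.0.zip", "0" * 64),
        "1.1": DatasetRelease("https://example.org/example-corpus/1.1.zip", "1" * 64),
    },
}

# Version used when the caller does not ask for a specific one.
DEFAULT_VERSION = {"example-corpus": "1.1"}


def resolve_release(corpus: str, version: Optional[str] = None) -> DatasetRelease:
    """Look up the pinned release for a corpus, rejecting unsupported versions."""
    releases = SUPPORTED_RELEASES.get(corpus)
    if releases is None:
        raise ValueError(f"Unknown corpus: {corpus}")
    chosen = version if version is not None else DEFAULT_VERSION[corpus]
    if chosen not in releases:
        supported = ", ".join(sorted(releases))
        raise ValueError(
            f"Version {chosen} of {corpus} is not supported by this release "
            f"(supported: {supported})"
        )
    return releases[chosen]
```

With this layout, `resolve_release("example-corpus")` returns the default pinned release, while an unsupported version fails with a message listing what the installed library release can load.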

For now, I would suggest freezing as many downloads as possible instead of pointing to the latest version.

faroit, May 18 '20 20:05