mirdata
mirdata copied to clipboard
Require datasets to be in mir-datasets db before adding loader
Just thinking out loud, as a way to try and improve the dissemination of mir datasets. There has already been some work in the ismir community to make a public list of datasets: https://github.com/ismir/mir-datasets It would be nice to require as part of the checklist of accepting a new dataset/loader here to require that the dataset is also added to mir-datasets so that it appears on the list at http://www.audiocontentanalysis.org/data-sets/
I like this idea a lot. We can also create a 'checklist' of each of the datasets we've included and which are not yet included to encourage new contributions :)
We'll embed the mir-datasets.yaml
file on the mirdata repo, and add a column that indicates if the dataset has a mirdata
loader or not. Also as part of the checklist for adding a loader you will now have to indicate the loader availability in this yaml
file.
Adding a to-do list here:
- [x] Audit if our supported datasets are on the
mir_datasets
list - [x] add missing datasets
- [x] update contributing docs
TinySOL was missing from the list. I requested to add it. https://github.com/ismir/mir-datasets/pull/25
DALI is also an open PR: https://github.com/ismir/mir-datasets/pull/23
Hmm yeah it looks like the turnaround time for adding to the ismir/mir-datasets repo is a bit long, and it might not makes sense for this to be a strict requirement. Maybe we can ask for admin access to that repo so we can approve/add repos to the list?
yes, i would be in favor of that. at the moment, faroit and ejhumphrey are the only contributors, and there have been no commits since November 15th, 2019.
DALI and Tinysol will be in mir-datasets once https://github.com/ismir/mir-datasets/pull/27 get merged
This didn't go to far at the end. Thoughts on this moving forward? @rabitt @nkundiushuti
Since the process of creating a loader is slower than adding a dataset in a yaml file, I think we can do a check before merging the PR We should do in addition:
- include this in the contributing
- I can check if the current datasets in mirdata are in the list
I agree we should do this. One concern for this requirement was that we don't have control over the speed of merging in the ismir/mir-datasets repository, and we didn't want this to block mirdata PRs. Maybe a solution is we could see if the maintainers are willing to let us be admins of ismir/mir-datasets as well?
Missing entries now are: Beatport EDM key, cante100, Mridangam Stroke, Saraga. Small inconsistencies: GiantSteps tempo has audio available though in the list it says it doesn't.
Maybe we can do this every release instead of every time we add a loader?
I'm removing this one from the release and will do right after