mirdata icon indicating copy to clipboard operation
mirdata copied to clipboard

Require datasets to be in mir-datasets db before adding loader

Open alastair opened this issue 5 years ago • 13 comments

Just thinking out loud, as a way to try and improve the dissemination of mir datasets. There has already been some work in the ismir community to make a public list of datasets: https://github.com/ismir/mir-datasets It would be nice to require as part of the checklist of accepting a new dataset/loader here to require that the dataset is also added to mir-datasets so that it appears on the list at http://www.audiocontentanalysis.org/data-sets/

alastair avatar Nov 11 '19 13:11 alastair

I like this idea a lot. We can also create a 'checklist' of each of the datasets we've included and which are not yet included to encourage new contributions :)

rabitt avatar Nov 14 '19 14:11 rabitt

We'll embed the mir-datasets.yaml file on the mirdata repo, and add a column that indicates if the dataset has a mirdata loader or not. Also as part of the checklist for adding a loader you will now have to indicate the loader availability in this yaml file.

magdalenafuentes avatar Jan 30 '20 15:01 magdalenafuentes

Adding a to-do list here:

  • [x] Audit if our supported datasets are on the mir_datasets list
  • [x] add missing datasets
  • [x] update contributing docs

rabitt avatar Mar 05 '20 20:03 rabitt

TinySOL was missing from the list. I requested to add it. https://github.com/ismir/mir-datasets/pull/25

DALI is also an open PR: https://github.com/ismir/mir-datasets/pull/23

lostanlen avatar Mar 06 '20 15:03 lostanlen

Hmm yeah it looks like the turnaround time for adding to the ismir/mir-datasets repo is a bit long, and it might not makes sense for this to be a strict requirement. Maybe we can ask for admin access to that repo so we can approve/add repos to the list?

rabitt avatar Mar 10 '20 17:03 rabitt

yes, i would be in favor of that. at the moment, faroit and ejhumphrey are the only contributors, and there have been no commits since November 15th, 2019.

lostanlen avatar Mar 10 '20 19:03 lostanlen

DALI and Tinysol will be in mir-datasets once https://github.com/ismir/mir-datasets/pull/27 get merged

lostanlen avatar Apr 05 '20 22:04 lostanlen

This didn't go to far at the end. Thoughts on this moving forward? @rabitt @nkundiushuti

magdalenafuentes avatar Nov 20 '20 10:11 magdalenafuentes

Since the process of creating a loader is slower than adding a dataset in a yaml file, I think we can do a check before merging the PR We should do in addition:

  • include this in the contributing
  • I can check if the current datasets in mirdata are in the list

nkundiushuti avatar Nov 20 '20 15:11 nkundiushuti

I agree we should do this. One concern for this requirement was that we don't have control over the speed of merging in the ismir/mir-datasets repository, and we didn't want this to block mirdata PRs. Maybe a solution is we could see if the maintainers are willing to let us be admins of ismir/mir-datasets as well?

rabitt avatar Nov 24 '20 00:11 rabitt

Missing entries now are: Beatport EDM key, cante100, Mridangam Stroke, Saraga. Small inconsistencies: GiantSteps tempo has audio available though in the list it says it doesn't.

magdalenafuentes avatar Dec 14 '20 15:12 magdalenafuentes

Maybe we can do this every release instead of every time we add a loader?

magdalenafuentes avatar Dec 14 '20 16:12 magdalenafuentes

I'm removing this one from the release and will do right after

magdalenafuentes avatar Dec 18 '20 15:12 magdalenafuentes