mirdata
OpenMic2018
https://zenodo.org/record/1432913#.XqAkvdMzZ24 https://github.com/cosmir/openmic-2018
This is embarrassing :grin: mind if I take a crack at it?
Please do!
Working on this in between ISMIR sessions today. One question: what do the benevolent maintainers think about using pandas instead of raw CSV munging? I don't want to add unnecessary dependency bloat, but the OpenMIC annotations and metadata are stored in a few (large) CSV files that would be much easier to process and align if loaded as dataframes. (Plus I have general misgivings about writing your own CSV parser.)
... actually I just realized that there's already an implicit dependency on pandas through jams.
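To illustrate the kind of alignment I mean, here's a toy sketch (made-up frame contents standing in for the real CSVs, not the actual file schemas):

```python
import pandas as pd

# Toy frames standing in for the real OpenMIC CSVs (contents are
# illustrative only, not the actual file schemas).
labels = pd.DataFrame(
    {"clarinet": [0.17105]},
    index=pd.Index(["000046_3840"], name="sample_key"),
)
meta = pd.DataFrame(
    {"track_title": ["Yosemite"]},
    index=pd.Index(["000046_3840"], name="sample_key"),
)

# With dataframes, aligning annotations and metadata on the clip
# key is a one-liner instead of manual dict bookkeeping.
merged = labels.join(meta)
print(merged)
```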
Ok, I have something approaching a prototype working. Before I go much further with it, I want to solicit some feedback.
When you dig into it, OpenMIC is a fairly complex dataset, and I'm not sure how much of it makes sense to expose through the mirdata API. So far, what I have is able to expose raw audio, pre-computed VGGish features (similar to how datacos does it), and a big whack of metadata largely imported from FMA. (I haven't added the properties for these yet.)
The labels are pulled from the "aggregated-labels.csv" file, which encodes only the mean ratings for observed instrument/clip pairs. Each rating is a continuous (but generally quantized) value between 0 and 1, where 0 indicates absence and 1 indicates presence of the label; a NaN indicates that the pair was never rated.
We also have the number of raters available, but I think it's not worth exposing that. Most of the time it's 3; there are a handful of 1s, and a long tail that reaches up into the hundreds. (I believe these were our honeypot examples.)
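To make that concrete, here's roughly how the wide label matrix falls out of a long-format table (the column names here are my own stand-ins, not necessarily the actual headers in aggregated-labels.csv):

```python
import pandas as pd

# Long-format rows mimicking aggregated-labels.csv; column names
# are assumed for illustration.
agg = pd.DataFrame({
    "sample_key": ["000046_3840", "000046_3840", "000135_483840"],
    "instrument": ["clarinet", "flute", "voice"],
    "relevance": [0.17105, 0.0, 1.0],
})

# One row per clip, one column per instrument; instrument/clip
# pairs that were never rated come out as NaN automatically.
wide = agg.pivot(index="sample_key", columns="instrument", values="relevance")
print(wide)
```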
All of this, together with the pre-generated partition, gives a track metadata structure like the following:
In [112]: df['000046_3840']
Out[112]:
{
'track_id': 46,
'album_id': 4.0,
'album_title': 'Niris',
'album_url': 'http://freemusicarchive.org/music/Chris_and_Nicky_Andrews/Niris/',
'artist_id': 4,
'artist_name': 'Nicky Cook',
'artist_url': 'http://freemusicarchive.org/music/Chris_and_Nicky_Andrews/',
'artist_website': nan,
'license_image_file': 'http://i.creativecommons.org/l/by-nc-nd/3.0/88x31.png',
'license_image_file_large': 'http://fma-files.s3.amazonaws.com/resources/img/licenses/by-nc-nd.png',
'license_parent_id': nan,
'license_title': 'Attribution-NonCommercial-NoDerivatives (aka Music Sharing) 3.0 International',
'license_url': 'http://creativecommons.org/licenses/by-nc-nd/3.0/',
'tags': '[]',
'track_bit_rate': 256000.0,
'track_comments': 0,
'track_composer': nan,
'track_copyright_c': nan,
'track_copyright_p': nan,
'track_date_created': '11/26/2008 01:49:53 AM',
'track_date_recorded': '1/01/2008',
'track_disc_number': 1,
'track_duration': '01:44',
'track_explicit': nan,
'track_explicit_notes': nan,
'track_favorites': 0,
'track_file': 'music/WFMU/Nicky_Cook__Chris_Andrews/Niris/Nicky_Cook__Chris_Andrews_-_08_-_Yosemite.mp3',
'track_genres': [
{'genre_id': '76', 'genre_title': 'Experimental Pop', 'genre_url': 'http://freemusicarchive.org/genre/Experimental_Pop/'},
{'genre_id': '103', 'genre_title': 'Singer-Songwriter', 'genre_url': 'http://freemusicarchive.org/genre/Singer-Songwriter/'}
],
'track_image_file': 'https://freemusicarchive.org/file/images/albums/Chris_and_Nicky_Andrews_-_Niris_-_2009113012134556.jpg',
'track_information': nan,
'track_instrumental': 0,
'track_interest': 252,
'track_language_code': 'en',
'track_listens': 171,
'track_lyricist': nan,
'track_number': 8,
'track_publisher': nan,
'track_title': 'Yosemite',
'track_url': 'http://freemusicarchive.org/music/Chris_and_Nicky_Andrews/Niris/Yosemite',
'start_time': 3.84,
'split': 'train',
'accordion': nan,
'banjo': nan,
'bass': nan,
'cello': nan,
'clarinet': 0.1710499999999999,
'cymbals': nan,
'drums': nan,
'flute': 0.0,
'guitar': nan,
'mallet_percussion': nan,
'mandolin': nan,
'organ': nan,
'piano': nan,
'saxophone': nan,
'synthesizer': nan,
'trombone': nan,
'trumpet': 0.0,
'ukulele': nan,
'violin': nan,
'voice': nan
}
Note that the OpenMIC labels are the last twenty fields, immediately preceded by the split identifier (train or test), with the FMA metadata before that.
So far so good. The question mark for me is what to do with the disaggregated rating data. It would be super useful to have for crowdsourcing research (less so for the instrument recognition task, unless you're really digging into personalized models), so I'd like to include it if possible. It looks something like the following:
In [64]: df = pd.read_csv('openmic-2018-individual-responses.csv', index_col=0)
In [65]: df.head(10)
Out[65]:
worker_id worker_trust channel instrument response
sample_key
000046_3840 b1281110 0.8146 a163 flute 0.0
000046_3840 67a2a2bf 0.9091 a163 trumpet 0.0
000046_3840 9c5f715c 0.9167 a163 trumpet 0.0
000046_3840 dddd907a 0.7273 125a trumpet 0.0
000046_3840 892f3c66 0.7692 a163 clarinet 0.0
000046_3840 af7e56ee 0.8000 a163 clarinet 1.0
000046_3840 91cdd5c5 0.7708 125a flute 0.0
000046_3840 68de85ac 0.7207 a163 flute 0.0
000046_3840 2edc8001 0.7692 125a clarinet 0.0
000135_483840 d975913e 1.0000 125a voice 1.0
...
The issue is that it's rather large and unwieldy, so it might not be easy to pack into the per-track metadata object. (It also might not be appropriate to do so, e.g., if you wanted to query by annotator instead of by track.) Since different tracks have different numbers of ratings, we can't just tack on additional columns. We might be able to do some kind of pivot/aggregate to pack all the ratings for a particular track into a nested object (kind of like the genre tags above), but that feels clumsy to me. Is there precedent elsewhere in mirdata for this sort of thing? Or do the benevolent maintainers have thoughts on how to proceed?
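For concreteness, the pivot/aggregate idea might look like the following (toy rows in the shape of the individual-responses table; this is a sketch, not what's in the prototype):

```python
import pandas as pd

# Toy rows in the shape of openmic-2018-individual-responses.csv.
df = pd.DataFrame({
    "sample_key": ["000046_3840", "000046_3840", "000135_483840"],
    "worker_id": ["b1281110", "67a2a2bf", "d975913e"],
    "worker_trust": [0.8146, 0.9091, 1.0],
    "channel": ["a163", "a163", "125a"],
    "instrument": ["flute", "trumpet", "voice"],
    "response": [0.0, 0.0, 1.0],
}).set_index("sample_key")

# Pack each clip's ratings into a list of dicts, mirroring the
# nested genre-tag structure in the metadata above.
responses = {
    key: grp.to_dict(orient="records")
    for key, grp in df.groupby(level=0)
}
print(responses["000046_3840"])
```

This keeps the per-track object variable-length without padding columns, but (as noted) it makes annotator-centric queries awkward.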