Work in progress

TODO add evaluation segments

Add Dataset

Dataset Name: Audioset
Issue Reference: #2399
dataset_info.json Gist: https://gist.github.com/whatwilliam/ffecb3341ca5c7c7425aca35dd5fef9d

Description

The AudioSet dataset is a large-scale collection of human-labeled 10-second sound clips drawn from YouTube videos. To collect all our data we worked with human annotators who verified the presence of sounds they heard within YouTube segments. To nominate segments for annotation, we relied on YouTube metadata and content-based search.

This dataset takes in a manual directory of 10-second .mp3 files taken from YouTube. It yields audio loaded with pydub's AudioSegment, together with the corresponding labels according to the class_labels_indices.csv provided by AudioSet.

I took balanced_train_segments.csv, downloaded the .mp3 files from YouTube using youtube-dl, and cut them into the 10-second segments whose start and end times I parsed from that same CSV file.
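
For reference, the preprocessing described above could be sketched roughly as follows. This is an illustrative sketch, not the code in this PR. It assumes the published AudioSet CSV layouts: `#`-prefixed header lines followed by `YTID, start_seconds, end_seconds, positive_labels` rows in the segments file, and `index,mid,display_name` columns in class_labels_indices.csv. It also assumes each downloaded .mp3 still holds the full video audio. The names `Segment`, `parse_segments`, `load_label_index`, and `cut_segment` are hypothetical helpers made up for this example.

```python
import csv
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Segment:
    """One row of a segments CSV (hypothetical helper type)."""
    ytid: str
    start_ms: int
    end_ms: int
    label_mids: Tuple[str, ...]


def parse_segments(lines: Iterable[str]) -> List[Segment]:
    """Parse rows of the form: YTID, start_seconds, end_seconds, "mid1,mid2"."""
    # Header lines in the segments file start with '#', so drop them first.
    data_lines = (ln for ln in lines if not ln.startswith("#"))
    # skipinitialspace lets the quoted label field be recognised after ", ".
    reader = csv.reader(data_lines, skipinitialspace=True)
    segments = []
    for row in reader:
        if not row:  # skip blank lines
            continue
        ytid, start_s, end_s, labels = row
        segments.append(
            Segment(
                ytid=ytid,
                # pydub slices in milliseconds, so convert from seconds here.
                start_ms=int(round(float(start_s) * 1000)),
                end_ms=int(round(float(end_s) * 1000)),
                label_mids=tuple(labels.split(",")),
            )
        )
    return segments


def load_label_index(lines: Iterable[str]) -> Dict[str, int]:
    """Map each label id (mid) to its integer class index."""
    return {row["mid"]: int(row["index"]) for row in csv.DictReader(lines)}


def cut_segment(mp3_path: str, segment: Segment):
    """Return the labelled clip as a pydub AudioSegment.

    Assumes mp3_path holds the full, untrimmed audio of the video.
    """
    # Imported lazily so the parsing helpers work without pydub installed.
    from pydub import AudioSegment

    audio = AudioSegment.from_file(mp3_path, format="mp3")
    return audio[segment.start_ms:segment.end_ms]
```

With these helpers, each clip's integer labels would be `[label_index[mid] for mid in segment.label_mids]`, where `label_index` is the mapping returned by `load_label_index`.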

Checklist

[x] Address all TODO's
[x] Add alphabetized import to subdirectory's __init__.py
[x] Run download_and_prepare successfully
[x] Add checksums file
[x] Properly cite in BibTeX format
[x] Add passing test(s)
[x] Add test data
[x] If using additional dependencies (e.g. scipy), use lazy_imports (if applicable)
[x] Add data generation script (if applicable)
[x] Lint code

Sep 25 '20 21:09 whatwilliam

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

:memo: Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.

What to do if you already signed the CLA

Individual signers

It's possible we don't have your GitHub username or you're using a different email address on your commit. Check your existing CLA data and verify that your email is set on your git commits.

Corporate signers

Your company has a Point of Contact who decides which employees are authorized to participate. Ask your POC to be added to the group of authorized contributors. If you don't know who your Point of Contact is, direct the Google project maintainer to go/cla#troubleshoot (Public version).
The email used to register you as an authorized contributor must be the email used for the Git commit. Check your existing CLA data and verify that your email is set on your git commits.
The email used to register you as an authorized contributor must also be attached to your GitHub account.

ℹ️ Googlers: Go here for more info.

Sep 25 '20 21:09 googlebot

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

Sep 25 '20 21:09 googlebot

Similar to #2425

Sep 26 '20 12:09 vijayphoenix

Hi @vijayphoenix, I have finished working on all the TODOs for audioset. I was wondering whether there are any updates on merging this pull request into tensorflow_datasets.

Oct 14 '20 21:10 whatwilliam

From now on, all the new datasets will follow the One folder per dataset model. So, could you please update the PR according to the new Creating a dataset guide?

Hi @vijayphoenix, I've finished resolving the issues with my code that you have pointed out. Regarding the new dataset format, because I've been working in the old format, when I register Audioset according to the guide it gives me an error because audioset has already been registered. Is there any way for me to unregister audioset, or will I have to create a whole new repo?

Oct 18 '20 04:10 whatwilliam

It would be great to have AudioSet available in TFDS but I guess this PR is quite stale now, two years later. Is there something I could contribute to make this happen, perhaps?

Jul 19 '22 13:07 carlthome

Added Audioset TFDS