Added Audioset TFDS
Work in progress
- TODO add evaluation segments
Add Dataset
- Dataset Name: Audioset
- Issue Reference: #2399
dataset_info.json Gist: https://gist.github.com/whatwilliam/ffecb3341ca5c7c7425aca35dd5fef9d
Description
The AudioSet dataset is a large-scale collection of human-labeled 10-second sound clips drawn from YouTube videos. To collect all our data we worked with human annotators who verified the presence of sounds they heard within YouTube segments. To nominate segments for annotation, we relied on YouTube metadata and content-based search.
This dataset reads a manually downloaded directory of 10-second .mp3 files taken from YouTube. It yields audio decoded with pydub's AudioSegment, along with the corresponding labels from the class_labels_indices.csv file provided by AudioSet.
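As a rough illustration of the label lookup, here is a minimal sketch that maps AudioSet label IDs to class indices. It assumes class_labels_indices.csv has the columns index, mid, and display_name; the helper name load_label_map and the inline sample rows are illustrative only and are not taken from the PR code.

```python
import csv
import io

# Sample rows in the assumed layout of class_labels_indices.csv
# (columns: index, mid, display_name). Illustrative data only.
SAMPLE_LABELS_CSV = '''index,mid,display_name
0,/m/09x0r,"Speech"
1,/m/05zppz,"Male speech, man speaking"
'''


def load_label_map(csv_file):
    """Return {mid: (index, display_name)} from an open labels CSV file.

    Hypothetical helper for illustration; not the PR's implementation.
    """
    reader = csv.DictReader(csv_file)
    return {
        row["mid"]: (int(row["index"]), row["display_name"])
        for row in reader
    }


label_map = load_label_map(io.StringIO(SAMPLE_LABELS_CSV))
print(label_map["/m/09x0r"])
```

In a real run, the same function could be given the opened class_labels_indices.csv file instead of the in-memory sample.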
Starting from balanced_train_segments.csv, I downloaded the .mp3 files from YouTube using youtube-dl and cut them into the 10-second segments that I parsed from the same CSV file.
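The segment parsing described above might look roughly like the sketch below. It assumes the segments CSV starts with comment lines prefixed by "#" and then lists rows of the form YTID, start_seconds, end_seconds, "label1,label2". The function name parse_segments, the sample video ID, and the sample rows are hypothetical and not taken from the PR code.

```python
import csv
import io

# Sample in the assumed layout of balanced_train_segments.csv.
# The video ID below is made up for illustration.
SAMPLE_SEGMENTS_CSV = '''# header comment line
# YTID, start_seconds, end_seconds, positive_labels
abc123XYZ_0, 30.000, 40.000, "/m/09x0r,/t/dd00088"
'''


def parse_segments(csv_file):
    """Parse segment rows into dicts with millisecond offsets.

    Hypothetical helper for illustration; not the PR's implementation.
    """
    # Skip the '#' comment lines that precede the data rows.
    rows = (line for line in csv_file if not line.startswith("#"))
    segments = []
    # skipinitialspace lets the quoted label field follow ", ".
    for row in csv.reader(rows, skipinitialspace=True):
        if not row:
            continue  # ignore blank lines
        ytid, start, end, labels = row
        segments.append({
            "ytid": ytid,
            "start_ms": int(round(float(start) * 1000)),
            "end_ms": int(round(float(end) * 1000)),
            "labels": labels.split(","),
        })
    return segments


segments = parse_segments(io.StringIO(SAMPLE_SEGMENTS_CSV))
print(segments[0])
```

Offsets are kept in milliseconds because pydub's AudioSegment is sliced in milliseconds, so a clip could then be cut with something like AudioSegment.from_mp3(path)[start_ms:end_ms], assuming pydub and an MP3 decoder are installed.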
Checklist
- [x] Address all TODO's
- [x] Add alphabetized import to subdirectory's __init__.py
- [x] Run download_and_prepare successfully
- [x] Add checksums file
- [x] Properly cite in BibTeX format
- [x] Add passing test(s)
- [x] Add test data
- [x] If using additional dependencies (e.g. scipy), use lazy_imports (if applicable)
- [x] Add data generation script (if applicable)
- [x] Lint code
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).
:memo: Please visit https://cla.developers.google.com/ to sign.
Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.
What to do if you already signed the CLA
Individual signers
- It's possible we don't have your GitHub username or you're using a different email address on your commit. Check your existing CLA data and verify that your email is set on your git commits.
Corporate signers
- Your company has a Point of Contact who decides which employees are authorized to participate. Ask your POC to be added to the group of authorized contributors. If you don't know who your Point of Contact is, direct the Google project maintainer to go/cla#troubleshoot (Public version).
- The email used to register you as an authorized contributor must be the email used for the Git commit. Check your existing CLA data and verify that your email is set on your git commits.
- The email used to register you as an authorized contributor must also be attached to your GitHub account.
ℹ️ Googlers: Go here for more info.
Similar to #2425
Hi @vijayphoenix , I have finished working on all the TODOs for audioset. I was wondering if there were any updates to merge this pull request with tensorflow_datasets.
From now on, all new datasets will follow the "One folder per dataset" model. So, could you please update the PR according to the new "Creating a dataset" guide?
Hi @vijayphoenix, I've finished resolving the issues with my code that you have pointed out. Regarding the new dataset format, because I've been working in the old format, when I register Audioset according to the guide it gives me an error because audioset has already been registered. Is there any way for me to unregister audioset, or will I have to create a whole new repo?
It would be great to have AudioSet available in TFDS but I guess this PR is quite stale now, two years later. Is there something I could contribute to make this happen, perhaps?