TAPAD
The Abuse Project Audio Dataset (TAPAD). Think MNIST for audio profanity.
The Abuse Project Audio Dataset (TAPAD)
World's largest profanity audio dataset
Dataset consists of 26,365 audio files
Click here for documentation
See The Abuse Project
TAPAD (∿) is an open dataset: it will grow over time as more data is contributed. To enable reproducibility and accurate citation, the dataset is versioned using git tags.
Current Status & ID3
Property | Value |
---|---|
Total files | 26,365 |
Dataset updated | July 30, 2019 |
Language classes | 75 |
File Type | MP3 |
Mime Type | audio/mpeg |
Mpeg Audio Version | 2 |
Audio Layer | 3 |
Audio Bitrate | 32 kbps |
Sample Rate | 24000 |
Channel Mode | Single Channel |
Ms Stereo | Off |
Intensity Stereo | Off |
Codec Type | audio |
Codec Time Base | 1/24000 |
Codec Tag | 0x0000 |
Sample Fmt | fltp |
Sample Rate | 24000 |
Channels | 1 |
Channel Layout | mono |
Bits Per Sample | 0 |
R Frame Rate | 0/0 |
Avg Frame Rate | 0/0 |
Time Base | 1/14112000 |
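As a rough illustration of what the bitrate and sample rate in the table imply, the sketch below estimates a file's duration from its size. It assumes a constant bitrate and ignores ID3 tag and header overhead, so treat the result as an approximation. The function names are hypothetical and not part of the repository.

```python
# Values taken from the table above.
BITRATE_BPS = 32_000       # "Audio Bitrate | 32 kbps"
SAMPLE_RATE_HZ = 24_000    # "Sample Rate | 24000"
SAMPLES_PER_FRAME = 576    # MPEG-2 Layer III frames carry 576 samples


def estimate_duration_seconds(file_size_bytes: int,
                              bitrate_bps: int = BITRATE_BPS) -> float:
    """Approximate duration, assuming constant bitrate and no tag overhead."""
    return file_size_bytes * 8 / bitrate_bps


def frame_duration_ms(sample_rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Duration of one MPEG-2 Layer III frame in milliseconds."""
    return 1000 * SAMPLES_PER_FRAME / sample_rate_hz
```

At 32 kbps a file holds about 4,000 bytes of audio per second, so a 40,000-byte file is roughly ten seconds long.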
Language codes are required to be two letters, normally the language's two-letter ISO code; see ISO 639-1.
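A minimal sketch of how a directory name could be checked against this rule. It assumes that a two-letter region suffix after a hyphen (as in `en-au` or `zh-cn` in the directory listing) is also acceptable. It checks the shape of the name only, not whether the code is actually registered in ISO 639-1, and `is_valid_language_dir` is a hypothetical helper, not part of the repository.

```python
import re

# Assumption: two lowercase letters, optionally followed by a hyphen
# and a two-letter lowercase region suffix (e.g. "en", "en-au").
LANG_DIR_PATTERN = re.compile(r"[a-z]{2}(-[a-z]{2})?")


def is_valid_language_dir(name: str) -> bool:
    """Return True if the name has the expected language-code shape."""
    return LANG_DIR_PATTERN.fullmatch(name) is not None
```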
Scripts & Utilities
Filename | Location | Description | Type |
---|---|---|---|
record.py | acquire\custom | Records audio in WAV format (default: 3 sec) | Helper script |
wingen.py | acquire\generate | TTS conversion using SAPI.SpVoice | Helper script |
gTTSgen.py | acquire\generate | TTS conversion using gTTS & abuse 0.1.1 | Helper script |
gspectogram.py | utils | Generates spectrogram of a wav file | Utility tool |
Structure
.
├───af
├───ar
├───bn
├───bs
├───ca
├───cs
├───cy
├───da
├───de
├───el
├───en
│ ├───1 (340 wav files)
│ └───2
├───en-au
├───en-ca
├───en-gb
├───en-gh
├───en-ie
├───en-in
├───en-ng
├───en-nz
├───en-ph
├───en-tz
├───en-uk
├───en-us
├───en-za
├───eo
├───es
├───es-es
├───es-us
├───et
├───fi
├───fr
├───fr-ca
├───fr-fr
├───hi
├───hr
├───hu
├───hy
├───id
├───is
├───it
├───ja
├───jw
├───km
├───ko
├───la
├───lv
├───mk
├───ml
├───mr
├───my
├───ne
├───nl
├───no
├───pl
├───pt
├───pt-br
├───pt-pt
├───ro
├───ru
├───si
├───sk
├───sq
├───sr
├───su
├───sv
├───sw
├───ta
├───te
├───th
├───tl
├───tr
├───uk
├───vi
├───zh-cn
└───zh-tw
Most of these audio classes have 347 MP3 files of ~5.783 minutes each.
MP3 was long encumbered by patents, but according to Wikipedia: "If the longest-running patent mentioned in the aforementioned references is taken as a measure, then the MP3 technology became patent-free in the United States on 16 April 2017 when U.S. Patent 6,009,399, held by and administered by Technicolor, expired."
Checking files
find audio/ -type f | wc -l
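The command above gives a single total. For a per-language breakdown, a Python sketch along the following lines may help. It assumes the language directories sit directly under the root passed in (`audio` here, matching the command), and `count_files_per_language` is a hypothetical helper, not part of the repository.

```python
from collections import Counter
from pathlib import Path


def count_files_per_language(root: str) -> Counter:
    """Count files under each top-level directory of root, recursively."""
    counts = Counter()
    root_path = Path(root)
    for path in root_path.rglob("*"):
        if not path.is_file():
            continue
        parts = path.relative_to(root_path).parts
        # Files directly under root belong to no language directory
        # and are skipped, so the total can be lower than `find` reports.
        if len(parts) > 1:
            counts[parts[0]] += 1
    return counts


# Example usage:
#   counts = count_files_per_language("audio")
#   for lang, n in sorted(counts.items()):
#       print(lang, n)
#   print("total", sum(counts.values()))
```

Nested folders such as `en/1` and `en/2` are counted towards their parent language directory.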
Made with TAPAD
Did you use or see TAPAD in a paper, project or app? Add it here!
- The Abuse Project
- (...)
Maintainers
The dataset is regularly updated and maintained by:
- Piyush Raj (@0x48piraj)
Useful Resources
The textual data was collected from several sources, all of which are listed below:
- Offensive/Profane Word List from Luis von Ahn's Research Group at Carnegie Mellon University
- The Alphabet Of Swearing
LICENSE
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
To view a copy of this license, visit NC-SA 4.0 or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.