The Abuse Project Audio Dataset (TAPAD)

World's largest profanity audio dataset

Dataset consists of 26,365 audio files
Click here for documentation

See The Abuse Project

TAPAD (∿) is an open dataset, meaning it will grow over time as more data is contributed. To enable reproducibility and accurate citation, the dataset is versioned using git tags.

Current Status & ID3

Property	Value
Total files	`26,365`
Dataset updated	`July 30, 2019`
Language classes	`75`
File Type	MP3
Mime Type	audio/mpeg
Mpeg Audio Version	2
Audio Layer	3
Audio Bitrate	32 kbps
Sample Rate	24000
Channel Mode	Single Channel
Ms Stereo	Off
Intensity Stereo	Off
Codec Type	audio
Codec Time Base	1/24000
Codec Tag	0x0000
Sample Fmt	fltp
Sample Rate	24000
Channels	1
Channel Layout	mono
Bits Per Sample	0
R Frame Rate	0/0
Avg Frame Rate	0/0
Time Base	1/14112000
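The audio properties listed above (MPEG version 2, Layer 3, 32 kbps, 24000 Hz, single channel) are all encoded in the 4-byte header that starts every MP3 frame. The following is a minimal sketch of decoding such a header with the standard library only. The function name `parse_mpeg2_layer3_header` and the returned dictionary keys are illustrative assumptions, not part of TAPAD's scripts, and the sketch deliberately rejects anything other than MPEG-2 Layer III.

```python
# Lookup tables for MPEG-2 Layer III only (the format listed in the table above).
MPEG2_L3_BITRATES_KBPS = [None, 8, 16, 24, 32, 40, 48, 56,
                          64, 80, 96, 112, 128, 144, 160, None]
MPEG2_SAMPLE_RATES_HZ = [22050, 24000, 16000, None]
CHANNEL_MODES = ["stereo", "joint stereo", "dual channel", "mono"]


def parse_mpeg2_layer3_header(header: bytes) -> dict:
    """Decode a 4-byte MPEG audio frame header (hypothetical helper).

    Raises ValueError for anything that is not an MPEG-2 Layer III frame.
    """
    if len(header) != 4:
        raise ValueError("a frame header is exactly 4 bytes")
    word = int.from_bytes(header, "big")

    if word >> 21 != 0x7FF:               # 11 frame-sync bits, all set
        raise ValueError("missing frame sync")
    version_bits = (word >> 19) & 0b11    # 0b10 means MPEG version 2
    layer_bits = (word >> 17) & 0b11      # 0b01 means Layer III
    if version_bits != 0b10 or layer_bits != 0b01:
        raise ValueError("not an MPEG-2 Layer III frame")

    bitrate = MPEG2_L3_BITRATES_KBPS[(word >> 12) & 0b1111]
    sample_rate = MPEG2_SAMPLE_RATES_HZ[(word >> 10) & 0b11]
    if bitrate is None or sample_rate is None:
        raise ValueError("free-format or reserved bitrate/sample-rate index")

    return {
        "bitrate_kbps": bitrate,
        "sample_rate_hz": sample_rate,
        "channel_mode": CHANNEL_MODES[(word >> 6) & 0b11],
    }
```

A real file may begin with an ID3v2 tag before the first frame, so a caller would first need to locate the frame sync; that step is left out of this sketch.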

Language class names are required to be two letters long, normally the language's two-letter ISO code (see ISO_639-1).
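A naming rule like this can be checked mechanically before new data is contributed. The sketch below is one possible check, not part of TAPAD's tooling: the name `is_valid_class_name` is hypothetical, and the optional two-letter region suffix is an assumption inferred from directory names such as `en-au` and `zh-cn` in the Structure listing rather than something the rule itself states. It only checks the shape of the name, not whether the code is actually assigned in ISO 639-1.

```python
import re

# Two lowercase letters, optionally followed by a hyphen and a two-letter
# region code. The suffix form is an assumption based on names like `en-au`.
CLASS_NAME = re.compile(r"[a-z]{2}(-[a-z]{2})?")


def is_valid_class_name(name: str) -> bool:
    """Return True if `name` has the shape of a language class directory name."""
    return CLASS_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["en", "zh-cn", "eng", "EN"]:
        print(candidate, is_valid_class_name(candidate))
```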

Scripts & Utilities

Filename	Location	Description	Type
`record.py`	`acquire\custom`	Records audio in WAV format (default: 3 sec)	Helper script
`wingen.py`	`acquire\generate`	TTS conversion using `SAPI.SpVoice`	Helper script
`gTTSgen.py`	`acquire\generate`	TTS conversion using gTTS & `abuse 0.1.1`	Helper script
`gspectogram.py`	`utils`	Generates spectrogram of a wav file	Utility tool

Structure

.
├───af
├───ar
├───bn
├───bs
├───ca
├───cs
├───cy
├───da
├───de
├───el
├───en
│   ├───1 (340 wav files)
│   └───2
├───en-au
├───en-ca
├───en-gb
├───en-gh
├───en-ie
├───en-in
├───en-ng
├───en-nz
├───en-ph
├───en-tz
├───en-uk
├───en-us
├───en-za
├───eo
├───es
├───es-es
├───es-us
├───et
├───fi
├───fr
├───fr-ca
├───fr-fr
├───hi
├───hr
├───hu
├───hy
├───id
├───is
├───it
├───ja
├───jw
├───km
├───ko
├───la
├───lv
├───mk
├───ml
├───mr
├───my
├───ne
├───nl
├───no
├───pl
├───pt
├───pt-br
├───pt-pt
├───ro
├───ru
├───si
├───sk
├───sq
├───sr
├───su
├───sv
├───sw
├───ta
├───te
├───th
├───tl
├───tr
├───uk
├───vi
├───zh-cn
└───zh-tw

Most of these audio classes have 347 MP3 files of ~5.783 minutes each. MP3 was long encumbered by patents, but according to Wikipedia, "If the longest-running patent mentioned in the aforementioned references is taken as a measure, then the MP3 technology became patent-free in the United States on 16 April 2017 when U.S. Patent 6,009,399, held by and administered by Technicolor, expired".

Checking files

find audio/ -type f | wc -l
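The command above gives a single total. To see how that total is distributed across language classes, a per-class count can be sketched in Python. This is an illustrative helper, not one of the repository's scripts: the function name `count_files_per_class` is hypothetical, and it assumes each top-level directory under the root is one language class, as in the Structure listing.

```python
from collections import Counter
from pathlib import Path


def count_files_per_class(root: str) -> Counter:
    """Count regular files under each top-level directory of `root`."""
    root_path = Path(root)
    counts = Counter()
    for path in root_path.rglob("*"):
        if not path.is_file():
            continue
        parts = path.relative_to(root_path).parts
        if len(parts) < 2:
            continue  # skip files sitting directly in the root
        counts[parts[0]] += 1  # first component is the language class
    return counts


if __name__ == "__main__":
    counts = count_files_per_class("audio")
    for language, n in sorted(counts.items()):
        print(f"{language}\t{n}")
    print(f"total\t{sum(counts.values())}")
```

Files in nested subdirectories (such as `en/1`) are counted toward their top-level class, so the printed total should match the `find` output as long as no files sit directly in the root.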

Made with TAPAD

Have you used or seen TAPAD in a paper, project, or app? Add it here!

The Abuse Project
(...)

Maintainers

The dataset is regularly updated and maintained by:

Piyush Raj (@0x48piraj)

Useful Resources

The textual data was collected from several sources, all of which are listed below:

LICENSE

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

To view a copy of this license, visit NC-SA 4.0 or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.
