Audio Classification, Tagging & Sound Event Detection in PyTorch
Easy-to-use audio classification, tagging, and sound event detection in PyTorch.
Progress:
- [x] Fine-tune on audio classification
- [ ] Fine-tune on audio tagging
- [ ] Fine-tune on sound event detection
- [x] Add tagging metrics
- [ ] Add Tutorial
- [x] Add Augmentation Notebook
- [ ] Add more schedulers
- [ ] Add FSDKaggle2019 dataset
- [ ] Add MTT dataset
- [ ] Add DESED
- [ ] Test in real-time
Model Zoo
AudioSet Pretrained Models
| Model | Task | mAP (%) | Sample Rate (kHz) | Window Length (samples) | Num Mels | Fmax (Hz) | Weights | 
|---|---|---|---|---|---|---|---|
| CNN14 | Tagging | 43.1 | 32 | 1024 | 64 | 14k | download | 
| CNN14_16k | Tagging | 43.8 | 16 | 512 | 64 | 8k | download | 
| CNN14_DecisionLevelMax | SED | 38.5 | 32 | 1024 | 64 | 14k | download | 
Note: These models are used as pretrained backbones in the fine-tuning tasks below. Check out audioset-tagging-cnn if you want to train on the AudioSet dataset.
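As a minimal sketch of how such a checkpoint might be loaded for fine-tuning (the import path, constructor signature, and `fc_audioset` head attribute are assumptions borrowed from the PANNs codebase, not this repo's confirmed API):

```python
# Hedged sketch: load an AudioSet-pretrained CNN14 and swap its head.
# `models`, `CNN14(num_classes=...)`, and `fc_audioset` are assumptions.
import torch
from models import CNN14  # assumed import path

model = CNN14(num_classes=527)                  # AudioSet has 527 classes
ckpt = torch.load("cnn14.pth", map_location="cpu")
model.load_state_dict(ckpt.get("model", ckpt))  # PANNs checkpoints wrap the state dict

# Replace the classification head for a downstream dataset, e.g. ESC-50.
model.fc_audioset = torch.nn.Linear(model.fc_audioset.in_features, 50)
```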
Fine-tuned Classification Models
| Model | Dataset | Accuracy (%) | Sample Rate (kHz) | Weights | 
|---|---|---|---|---|
| CNN14 | ESC50 (Fold-5) | 95.75 | 32 | download | 
| CNN14 | FSDKaggle2018 (test) | 93.56 | 32 | download | 
| CNN14 | SpeechCommandsv1 (val/test) | 96.60/96.77 | 32 | download | 
Fine-tuned Tagging Models
| Model | Dataset | mAP(%) | AUC | d-prime | Sample Rate (kHz) | Config | Weights | 
|---|---|---|---|---|---|---|---|
| CNN14 | FSDKaggle2019 | - | - | - | 32 | - | - | 
Fine-tuned SED Models
| Model | Dataset | F1 | Sample Rate (kHz) | Config | Weights | 
|---|---|---|---|---|---|
| CNN14_DecisionLevelMax | DESED | - | 32 | - | - | 
Supported Datasets
| Dataset | Task | Classes | Train | Val | Test | Audio Length | Audio Spec | Size | 
|---|---|---|---|---|---|---|---|---|
| ESC-50 | Classification | 50 | 2,000 | 5 folds | - | 5s | 44.1kHz, mono | 600MB | 
| UrbanSound8k | Classification | 10 | 8,732 | 10 folds | - | <=4s | Varies | 5.6GB | 
| FSDKaggle2018 | Classification | 41 | 9,473 | - | 1,600 | 300ms~30s | 44.1kHz, mono | 4.6GB | 
| SpeechCommandsv1 | Classification | 30 | 51,088 | 6,798 | 6,835 | <=1s | 16kHz, mono | 1.4GB | 
| SpeechCommandsv2 | Classification | 35 | 84,843 | 9,981 | 11,005 | <=1s | 16kHz, mono | 2.3GB | 
| FSDKaggle2019* | Tagging | 80 | 4,970+19,815 | - | 4,481 | 300ms~30s | 44.1kHz, mono | 24GB | 
| MTT* | Tagging | 50 | 19,000 | - | - | - | - | 3GB | 
| DESED* | SED | 10 | - | - | - | 10s | - | - | 
Notes:
Datasets marked with * are not available yet. Classification datasets are treated as multi-class/single-label classification, while tagging and SED datasets are treated as multi-label classification (see the sketch below).
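For clarity, here is a minimal PyTorch sketch of the two target/loss conventions (class counts and indices are illustrative):

```python
# Single-label vs multi-label targets and their usual losses.
import torch
import torch.nn.functional as F

logits = torch.randn(4, 50)  # batch of 4 clips, 50 classes

# Classification (multi-class / single-label): one integer class per clip,
# trained with cross-entropy over a softmax.
class_ids = torch.tensor([3, 17, 42, 0])
ce_loss = F.cross_entropy(logits, class_ids)

# Tagging / SED (multi-label): a multi-hot vector per clip, trained with
# binary cross-entropy over per-class sigmoids.
multi_hot = torch.zeros(4, 50)
multi_hot[0, [3, 7]] = 1.0  # clip 0 contains two sound classes at once
bce_loss = F.binary_cross_entropy_with_logits(logits, multi_hot)
```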
Dataset Structure (click to expand)
Download the datasets and arrange them in the following structure.
datasets
|__ ESC50
    |__ audio
|__ Urbansound8k
    |__ audio
|__ FSDKaggle2018
    |__ audio_train
    |__ audio_test
    |__ FSDKaggle2018.meta
        |__ train_post_competition.csv
        |__ test_post_competition_scoring_clips.csv
|__ SpeechCommandsv1/v2
    |__ bed
    |__ bird
    |__ ...
    |__ testing_list.txt
    |__ validation_list.txt
Augmentations (click to expand)
Currently, the following augmentations are supported; more will be added in the future. You can test the effects of augmentations with this notebook. A minimal sketch of applying them follows the lists below.
Waveform Augmentations:
- [x] MixUp
- [x] Background Noise
- [x] Gaussian Noise
- [x] Fade In/Out
- [x] Volume
- [ ] CutMix
Spectrogram Augmentations:
- [x] Time Masking
- [x] Frequency Masking
- [x] Filter Augmentation
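Below is a minimal sketch of a few of these augmentations using plain torchaudio transforms; the repo's own augmentation classes may differ:

```python
# Hedged sketch of waveform and spectrogram augmentations with torchaudio.
import torch
import torchaudio.transforms as T

waveform = torch.randn(1, 32000)  # 1 s of fake audio at 32 kHz

# Waveform augmentations
waveform = T.Fade(fade_in_len=1600, fade_out_len=1600)(waveform)  # fade in/out
waveform = T.Vol(gain=0.8, gain_type="amplitude")(waveform)       # volume
waveform = waveform + 0.005 * torch.randn_like(waveform)          # Gaussian noise

# Spectrogram augmentations (SpecAugment-style masking)
mel = T.MelSpectrogram(sample_rate=32000, n_fft=1024, n_mels=64)(waveform)
mel = T.FrequencyMasking(freq_mask_param=8)(mel)
mel = T.TimeMasking(time_mask_param=20)(mel)
```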
Usage
Requirements (click to expand)
- python >= 3.6
- torch >= 1.8.1
- torchaudio >= 0.8.1
Other requirements can be installed with pip install -r requirements.txt.
Configuration (click to expand)
- Create a configuration file in configs. A sample configuration for the ESC50 dataset can be found here.
- Copy its contents and edit the fields as needed; a hedged sketch is shown below.
- This configuration file is required by all of the training, evaluation, and prediction scripts.
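As an illustration only, a minimal config might look like the following; only keys mentioned in this README are shown, and the actual schema of the sample file may differ:

```yaml
# Hypothetical minimal config; check the sample file in configs/ for the real schema.
DDP: false                          # set true for multi-GPU training
MODEL_PATH: output/cnn14_esc50.pth  # trained weights, used by val/infer scripts

DATASET:
  NAME: ESC50

TEST:
  FILE: assets/example.wav          # audio file used for inference
  PLOT: true                        # show the SED plot during inference (placement assumed)
```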
Training (click to expand)
To train with a single GPU:
$ python tools/train.py --cfg configs/CONFIG_FILE_NAME.yaml
To train with multiple GPUs, set the DDP field in the config file to true and run as follows:
$ python -m torch.distributed.launch --nproc_per_node=2 --use_env tools/train.py --cfg configs/CONFIG_FILE_NAME.yaml
Evaluation (click to expand)
Make sure to set MODEL_PATH of the configuration file to your trained model directory.
$ python tools/val.py --cfg configs/CONFIG_FILE.yaml
Audio Classification/Tagging Inference
- Set MODEL_PATH of the configuration file to your model's trained weights.
- Change the dataset name in DATASET >> NAME to your trained model's dataset.
- Set the testing audio file path in TEST >> FILE.
- Run the following command.
$ python tools/infer.py --cfg configs/CONFIG_FILE.yaml
## for example
$ python tools/infer.py --cfg configs/audioset.yaml
You will get an output similar to this:
Class                     Confidence
----------------------  ------------
Speech                     0.897762
Telephone bell ringing     0.752206
Telephone                  0.219329
Inside, small room         0.20761
Music                      0.0770325
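The confidences are per-class sigmoid scores (they need not sum to 1). A minimal sketch of producing such a table from raw model output, with a hypothetical label mapping and `tabulate` for printing:

```python
# Sketch: logits -> sigmoid confidences -> top-5 table. The label list is
# illustrative; real code maps class ids to AudioSet class names.
import torch
from tabulate import tabulate

logits = torch.randn(527)            # one clip, 527 AudioSet classes
scores = logits.sigmoid()            # multi-label confidences
conf, idx = scores.topk(5)

labels = [f"class_{i}" for i in idx.tolist()]  # hypothetical id-to-name mapping
print(tabulate(zip(labels, conf.tolist()), headers=["Class", "Confidence"]))
```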
Sound Event Detection Inference
- Set MODEL_PATH of the configuration file to your model's trained weights.
- Change the dataset name in DATASET >> NAME to your trained model's dataset.
- Set the testing audio file path in TEST >> FILE.
- Run the following command.
$ python tools/sed_infer.py --cfg configs/CONFIG_FILE.yaml
## for example
$ python tools/sed_infer.py --cfg configs/audioset_sed.yaml
You will get an output similar to this:
Class                     Start    End
----------------------  -------  -----
Speech                      2.2    7
Telephone bell ringing      0      2.5
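The start/end times come from the model's framewise predictions. A minimal sketch of turning framewise probabilities for one class into events by thresholding and merging contiguous frames (the threshold and frame hop are illustrative):

```python
# Sketch: framewise sigmoid scores -> (start, end) events in seconds.
import torch

def framewise_to_events(probs, threshold=0.5, hop_seconds=0.01):
    """probs: (num_frames,) sigmoid scores for a single class."""
    active = (probs > threshold).tolist()
    events, start = [], None
    for i, on in enumerate(active + [False]):  # sentinel closes a final run
        if on and start is None:
            start = i                          # event begins
        elif not on and start is not None:
            events.append((start * hop_seconds, i * hop_seconds))
            start = None                       # event ends
    return events

probs = torch.zeros(1000)
probs[220:700] = 0.9
print(framewise_to_events(probs))  # approximately [(2.2, 7.0)]
```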
If you set PLOT to true, the following plot will also be shown:

References (click to expand)
- https://github.com/qiuqiangkong/audioset_tagging_cnn
- https://github.com/YuanGongND/ast
- https://github.com/frednam93/FilterAugSED
- https://github.com/lRomul/argus-freesound
Citations (click to expand)
@misc{kong2020panns,
      title={PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition}, 
      author={Qiuqiang Kong and Yin Cao and Turab Iqbal and Yuxuan Wang and Wenwu Wang and Mark D. Plumbley},
      year={2020},
      eprint={1912.10211},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}
@misc{gong2021ast,
      title={AST: Audio Spectrogram Transformer}, 
      author={Yuan Gong and Yu-An Chung and James Glass},
      year={2021},
      eprint={2104.01778},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}
@misc{nam2021heavily,
      title={Heavily Augmented Sound Event Detection utilizing Weak Predictions}, 
      author={Hyeonuk Nam and Byeong-Yun Ko and Gyeong-Tae Lee and Seong-Hu Kim and Won-Ho Jung and Sang-Min Choi and Yong-Hwa Park},
      year={2021},
      eprint={2107.03649},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}