Audio Classification, Tagging & Sound Event Detection in PyTorch
Easy-to-use audio classification, tagging, and sound event detection in PyTorch.
Progress:
- [x] Fine-tune on audio classification
- [ ] Fine-tune on audio tagging
- [ ] Fine-tune on sound event detection
- [x] Add tagging metrics
- [ ] Add Tutorial
- [x] Add Augmentation Notebook
- [ ] Add more schedulers
- [ ] Add FSDKaggle2019 dataset
- [ ] Add MTT dataset
- [ ] Add DESED
- [ ] Test in real-time
Model Zoo
AudioSet Pretrained Models
| Model | Task | mAP (%) | Sample Rate (kHz) | Window Length (samples) | Num Mels | Fmax (Hz) | Weights | 
|---|---|---|---|---|---|---|---|
| CNN14 | Tagging | 43.1 | 32 | 1024 | 64 | 14k | download | 
| CNN14_16k | Tagging | 43.8 | 16 | 512 | 64 | 8k | download | 
| CNN14_DecisionLevelMax | SED | 38.5 | 32 | 1024 | 64 | 14k | download | 
Note: These models are used as pretrained backbones in the fine-tuning tasks below. Check out audioset-tagging-cnn if you want to train on the AudioSet dataset.
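As a minimal sketch of how such a checkpoint might be loaded for fine-tuning (the import path, constructor signature, and `fc_audioset` head attribute are assumptions borrowed from the PANNs codebase, not this repo's confirmed API):

```python
# Hedged sketch: load an AudioSet-pretrained CNN14 and swap its head.
# `models`, `CNN14(num_classes=...)`, and `fc_audioset` are assumptions.
import torch
from models import CNN14  # assumed import path

model = CNN14(num_classes=527)                  # AudioSet has 527 classes
ckpt = torch.load("cnn14.pth", map_location="cpu")
model.load_state_dict(ckpt.get("model", ckpt))  # PANNs checkpoints wrap the state dict

# Replace the classification head for a downstream dataset, e.g. ESC-50.
model.fc_audioset = torch.nn.Linear(model.fc_audioset.in_features, 50)
```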
Fine-tuned Classification Models
| Model | Dataset | Accuracy (%) | Sample Rate (kHz) | Weights | 
|---|---|---|---|---|
| CNN14 | ESC50 (Fold-5) | 95.75 | 32 | download | 
| CNN14 | FSDKaggle2018 (test) | 93.56 | 32 | download | 
| CNN14 | SpeechCommandsv1 (val/test) | 96.60/96.77 | 32 | download | 
Fine-tuned Tagging Models
| Model | Dataset | mAP(%) | AUC | d-prime | Sample Rate (kHz) | Config | Weights | 
|---|---|---|---|---|---|---|---|
| CNN14 | FSDKaggle2019 | - | - | - | 32 | - | - | 
Fine-tuned SED Models
| Model | Dataset | F1 | Sample Rate (kHz) | Config | Weights | 
|---|---|---|---|---|---|
| CNN14_DecisionLevelMax | DESED | - | 32 | - | - | 
Supported Datasets
| Dataset | Task | Classes | Train | Val | Test | Audio Length | Audio Spec | Size | 
|---|---|---|---|---|---|---|---|---|
| ESC-50 | Classification | 50 | 2,000 | 5 folds | - | 5s | 44.1kHz, mono | 600MB | 
| UrbanSound8k | Classification | 10 | 8,732 | 10 folds | - | <=4s | Varies | 5.6GB | 
| FSDKaggle2018 | Classification | 41 | 9,473 | - | 1,600 | 300ms~30s | 44.1kHz, mono | 4.6GB | 
| SpeechCommandsv1 | Classification | 30 | 51,088 | 6,798 | 6,835 | <=1s | 16kHz, mono | 1.4GB | 
| SpeechCommandsv2 | Classification | 35 | 84,843 | 9,981 | 11,005 | <=1s | 16kHz, mono | 2.3GB | 
| FSDKaggle2019* | Tagging | 80 | 4,970+19,815 | - | 4,481 | 300ms~30s | 44.1kHz, mono | 24GB | 
| MTT* | Tagging | 50 | 19,000 | - | - | - | - | 3GB | 
| DESED* | SED | 10 | - | - | - | 10s | - | - | 
Notes:
Datasets marked with * are not available yet. Classification datasets are treated as multi-class/single-label classification, while tagging and SED datasets are treated as multi-label classification (see the sketch below).
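For clarity, here is a minimal PyTorch sketch of the two target/loss conventions (class counts and indices are illustrative):

```python
# Single-label vs multi-label targets and their usual losses.
import torch
import torch.nn.functional as F

logits = torch.randn(4, 50)  # batch of 4 clips, 50 classes

# Classification (multi-class / single-label): one integer class per clip,
# trained with cross-entropy over a softmax.
class_ids = torch.tensor([3, 17, 42, 0])
ce_loss = F.cross_entropy(logits, class_ids)

# Tagging / SED (multi-label): a multi-hot vector per clip, trained with
# binary cross-entropy over per-class sigmoids.
multi_hot = torch.zeros(4, 50)
multi_hot[0, [3, 7]] = 1.0  # clip 0 contains two sound classes at once
bce_loss = F.binary_cross_entropy_with_logits(logits, multi_hot)
```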
Dataset Structure (click to expand)
Download the datasets and arrange them in the following structure.
datasets
|__ ESC50
    |__ audio
|__ Urbansound8k
    |__ audio
|__ FSDKaggle2018
    |__ audio_train
    |__ audio_test
    |__ FSDKaggle2018.meta
        |__ train_post_competition.csv
        |__ test_post_competition_scoring_clips.csv
|__ SpeechCommandsv1/v2
    |__ bed
    |__ bird
    |__ ...
    |__ testing_list.txt
    |__ validation_list.txt
Augmentations (click to expand)
Currently, the following augmentations are supported; more will be added in the future. You can test the effects of augmentations with this notebook. A minimal sketch of applying them follows the lists below.
Waveform Augmentations:
- [x] MixUp
- [x] Background Noise
- [x] Gaussian Noise
- [x] Fade In/Out
- [x] Volume
- [ ] CutMix
Spectrogram Augmentations:
- [x] Time Masking
- [x] Frequency Masking
- [x] Filter Augmentation
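Below is a minimal sketch of a few of these augmentations using plain torchaudio transforms; the repo's own augmentation classes may differ:

```python
# Hedged sketch of waveform and spectrogram augmentations with torchaudio.
import torch
import torchaudio.transforms as T

waveform = torch.randn(1, 32000)  # 1 s of fake audio at 32 kHz

# Waveform augmentations
waveform = T.Fade(fade_in_len=1600, fade_out_len=1600)(waveform)  # fade in/out
waveform = T.Vol(gain=0.8, gain_type="amplitude")(waveform)       # volume
waveform = waveform + 0.005 * torch.randn_like(waveform)          # Gaussian noise

# Spectrogram augmentations (SpecAugment-style masking)
mel = T.MelSpectrogram(sample_rate=32000, n_fft=1024, n_mels=64)(waveform)
mel = T.FrequencyMasking(freq_mask_param=8)(mel)
mel = T.TimeMasking(time_mask_param=20)(mel)
```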
Usage
Requirements (click to expand)
- python >= 3.6
- torch >= 1.8.1
- torchaudio >= 0.8.1
Other requirements can be installed with pip install -r requirements.txt.
Configuration (click to expand)
- Create a configuration file in configs. A sample configuration for the ESC50 dataset can be found here.
- Copy its contents and edit the fields as needed; a hedged sketch is shown below.
- This configuration file is required by all of the training, evaluation, and prediction scripts.
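As an illustration only, a minimal config might look like the following; only keys mentioned in this README are shown, and the actual schema of the sample file may differ:

```yaml
# Hypothetical minimal config; check the sample file in configs/ for the real schema.
DDP: false                          # set true for multi-GPU training
MODEL_PATH: output/cnn14_esc50.pth  # trained weights, used by val/infer scripts

DATASET:
  NAME: ESC50

TEST:
  FILE: assets/example.wav          # audio file used for inference
  PLOT: true                        # show the SED plot during inference (placement assumed)
```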
Training (click to expand)
To train with a single GPU:
$ python tools/train.py --cfg configs/CONFIG_FILE_NAME.yaml
To train with multiple GPUs, set the DDP field in the config file to true and run as follows:
$ python -m torch.distributed.launch --nproc_per_node=2 --use_env tools/train.py --cfg configs/CONFIG_FILE_NAME.yaml
Evaluation (click to expand)
Make sure to set MODEL_PATH of the configuration file to your trained model directory.
$ python tools/val.py --cfg configs/CONFIG_FILE.yaml
Audio Classification/Tagging Inference
- Set MODEL_PATH of the configuration file to your model's trained weights.
- Change the dataset name in DATASET >> NAME to your trained model's dataset.
- Set the testing audio file path in TEST >> FILE.
- Run the following command.
$ python tools/infer.py --cfg configs/CONFIG_FILE.yaml
## for example
$ python tools/infer.py --cfg configs/audioset.yaml
You will get an output similar to this:
Class                     Confidence
----------------------  ------------
Speech                     0.897762
Telephone bell ringing     0.752206
Telephone                  0.219329
Inside, small room         0.20761
Music                      0.0770325
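The confidences are per-class sigmoid scores (they need not sum to 1). A minimal sketch of producing such a table from raw model output, with a hypothetical label mapping and `tabulate` for printing:

```python
# Sketch: logits -> sigmoid confidences -> top-5 table. The label list is
# illustrative; real code maps class ids to AudioSet class names.
import torch
from tabulate import tabulate

logits = torch.randn(527)            # one clip, 527 AudioSet classes
scores = logits.sigmoid()            # multi-label confidences
conf, idx = scores.topk(5)

labels = [f"class_{i}" for i in idx.tolist()]  # hypothetical id-to-name mapping
print(tabulate(zip(labels, conf.tolist()), headers=["Class", "Confidence"]))
```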
Sound Event Detection Inference
- Set MODEL_PATH of the configuration file to your model's trained weights.
- Change the dataset name in DATASET >> NAME to your trained model's dataset.
- Set the testing audio file path in TEST >> FILE.
- Run the following command.
$ python tools/sed_infer.py --cfg configs/CONFIG_FILE.yaml
## for example
$ python tools/sed_infer.py --cfg configs/audioset_sed.yaml
You will get an output similar to this:
Class                     Start    End
----------------------  -------  -----
Speech                      2.2    7
Telephone bell ringing      0      2.5
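The start/end times come from the model's framewise predictions. A minimal sketch of turning framewise probabilities for one class into events by thresholding and merging contiguous frames (the threshold and frame hop are illustrative):

```python
# Sketch: framewise sigmoid scores -> (start, end) events in seconds.
import torch

def framewise_to_events(probs, threshold=0.5, hop_seconds=0.01):
    """probs: (num_frames,) sigmoid scores for a single class."""
    active = (probs > threshold).tolist()
    events, start = [], None
    for i, on in enumerate(active + [False]):  # sentinel closes a final run
        if on and start is None:
            start = i                          # event begins
        elif not on and start is not None:
            events.append((start * hop_seconds, i * hop_seconds))
            start = None                       # event ends
    return events

probs = torch.zeros(1000)
probs[220:700] = 0.9
print(framewise_to_events(probs))  # approximately [(2.2, 7.0)]
```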
If you set PLOT to true, the following plot will also be shown:

References (click to expand)
- https://github.com/qiuqiangkong/audioset_tagging_cnn
- https://github.com/YuanGongND/ast
- https://github.com/frednam93/FilterAugSED
- https://github.com/lRomul/argus-freesound
Citations (click to expand)
@misc{kong2020panns,
      title={PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition}, 
      author={Qiuqiang Kong and Yin Cao and Turab Iqbal and Yuxuan Wang and Wenwu Wang and Mark D. Plumbley},
      year={2020},
      eprint={1912.10211},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}
@misc{gong2021ast,
      title={AST: Audio Spectrogram Transformer}, 
      author={Yuan Gong and Yu-An Chung and James Glass},
      year={2021},
      eprint={2104.01778},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}
@misc{nam2021heavily,
      title={Heavily Augmented Sound Event Detection utilizing Weak Predictions}, 
      author={Hyeonuk Nam and Byeong-Yun Ko and Gyeong-Tae Lee and Seong-Hu Kim and Won-Ho Jung and Sang-Min Choi and Yong-Hwa Park},
      year={2021},
      eprint={2107.03649},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}