SampleCNNs for Audio Classification
This repository contains the code used for the publication below:
Taejun Kim, Jongpil Lee, and Juhan Nam, "Comparison and Analysis of SampleCNN Architectures for Audio Classification" in IEEE Journal of Selected Topics in Signal Processing (JSTSP), 2019.
Contents:
- Dependency Installation
- Building Datasets
- Music auto-tagging: MagnaTagATune
- Keyword spotting: Speech Commands
- Acoustic scene tagging: DCASE 2017 Task 4
- Training a SampleCNN
Dependency Installation
NOTE: The code in this repository was written and tested with Python 3.6.
- tensorflow 1.10.x (1.10.x is strongly recommended for version compatibility)
- librosa
- ffmpeg
- pandas
- numpy
- scikit-learn
- h5py
To install the required Python packages with conda, run the commands below:
conda install tensorflow-gpu=1.10.0 ffmpeg pandas numpy scikit-learn h5py
conda install -c conda-forge librosa
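Once installed, a quick sanity check can confirm that every package imports correctly. This is a minimal stdlib-only sketch, not part of this repository:

```python
import importlib.util

# Import names of the required packages (scikit-learn imports as `sklearn`).
REQUIRED = ["tensorflow", "librosa", "pandas", "numpy", "sklearn", "h5py"]

def find_missing(packages):
    """Return the subset of `packages` that cannot be imported."""
    return [p for p in packages if importlib.util.find_spec(p) is None]

if __name__ == "__main__":
    missing = find_missing(REQUIRED)
    print("Missing packages:", ", ".join(missing) if missing else "none")
```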
Building Datasets
Download and preprocess a dataset that you want to train a model on.
Music auto-tagging: MagnaTagATune
Edith Law, Kris West, Michael Mandel, Mert Bay and J. Stephen Downie (2009). Evaluation of algorithms using games: the case of music annotation. In Proceedings of the 10th International Conference on Music Information Retrieval (ISMIR).
Create a directory for the dataset and download the required .csv file and three .zip files into data/mtt/raw:
mkdir -p data/mtt/raw
cd data/mtt/raw
wget http://mi.soi.city.ac.uk/datasets/magnatagatune/annotations_final.csv
wget http://mi.soi.city.ac.uk/datasets/magnatagatune/mp3.zip.001
wget http://mi.soi.city.ac.uk/datasets/magnatagatune/mp3.zip.002
wget http://mi.soi.city.ac.uk/datasets/magnatagatune/mp3.zip.003
After downloading the files, merge and extract the three .zip parts:
cat mp3.zip.* > mp3_all.zip
unzip mp3_all.zip -d mp3
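To verify the extraction, you can count the extracted clips; MagnaTagATune contains roughly 25,000 clips, so the count should be close to that. The helper below is illustrative, not part of the repository:

```python
import os

def count_files(root, ext=".mp3"):
    """Recursively count files under `root` with the given extension."""
    return sum(
        1
        for _, _, files in os.walk(root)
        for f in files
        if f.lower().endswith(ext)
    )

print(count_files("mp3"))  # run from data/mtt/raw
```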
Your directory structure should look like this:
data
└── mtt
    └── raw
        ├── annotations_final.csv
        └── mp3
            ├── 0
            ├── ...
            └── f
Finally, segment and convert the audio to TFRecords using the following command:
python build_dataset.py mtt
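If you want to inspect the annotations yourself, annotations_final.csv is tab-delimited: the first column is a clip ID, the last is the mp3 path, and the columns in between are binary tag indicators. A sketch of a reader (the `read_annotations` helper is illustrative, not part of the repository):

```python
import csv
import io

def read_annotations(fileobj):
    """Yield (clip_id, mp3_path, active_tags) tuples from annotations_final.csv."""
    reader = csv.reader(fileobj, delimiter="\t")
    header = next(reader)
    tag_names = header[1:-1]  # columns between clip_id and mp3_path
    for row in reader:
        clip_id, mp3_path = row[0], row[-1]
        tags = [t for t, v in zip(tag_names, row[1:-1]) if v == "1"]
        yield clip_id, mp3_path, tags
```

For example, `read_annotations(open("annotations_final.csv"))` would yield one tuple per clip, with the tag list holding only the tags marked 1.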
Keyword spotting: Speech Commands
Pete Warden (2018). Speech commands: A dataset for limited-vocabulary speech recognition. arXiv:1804.03209.
After creating a directory for the dataset, download and extract the dataset into data/scd/raw:
mkdir -p data/scd/raw
cd data/scd/raw
wget http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz
tar zxvf speech_commands_v0.02.tar.gz
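In Speech Commands, each keyword has its own directory of .wav files, and the directory name is the label (directories such as `_background_noise_` hold noise clips rather than classes). A small sketch to list the labels (illustrative, not part of the repository):

```python
import os

def list_labels(root):
    """Return the sorted class labels, skipping underscore-prefixed directories."""
    return sorted(
        d for d in os.listdir(root)
        if os.path.isdir(os.path.join(root, d)) and not d.startswith("_")
    )
```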
Finally, segment and convert the audio to TFRecords using the following command:
python build_dataset.py scd
Acoustic scene tagging: DCASE 2017 Task 4
Annamaria Mesaros, Toni Heittola, Aleksandr Diment, Benjamin Elizalde, Ankit Shah, Emmanuel Vincent, Bhiksha Raj and Tuomas Virtanen (2017). DCASE 2017 challenge setup: tasks, datasets and baseline system. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017).
Create a directory for the dataset, then download the audio archives and ground-truth files into data/dcs/raw:
mkdir -p data/dcs/raw
cd data/dcs/raw
wget --no-check-certificate -r 'https://docs.google.com/uc?export=download&id=1HOQaUHbTgCRsS6Sr9I9uE6uCjiNPC3d3' -O Task_4_DCASE_2017_training_set.zip
wget --no-check-certificate -r 'https://docs.google.com/uc?export=download&id=1GfP5JATSmCqD8p3CBIkk1J90mfJuPI-k' -O Task_4_DCASE_2017_testing_set.zip
wget https://dl.dropboxusercontent.com/s/bbgqfd47cudwe9y/DCASE_2017_evaluation_set_audio_files.zip
unzip -P DCASE_2017_training_set Task_4_DCASE_2017_training_set.zip
unzip -P DCASE_2017_testing_set Task_4_DCASE_2017_testing_set.zip
unzip -P DCASE_2017_evaluation_set DCASE_2017_evaluation_set_audio_files.zip
wget https://github.com/ankitshah009/Task-4-Large-scale-weakly-supervised-sound-event-detection-for-smart-cars/raw/master/groundtruth_release/groundtruth_weak_label_training_set.csv
wget https://github.com/ankitshah009/Task-4-Large-scale-weakly-supervised-sound-event-detection-for-smart-cars/raw/master/groundtruth_release/groundtruth_weak_label_testing_set.csv
wget https://github.com/ankitshah009/Task-4-Large-scale-weakly-supervised-sound-event-detection-for-smart-cars/raw/master/groundtruth_release/groundtruth_weak_label_evaluation_set.csv
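The ground-truth files appear to be tab-separated, with one (filename, onset, offset, label) row per label occurrence, so a clip with several labels spans several rows. A sketch that aggregates weak labels per file (the `read_weak_labels` helper is illustrative, and the four-field layout is an assumption worth checking against the downloaded files):

```python
import csv
from collections import defaultdict

def read_weak_labels(fileobj):
    """Map each filename to the set of weak labels attached to it.

    Assumes tab-separated rows of (filename, onset, offset, label).
    """
    labels = defaultdict(set)
    for row in csv.reader(fileobj, delimiter="\t"):
        if len(row) >= 4:
            labels[row[0]].add(row[3])
    return labels
```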
Finally, segment and convert the audio to TFRecords using the following command:
python build_dataset.py dcs
Training a SampleCNN
You can train a SampleCNN with any of the available convolutional blocks on a dataset of your choice. Here are several examples:
# Train a SampleCNN with SE block (default) on MagnaTagATune dataset (music auto-tagging)
python train.py mtt
# Train a SampleCNN with ReSE-2 block on Speech Commands dataset (keyword spotting)
python train.py scd --block rese2
# Train a SampleCNN with basic block on DCASE 2017 Task 4 dataset (acoustic scene tagging)
python train.py dcs --block basic
Trained models are saved under the log directory, in a subdirectory named with the datetime at which training started.
Here is an example of a saved model:
log/
└── 20190424_213449-scd-se/
    └── final-auc_0.XXXXXX-acc_0.XXXXXX-f1_0.XXXXXX.h5
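Since the filename encodes the final evaluation metrics, you can recover them programmatically, e.g. when picking the best run. A small sketch (the `parse_metrics` helper is illustrative, not part of the repository):

```python
import re

def parse_metrics(filename):
    """Extract the auc/acc/f1 values encoded in a saved-model filename."""
    return {m: float(v) for m, v in re.findall(r"(auc|acc|f1)_(\d+\.\d+)", filename)}
```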
You can see the available options for training using the command below:
$ python train.py -h
usage: train.py [-h] [--data-dir PATH] [--log-dir PATH]
                [--block {basic,se,res1,res2,rese1,rese2}]
                [--amplifying-ratio N] [--multi] [--batch-size N]
                [--momentum M] [--lr LR] [--lr-decay DC] [--dropout DO]
                [--weight-decay WD] [--num-stages N] [--patience N]
                [--num-readers N]
                DATASET [NAME]

Train a SampleCNN.

positional arguments:
  DATASET               Dataset for training: {mtt|scd|dcs}
  NAME                  Name of log directory.

optional arguments:
  -h, --help            show this help message and exit
  --data-dir PATH
  --log-dir PATH        Directory where to write event logs and models.
  --block {basic,se,res1,res2,rese1,rese2}
                        Convolutional block to build a model (default: se,
                        options: basic/se/res1/res2/rese1/rese2).
  --amplifying-ratio N
  --multi               Use multi-level feature aggregation.
  --batch-size N        Mini-batch size.
  --momentum M          Momentum for SGD.
  --lr LR               Learning rate.
  --lr-decay DC         Learning rate decay rate.
  --dropout DO          Dropout rate.
  --weight-decay WD     Weight decay.
  --num-stages N        Number of stages to train.
  --patience N          Stop training stage after #patiences.
  --num-readers N       Number of TFRecord readers.