Speech-Command-Recognition-with-Capsule-Network
Speech command recognition with capsule network & various NNs / KWS on Google Speech Command Dataset.
End-to-End Speech Command Recognition with Capsule Network
INTERSPEECH 2018 paper: link
We apply a capsule network to capture the spatial relationships and pose information of speech spectrogram features along both the frequency and time axes. On a dataset of one-second speech commands, our proposed end-to-end speech recognition (SR) system with capsule networks achieves better results than baseline CNN models on both the clean and the noise-added test sets.
- 20 JAN 2019: Other baseline keyword spotting (KWS) models are also provided in the CNN code.
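In a capsule network, each capsule outputs a vector whose length is read as the probability that a feature is present and whose direction carries the pose. The sketch below shows the standard squash nonlinearity from the original capsule network formulation in plain Python. It is an illustration only, not the Keras code in this repository; the function names `squash` and `vector_length` are hypothetical.

```python
import math


def vector_length(vector):
    """Euclidean length of a capsule output vector."""
    return math.sqrt(sum(v * v for v in vector))


def squash(vector, eps=1e-9):
    """Shrink a vector so its length lies in [0, 1) while keeping its direction.

    squash(s) = (|s|^2 / (1 + |s|^2)) * (s / |s|)
    eps avoids a division by zero for the all-zero vector.
    """
    squared_norm = sum(v * v for v in vector)
    norm = math.sqrt(squared_norm)
    scale = squared_norm / (1.0 + squared_norm) / (norm + eps)
    return [scale * v for v in vector]
```

Long input vectors map to lengths close to 1 and short ones to lengths close to 0, so the output length can be compared across the command classes.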
Getting Started
The code is written for Python 2 (2.7.12).
Prerequisites
You should be able to import the following libraries:
tqdm, numpy(1.14.1), termcolor, scipy, sklearn, scikits
tensorflow(1.6.0), keras(2.1.4)
pip install tqdm
pip install numpy==1.14.1
pip install termcolor
pip install scipy
pip install scikit-learn
pip install tensorflow-gpu==1.6.0
pip install keras==2.1.4
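To confirm that the environment is ready, a short check like the one below can report which libraries fail to import. It is a minimal sketch: the helper name `missing_modules` is hypothetical, and `scikits` is left out of the list because its exact import name is not stated here.

```python
import importlib

# Import names for the libraries listed above (scikit-learn imports as "sklearn").
REQUIRED = ["tqdm", "numpy", "termcolor", "scipy", "sklearn", "tensorflow", "keras"]


def missing_modules(names):
    """Return the subset of names that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing

# Usage: print(missing_modules(REQUIRED))  # an empty list means everything imports
```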
Speech Feature Generation
Dataset
We use the 'Google Speech Command Dataset'. For details, refer to the blog and the Download Link.
- Download the dataset from the link above and unzip it. (In our case, we unzip it into a folder named 'Google_Speech_Command'.)
Adding noise
To add noise to the original dataset, we use MATLAB with voicebox, a MATLAB library. We ran the MATLAB code on a local Windows machine and uploaded the output to a Linux server.
- Unzip the downloaded Google Speech Command dataset.
- Create a new folder named 'Google_Speech_Command' and move the command folders into it. The folder structure will then look like this:
speech_commands_v0.01.tar
|-- [_background_noise_]
|-- Google_Speech_Command
|   |-- bed
|   |-- bird
:   :
|   '-- zero
|-- testing_list
|-- validation_list
'-- etc.
- Change 'data_path' in the MATLAB code and run it. It will create a new folder and save the noise-added audio files there.
noise_wave_generate.m
- You can also change 'SNR' in the code to generate noisy audio files at other signal-to-noise ratios.
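The noise scaling behind an SNR setting can be sketched in plain Python. This is not the MATLAB/voicebox code in noise_wave_generate.m, which may measure signal levels differently. The sketch assumes that SNR is defined on average power over the whole clip, and the names `signal_power` and `mix_at_snr` are hypothetical.

```python
import math


def signal_power(samples):
    """Average power (mean of squared samples)."""
    return sum(s * s for s in samples) / float(len(samples))


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the speech-to-noise power ratio is snr_db."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]  # trim the noise to the clip length
    p_speech = signal_power(speech)
    p_noise = signal_power(noise)
    if p_noise == 0:
        raise ValueError("noise is silent, cannot scale it")
    # Solve gain^2 * p_noise = p_speech / 10^(snr_db / 10) for gain.
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]
```

With `snr_db=5`, the scaled noise carries about 0.316 times the power of the speech; lowering the value makes the noise louder relative to the command.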
Feature Generation
Extract speech features from the raw audio files and save them as .npy files. Adjust the '--noise_name' argument to match the noise condition you want to process.
cd core
python feature_generation.py
Data folder structure
feature_saved
|-- TEST
|   |-- fbank
|   |   |-- clean
|   |   '-- [noise names]_SNR5
|   '-- label
|-- TRAIN
|   |-- fbank
|   |   |-- clean
|   |   '-- [noise names]_SNR5
|   '-- label
'-- VALID
    |-- fbank
    |   |-- clean
    |   '-- [noise names]_SNR5
    '-- label
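The layout above can be addressed with a few small path helpers. This is a sketch with hypothetical function names; it assumes that `[noise names]` stands for a single noise name joined to the SNR value (for example `white_SNR5`, where `white` is a made-up noise name), and it does not reflect how feature_generation.py builds its paths.

```python
import os

SPLITS = ("TRAIN", "VALID", "TEST")


def noisy_condition(noise_name, snr_db):
    """Folder name for a noise condition, e.g. 'white_SNR5' (assumed naming)."""
    return "{}_SNR{}".format(noise_name, snr_db)


def feature_dir(root, split, condition="clean", feature="fbank"):
    """Directory holding the .npy features of one split and one condition."""
    if split not in SPLITS:
        raise ValueError("split must be one of {}".format(SPLITS))
    return os.path.join(root, split, feature, condition)


def label_dir(root, split):
    """Directory holding the labels of one split."""
    if split not in SPLITS:
        raise ValueError("split must be one of {}".format(SPLITS))
    return os.path.join(root, split, "label")
```

For example, `feature_dir("feature_saved", "TEST", noisy_condition("white", 5))` points at the noisy test features, while `label_dir("feature_saved", "TEST")` points at the labels shared by every condition of that split.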
Training & Testing
To train or test a model, go into the 'CNN' or 'CapsNet' folder and run the code there. Switch between the two modes with the '--is_training' argument.
Training
cd CapsNet
python main.py -m=CapsNet --is_training='TRAIN' -ex='0320_digitvec4' -d=0 --kernel=19 --primary_channel=32 --primary_veclen=4 --digit_veclen=4
Testing
Note that you should set the '--keep' argument to the epoch number of the checkpoint that you want to test.
cd CapsNet
python main.py -m=CapsNet --is_training='TEST' -ex='0320_digitvec4' -d=0 --kernel=19 --primary_channel=32 --primary_veclen=4 --digit_veclen=4 --SNR=5 --keep=?
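When several epochs need to be tested, the command above can be assembled programmatically. The sketch below only builds the argument list from the flags shown in the testing command; the helper name `build_test_command` and the epoch value in the usage comment are hypothetical.

```python
def build_test_command(epoch, snr=5, ex_name="0320_digitvec4"):
    """Build the CapsNet test command for one saved epoch as an argument list."""
    return [
        "python", "main.py",
        "-m=CapsNet",
        "--is_training=TEST",
        "-ex=" + ex_name,
        "-d=0",
        "--kernel=19",
        "--primary_channel=32",
        "--primary_veclen=4",
        "--digit_veclen=4",
        "--SNR={}".format(snr),
        "--keep={}".format(epoch),  # epoch of the checkpoint to evaluate
    ]

# Usage (run from inside the CapsNet folder), with 30 as a made-up epoch:
#   import subprocess
#   subprocess.call(build_test_command(30))
```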
KWS models based on various neural networks
KWS models based on various kinds of neural networks (NNs) are also provided in CNN/model.py
1. Deep Neural Network (DNN) based KWS model from
- G. Chen, C. Parada, and G. Heigold, “Small-footprint keyword spotting using deep neural networks.” in ICASSP, vol. 14. Citeseer, 2014, pp. 4087–4091.
Include 'ref_2014icassp_dnn' in ex_name to use the DNN model. For example:
```
python main.py --model='CNN' --ex_name='ref_2014icassp_dnn512' --is_training='TRAIN' --model_size_info 512 512 512
```
2. CNN based KWS model from
- T. N. Sainath and C. Parada, “Convolutional neural networks for small-footprint keyword spotting,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
Include 'ref_2015is_cnn' in ex_name to use the CNN model. For example:
```
python main.py --model='CNN' --ex_name='ref_2015is_cnn' --is_training='TRAIN' --model_size_info 21 8 94 1 1 2 3 6 4 94 1 1 1 1 32
```
3. Long Short-Term Memory (LSTM) based KWS model from
- M. Sun, A. Raju, G. Tucker, S. Panchapagesan, G. Fu, A. Mandal, S. Matsoukas, N. Strom, and S. Vitaladevuni, “Max-pooling loss training of long short-term memory networks for small-footprint keyword spotting,” in Spoken Language Technology Workshop (SLT), 2016 IEEE. IEEE, 2016, pp. 474–480.
Include 'ref_rnn' in ex_name to use the LSTM model. For example:
```
python main.py --model='CNN' --ex_name=ref_rnn_lstm --is_training='TRAIN' --model_size_info 64 32 0
```
4. Convolutional Recurrent Neural Network (CRNN) based KWS model from
- S. O. Arik, M. Kliegl, R. Child, J. Hestness, A. Gibiansky, C. Fougner, R. Prenger, and A. Coates, “Convolutional recurrent neural networks for small-footprint keyword spotting,” arXiv preprint arXiv:1703.05390, 2017.
Include 'ref_crnn' in ex_name to use the CRNN model. For example:
```
python main.py --model='CNN' --ex_name=ref_crnn --is_training='TRAIN' --model_size_info 32 20 5 8 2 2 32 1 64
```
Reference
Preprocessing source code from https://github.com/zzw922cn/Automatic_Speech_Recognition.
Base capsule network keras source code from https://github.com/XifengGuo/CapsNet-Keras.
Authors
Jaesung Bae - Korea Advanced Institute of Science and Technology (KAIST)
contact: [email protected]