
KorEmo

5-class Korean speech emotion classifier

Requirements

Python == 3.5, librosa, Keras (TensorFlow backend), NumPy
If you use Python >= 3.6, use model_6.h5 instead.
Also, if your CUDA version is 10, replace the HDF5 file with the corresponding model in the folder.

Simple usage

 from koremo import pred_emo
 pred_emo(filename)
  • An input file in .wav format is recommended
  • The output is one of five labels (0: Angry, 1: Fear, 2: Joy, 3: Normal, 4: Sad)
  • ONLY ACOUSTIC DATA is utilized
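
For example, a minimal inference sketch, assuming pred_emo returns the integer label (the filename sample.wav is hypothetical; the label names come from the list above):

 # Minimal usage sketch; assumes pred_emo returns an integer label 0-4.
 # 'sample.wav' is a hypothetical input file.
 from koremo import pred_emo

 label_names = {0: 'Angry', 1: 'Fear', 2: 'Joy', 3: 'Normal', 4: 'Sad'}
 pred = pred_emo('sample.wav')
 print(label_names[pred])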

Data preparation

Voices recorded by two Korean voice actors (1 male, 1 female)

Emotion categories (number of utterances)

  • Angry (Female: 1,000 / Male: 800)
  • Fear (Female: 500 / Male: 550)
  • Joy (Female: 1,000 / Male: 1,000)
  • Normal (Female: 2,700 / Male: 2,699)
  • Sad (Female: 500 / Male: 800)

The dataset was primarily constructed for the following paper:

@article{lee2018acoustic,
  title={Acoustic Modeling Using Adversarially Trained Variational Recurrent Neural Network for Speech Synthesis},
  author={Lee, Joun Yeop and Cheon, Sung Jun and Choi, Byoung Jin and Kim, Nam Soo and Song, Eunwoo},
  journal={Proc. Interspeech 2018},
  pages={917--921},
  year={2018}
}
  • Cite the ARTICLE above when referring either to the classification criteria or to the concept of acoustic feature-based Korean emotion classification. Note that the source .wav files are currently not disclosed.
  • Also, cite THIS repository for the usage of the toolkit.
@misc{cho2018koremo,
  title={KorEmo: 5-class Korean speech emotion classifier},
  author={Cho, Won Ik},
  publisher={GitHub},
  journal={GitHub repository},
  howpublished={\url{https://github.com/warnikchow/koremo}},
  year={2018}
}
  • e.g.) The emotion labels were tagged with KorEmo \cite{cho2018koremo}, which is originally based on the acoustic data constructed for Korean speech synthesis \cite{lee2018acoustic}.

System architecture

  • The model adopts a concatenated structure of a CNN and a BiLSTM with self-attention, as in KorInto; the only change is the window size of the third convolutional layer (3 by 3 >> 5 by 5). A sketch of this structure follows the list below.
  • The model was trained with the code in start.py (the data is not provided), in a Python 3.5 environment.
  • The best model achieves Accuracy: 96.45% and F1: 0.9644, with a train:test set ratio of 9:1.
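
Below is a minimal sketch of that concatenated structure in the Keras functional API. The input shape (128 frames by 128 mel bins), the filter counts, and the additive self-attention formulation are illustrative assumptions; the authoritative model definition is in start.py.

 # Hedged sketch of the CNN + BiLSTM self-attention concatenation described
 # above. Input shape, layer widths, and the attention pooling are assumptions
 # for illustration, not the exact configuration used in start.py.
 from keras.models import Model
 from keras.layers import (Input, Reshape, Conv2D, MaxPooling2D, Flatten,
                           Dense, Bidirectional, LSTM, Lambda, concatenate)
 import keras.backend as K

 frames, mels = 128, 128  # assumed spectrogram dimensions
 inp = Input(shape=(frames, mels))

 # CNN branch: treat the spectrogram as a one-channel image.
 img = Reshape((frames, mels, 1))(inp)
 c = Conv2D(32, (3, 3), activation='relu')(img)
 c = MaxPooling2D((2, 2))(c)
 c = Conv2D(64, (3, 3), activation='relu')(c)
 c = MaxPooling2D((2, 2))(c)
 c = Conv2D(64, (5, 5), activation='relu')(c)  # third conv widened to 5x5, per the note above
 cnn_out = Dense(128, activation='relu')(Flatten()(c))

 # BiLSTM branch with additive self-attention pooling over time steps.
 h = Bidirectional(LSTM(64, return_sequences=True))(inp)      # (frames, 128)
 score = Dense(1, activation='tanh')(h)                       # (frames, 1)
 attn = Lambda(lambda s: K.exp(s) / K.sum(K.exp(s), axis=1, keepdims=True))(score)
 context = Lambda(lambda t: K.sum(t[0] * t[1], axis=1))([h, attn])  # weighted sum

 # Concatenate both branches and classify into the five emotion labels.
 out = Dense(5, activation='softmax')(concatenate([cnn_out, context]))

 model = Model(inp, out)
 model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
               metrics=['accuracy'])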