understanding-ai
Neural Voice Cloning with a Few Samples
https://arxiv.org/abs/1802.06006 Paper from Baidu Research
Abstract
The paper proposes two approaches to voice cloning:
- Speaker adaptation
  - fine-tune a multi-speaker generative model on a few cloning samples
- Speaker encoding
  - infer a speaker embedding directly from audio, which is then used with the multi-speaker generative model
1. Introduction
- Text carries linguistic information
- Speaker representation captures speaker's characteristics (pitch, speech rate, accent)
- This paper focuses on voice cloning
- Compares speech naturalness, speaker similarity, cloning/inference time, model footprint
2. Voice Cloning
Paper Notations
- f: multi-speaker generative model
- g: speaker encoding function
- t: text
- s: speaker
- a: audio
- S: speaker set
- A: audio set
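Putting the notation together, the multi-speaker training objective can be sketched as follows (a reconstruction from these definitions, not a verbatim copy of the paper's equation):

```latex
\min_{W,\, e} \;\; \mathbb{E}_{s \sim \mathcal{S},\; (t,\, a) \sim \mathcal{A}_s}
\; L\!\big( f(t, s;\, W, e_s),\; a \big)
```

where $W$ are the shared generative-model weights, $e_s$ is the trainable embedding of speaker $s$, and $L$ is a reconstruction loss between generated and ground-truth audio.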
2.1. Speaker adaptation
Speaker adaptation function
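The embedding-only variant of speaker adaptation can be sketched with a toy linear stand-in for the generative model f: the shared weights stay frozen and only the new speaker's embedding e_s is fitted by gradient descent on the cloning samples (everything here, including the loss and learning rate, is an illustrative assumption, not the paper's exact setup).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the multi-speaker generative model: f(t, s) = t @ W + e_s,
# i.e. text features combined with a per-speaker embedding.
D_TEXT, D_AUDIO = 4, 3
W = rng.normal(size=(D_TEXT, D_AUDIO))          # shared weights (kept frozen)

def f(t, e_s):
    """Hypothetical generative model: linear text mapping plus speaker embedding."""
    return t @ W + e_s

# A few cloning samples (t_i, a_i) from a new speaker whose embedding is unknown.
true_e = np.array([1.0, -2.0, 0.5])
T = rng.normal(size=(5, D_TEXT))
A = f(T, true_e)

# Speaker adaptation (embedding-only): fit e_s by gradient descent on the
# mean-squared reconstruction loss over the cloning samples.
e_s = np.zeros(D_AUDIO)
lr = 0.1
for _ in range(200):
    grad = 2 * (f(T, e_s) - A).mean(axis=0)     # d/de_s of the MSE loss
    e_s -= lr * grad

print(np.allclose(e_s, true_e, atol=1e-6))      # True: embedding recovered
```

The paper also considers adapting the whole model rather than the embedding alone; this sketch shows only the cheaper embedding-only case.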
2.2. Speaker encoding
Speaker encoding function
Paper avoids mode collapse by training the speaker encoder separately
Loss function (L1)
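The separately trained encoder g is fitted so that its predicted embedding matches the target speaker embedding e_s under an L1 loss, which can be sketched as (the concrete numbers are illustrative):

```python
import numpy as np

def l1_loss(pred_emb, target_emb):
    """L1 loss between the encoder's predicted embedding g(A_s)
    and the target speaker embedding e_s."""
    return np.abs(pred_emb - target_emb).mean()

pred = np.array([0.9, -1.8, 0.4])      # hypothetical g(A_s)
target = np.array([1.0, -2.0, 0.5])    # hypothetical e_s
print(round(l1_loss(pred, target), 3))  # 0.133
```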
Architecture
- Spectral processing
- Temporal processing
- Cloning sample attention
- uses multi-head self-attention from Transformer
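The three stages above can be sketched end to end in numpy; the spectral/temporal processing is reduced to mean pooling and the attention uses identity projections, so this is a minimal stand-in for the paper's conv-plus-attention encoder, not its actual architecture.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def encode_sample(mel):
    """Spectral + temporal processing, reduced here to mean pooling over
    time (the paper uses conv layers; this is a placeholder)."""
    return mel.mean(axis=0)

def multi_head_attention(X, num_heads=2):
    """Minimal multi-head self-attention across the cloning samples
    (Q = K = V = X), with identity projections for brevity."""
    N, D = X.shape
    d = D // num_heads
    heads = []
    for h in range(num_heads):
        Xh = X[:, h * d:(h + 1) * d]
        attn = softmax(Xh @ Xh.T / np.sqrt(d), axis=-1)
        heads.append(attn @ Xh)
    return np.concatenate(heads, axis=-1)

# N cloning audio samples, each a (time, mel-bands) spectrogram.
samples = [rng.normal(size=(20, 8)) for _ in range(3)]
X = np.stack([encode_sample(m) for m in samples])   # (N, 8) per-sample features
H = multi_head_attention(X)                          # attend across samples
embedding = H.mean(axis=0)                           # combine into one speaker embedding
print(embedding.shape)  # (8,)
```

Attending over the cloning samples lets the encoder weight more informative samples when producing the single speaker embedding.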
2.3. Discriminative models for evaluation
Because human evaluation is so expensive, the paper proposes these two discriminative models for evaluation
2.3.1. Speaker Classification
- An additional embedding layer is placed before the softmax function of the whole architecture
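The idea above (a bottleneck embedding layer feeding a softmax over training speakers) can be sketched as follows; all layer shapes and the tanh nonlinearity are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

# Hypothetical classifier: audio features -> embedding layer -> softmax
# over the training speakers.
N_SPEAKERS, D_FEAT, D_EMB = 10, 32, 8
W_emb = rng.normal(size=(D_FEAT, D_EMB))    # embedding layer before softmax
W_cls = rng.normal(size=(D_EMB, N_SPEAKERS))

def classify(features):
    emb = np.tanh(features @ W_emb)         # bottleneck speaker embedding
    probs = softmax(emb @ W_cls)
    return probs.argmax(), probs

features = rng.normal(size=D_FEAT)
speaker_id, probs = classify(features)
print(0 <= speaker_id < N_SPEAKERS, np.isclose(probs.sum(), 1.0))  # True True
```

Classification accuracy on cloned audio then serves as an automatic proxy for speaker similarity.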
2.3.2. Speaker Verification
- binary classification of whether the test audio and the enrolled audio come from the same speaker
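A common way to realize this binary decision is to threshold the cosine similarity between the test and enrolled speaker embeddings; the threshold value and the toy vectors below are assumptions, not the paper's verification system.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_speaker(test_emb, enrolled_emb, threshold=0.7):
    """Binary decision: accept if the test embedding is close enough
    to the enrolled speaker's embedding (threshold is an assumption)."""
    return cosine(test_emb, enrolled_emb) >= threshold

enrolled = np.array([1.0, 0.2, -0.5])
same = enrolled + np.array([0.05, -0.02, 0.03])   # near-identical voice
other = np.array([-0.8, 1.0, 0.6])                # different speaker

print(same_speaker(same, enrolled), same_speaker(other, enrolled))  # True False
```

Sweeping the threshold yields the equal-error rate (EER) typically reported for speaker verification.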
3. Experiments
3.1. Datasets
- LibriSpeech dataset for training the multi-speaker generative model & the speaker encoder model
- samples drawn from VCTK for voice cloning