
Neural Voice Cloning with a Few Samples

Open flrngel opened this issue 6 years ago • 0 comments

https://arxiv.org/abs/1802.06006 Paper from Baidu Research

Abstract

The paper studies two approaches to voice cloning:

  • Speaker adaptation
    • fine-tuning a multi-speaker generative model
  • Speaker encoding
    • inferring a speaker embedding, which is then used with a multi-speaker generative model

1. Introduction

  • Text carries linguistic information
  • Speaker representation captures speaker's characteristics (pitch, speech rate, accent)
  • This paper focuses on voice cloning
  • Compares speech naturalness, speaker similarity, cloning/inference time, model footprint

2. Voice Cloning

image

Paper Notations

  • f: multi-speaker generative model
  • g: speaker encoding function
  • t: text
  • s: speaker
  • a: audio
  • S: speaker set
  • A: audio set

2.1. Speaker adaptation

Speaker adaptation function

image
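A minimal sketch of the fine-tuning idea, not the paper's implementation. It assumes a toy scalar model `f(t) = w * t + e`, where `w` stands in for the shared multi-speaker weights and `e` for the speaker embedding. The `adapt` function, the squared-error loss, and the sample values are all hypothetical. It only illustrates the difference between updating the embedding alone and updating the whole model.

```python
def adapt(samples, w, e, lr=0.1, steps=500, whole_model=False):
    """Fine-tune a toy model f(t) = w * t + e on (text_feature, audio_feature) pairs.

    whole_model=False updates only the speaker embedding e.
    whole_model=True also updates the shared weight w.
    """
    n = len(samples)
    for _ in range(steps):
        grad_w = 0.0
        grad_e = 0.0
        for t, a in samples:
            err = (w * t + e) - a      # prediction error on one cloning sample
            grad_w += 2 * err * t / n  # gradient of mean squared error w.r.t. w
            grad_e += 2 * err / n      # gradient of mean squared error w.r.t. e
        e -= lr * grad_e
        if whole_model:
            w -= lr * grad_w
    return w, e


# Hypothetical cloning samples generated by a = 2.0 * t + 0.5
samples = [(0.0, 0.5), (1.0, 2.5), (2.0, 4.5)]

# Embedding-only adaptation: shared weight stays fixed at its pretrained value.
w_emb, e_emb = adapt(samples, w=2.0, e=0.0)

# Whole-model adaptation: both the shared weight and the embedding move.
w_all, e_all = adapt(samples, w=1.0, e=0.0, whole_model=True)
```

Embedding-only adaptation touches far fewer parameters per speaker, while whole-model adaptation has more freedom to fit the cloning samples.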

2.2. Speaker encoding

Speaker encoding function

image

The paper avoids mode collapse by training the speaker encoder separately

Loss function (L1)

image
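As a rough illustration, an L1 loss between a predicted speaker embedding and a target embedding could be computed as below. The function name `l1_embedding_loss`, the example vectors, and the choice to average over dimensions (rather than sum) are assumptions, not details taken from the paper.

```python
def l1_embedding_loss(predicted, target):
    """Mean absolute difference between two embeddings of equal length."""
    if len(predicted) != len(target):
        raise ValueError("embeddings must have the same dimension")
    return sum(abs(p - q) for p, q in zip(predicted, target)) / len(predicted)


# Hypothetical embeddings: encoder output vs. embedding of the trained generative model
predicted = [0.2, -0.1, 0.4]
target = [0.0, 0.1, 0.5]
loss = l1_embedding_loss(predicted, target)
```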

Architecture

image

  • Spectral processing
  • Temporal processing
  • Cloning sample attention
    • uses multi-head self-attention from Transformer
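The cloning sample attention step can be sketched in a heavily simplified form. The code below assumes a single attention head with identity query/key/value projections, and it mean-pools the attended outputs into one embedding. Both simplifications, along with the names `self_attention` and `aggregate`, are assumptions for illustration. The paper's encoder uses multi-head attention with learned parameters.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [x / total for x in exps]


def self_attention(samples):
    """Single-head scaled dot-product self-attention with identity projections."""
    d = len(samples[0])
    attended = []
    for q in samples:
        # Score this sample against every cloning sample (including itself).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in samples]
        weights = softmax(scores)
        # Weighted sum of all sample vectors.
        attended.append(
            [sum(w * v[i] for w, v in zip(weights, samples)) for i in range(d)]
        )
    return attended


def aggregate(samples):
    """Mean-pool the attended vectors into one speaker embedding (an assumption)."""
    attended = self_attention(samples)
    n = len(attended)
    return [sum(row[i] for row in attended) / n for i in range(len(attended[0]))]


# Hypothetical per-sample feature vectors from three cloning audios
embedding = aggregate([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
```

Because every attended vector is a convex combination of the inputs, the aggregated embedding stays within the range spanned by the cloning samples.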

2.3. Discriminative models for evaluation

Because human evaluation is expensive, the paper proposes two discriminative models for evaluation

2.3.1. Speaker Classification

  • Adds an additional embedding layer before the softmax function of the overall architecture

2.3.2. Speaker Verification

  • Binary classification of whether the test audio and the enrolled audio are from the same speaker

image
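To make the same-speaker decision concrete, here is a stand-in sketch that thresholds cosine similarity between a test embedding and the centroid of the enrolled embeddings. This is a simpler technique swapped in for illustration and is not the paper's verification model, which is a trained discriminative model. The function names, the threshold value, and the example vectors are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def same_speaker(test_emb, enrolled_embs, threshold=0.7):
    """Binary decision: does the test audio match the enrolled speaker?"""
    d = len(test_emb)
    # Average the enrolled embeddings into a single reference vector.
    centroid = [sum(e[i] for e in enrolled_embs) / len(enrolled_embs) for i in range(d)]
    return cosine_similarity(test_emb, centroid) >= threshold


# Hypothetical embeddings of two enrolled audios from one speaker
enrolled = [[1.0, 0.0], [0.9, 0.1]]
match = same_speaker([1.0, 0.05], enrolled)
mismatch = same_speaker([0.0, 1.0], enrolled)
```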

3. Experiments

3.1. Datasets

  • LibriSpeech dataset for multi-speaker generative model & speaker encoder model
  • sampling from VCTK for voice cloning

flrngel, Mar 04 '18 04:03