
Autovocoder: Fast Waveform Generation from a Learned Speech Representation using Differentiable Digital Signal Processing

Unofficial PyTorch implementation of Autovocoder: Fast Waveform Generation from a Learned Speech Representation using Differentiable Digital Signal Processing. This repository is based on the iSTFTNet GitHub repository (paper).

Disclaimer: This repo is built for testing purposes.

Training:

python train.py --config config.json

In train.py, set --input_wavs_dir to the directory containing LJSpeech-1.1/wavs.
In config.json, set latent_dim to choose between the AV128, AV192, and AV256 (default) model sizes.
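For example, a minimal config.json fragment (only the latent_dim field is taken from the instructions above; the surrounding training fields that a real config would contain are omitted here):

```json
{
  "latent_dim": 256
}
```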
Following Section 3.3 of the paper, you can set dec_istft_input to cartesian (default), polar, or both.
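As a sketch of what the cartesian/polar choice above means (illustrative NumPy only; the repo's actual decoder works on PyTorch tensors and variable names here are assumptions): the decoder's iSTFT input can be parameterised either as real and imaginary parts (cartesian) or as magnitude and phase (polar), and both parameterisations describe the same complex spectrogram:

```python
import numpy as np

# Toy complex "spectrogram" standing in for the decoder output
# (513 frequency bins x 100 frames, values are arbitrary).
rng = np.random.default_rng(0)
spec = rng.standard_normal((513, 100)) + 1j * rng.standard_normal((513, 100))

# Cartesian parameterisation: predict real and imaginary parts directly.
real, imag = spec.real, spec.imag
spec_cartesian = real + 1j * imag

# Polar parameterisation: predict magnitude and phase instead.
mag, phase = np.abs(spec), np.angle(spec)
spec_polar = mag * np.exp(1j * phase)

# Either route reconstructs the same complex spectrogram,
# which is then passed to the inverse STFT to produce the waveform.
assert np.allclose(spec_cartesian, spec)
assert np.allclose(spec_polar, spec)
```

The "both" option in the config corresponds to feeding the iSTFT stage both parameterisations rather than one.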

Note:

  • Validation loss of AV256 during training.

  • In our tests, it converges roughly 3× faster than HiFi-GAN V1 (compared against the official repo).

Citations:

@article{Webber2022AutovocoderFW,
  title={Autovocoder: Fast Waveform Generation from a Learned Speech Representation using Differentiable Digital Signal Processing},
  author={Jacob J. Webber and Cassia Valentini-Botinhao and Evelyn Williams and Gustav Eje Henter and Simon King},
  journal={ArXiv},
  year={2022},
  volume={abs/2211.06989}
}

References:

  • https://github.com/jik876/hifi-gan
  • https://github.com/rishikksh20/iSTFTNet-pytorch