Speech Super-resolution with Unconditional Diffwave

Source code for the paper "Conditioning and Sampling in Variational Diffusion Models for Speech Super-Resolution".

Training

Install the Python requirements.

pip install -r requirements.txt

Convert all data files to .wav format and place them in a single directory. The following command trains a 48 kHz unconditional diffusion model (UDM).

python train.py model.res_channels=64 epochs=50 sr=48000 train_T=0 dataset.size=120000 dataset.segment=32768 dataset.data_dir=/your/vctk/train/set/ loader.batch_size=12 scheduler.patience=1000000
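As a rough sanity check on these settings, the arithmetic below estimates the duration of one training segment and the total number of training steps. It assumes that dataset.size is the number of segments drawn per epoch and that each step consumes one batch; neither is stated above, so treat the result as an estimate. The function name is hypothetical and not part of this repository.

```python
def training_budget(sr, segment, size, batch_size, epochs):
    """Return (segment duration in seconds, steps per epoch, total steps).

    Assumption: `size` segments are drawn per epoch, one batch per step.
    """
    seconds = segment / sr
    steps_per_epoch = size // batch_size
    return seconds, steps_per_epoch, steps_per_epoch * epochs


seconds, per_epoch, total = training_budget(
    sr=48000, segment=32768, size=120000, batch_size=12, epochs=50
)
print(f"{seconds:.3f} s per segment, {per_epoch} steps/epoch, {total} steps")
```

Under these assumptions the total comes to 500000 steps, which lines up with the training_checkpoint_500000.pt file name used in the evaluation commands.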

Evaluation

The numbers in the paper can be reproduced with the following commands. The main options are:

--rate: the upscaling ratio.
--downsample-type: the downsampling filter.
--infer-type: the upscaling method.
--lr: the $\eta$ value in the paper.
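To make the upscaling ratio concrete, the sketch below computes the sampling rate of the downsampled input and the bandwidth that survives downsampling. It assumes an ideal lowpass filter that cuts off at the new Nyquist frequency; the function name is illustrative and not part of this repository.

```python
def low_res_band(sr, rate):
    """Return (low-resolution sampling rate, retained bandwidth) in Hz.

    Assumes the downsampling filter cuts off at the new Nyquist frequency.
    """
    low_sr = sr / rate
    return low_sr, low_sr / 2


for rate in (2, 3):
    low_sr, cutoff = low_res_band(48000, rate)
    print(f"rate={rate}: {low_sr:.0f} Hz input, content up to {cutoff:.0f} Hz")
```

For the 48 kHz model, a ratio of 2 leaves content up to 12 kHz and a ratio of 3 leaves content up to 8 kHz; everything above that is what the model has to generate.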

Spline Interpolation

python vctk_dsp_baseline.py /your/vctk/test/set/ --downsample-type sinc --infer-type spline --rate 2

UDM+

python -W ignore vctk_infer.py outputs/XXXX/saved/training_checkpoint_500000.pt outputs/XXXX/.hydra/config.yaml /your/vctk/test/set --rate 2 -T 50 --infer-type manifold --downsample-type stft --lr 0.67

UDM+ without MCG

python -W ignore vctk_infer.py outputs/XXXX/saved/training_checkpoint_500000.pt outputs/XXXX/.hydra/config.yaml /your/vctk/test/set --rate 3 -T 50 --infer-type inpainting --downsample-type sinc

NU-Wave(+)

The UDM checkpoint is used for noise scheduling. For training NU-Wave, please refer to here. To evaluate NU-Wave+, change --infer-type to nuwave-manifold and specify a value for --lr.

python -W ignore vctk_infer.py outputs/XXXX/saved/training_checkpoint_500000.pt outputs/XXXX/.hydra/config.yaml /your/vctk/test/set --nuwave-ckpt /XXXX/checkpoints_nuwave_x2/nuwave_x2_01_07_22_epoch\=645_EMA --rate 2 -T 50 --infer-type nuwave --downsample-type stft

NU-Wave 2(+)

The UDM checkpoint is used for noise scheduling. For training NU-Wave 2, please refer to here. To evaluate NU-Wave 2+, change --infer-type to nuwave2-manifold and specify a value for --lr.

python -W ignore vctk_infer.py outputs/XXXX/saved/training_checkpoint_500000.pt outputs/XXXX/.hydra/config.yaml /your/vctk/test/set --nuwave-ckpt /XXXX/nuwave2_08_14_09_epoch\=72_EMA --rate 3 -T 50 --infer-type nuwave2 --downsample-type sinc

We'll release the script for evaluating WSRGlow and NVSR in the future.

Pre-trained Checkpoints

48 kHz
16 kHz

Extending to non-zero phase response lowpass filters

Downsampling audio with an IIR lowpass filter introduces non-linear phase delays, which breaks the assumption that the frequency mask is real-valued. A simple way to compensate for the delays is to apply the same filter again during upsampling, but backward in time. We repeated the 48 kHz experiment from the paper with an 8th-order Chebyshev Type I lowpass filter.

Method	2x	3x
NU-Wave	0.87	1.00
NU-Wave 2	0.73	0.87
NU-Wave+	1.03	1.32
NU-Wave 2+	0.86	1.00
UDM+	0.64	0.79
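
The forward-backward compensation described above can be illustrated with SciPy. This is a minimal sketch, not the code used for the experiment: the cutoff (half the Nyquist frequency, as for 2x downsampling) and the 0.05 dB passband ripple are assumptions, and the actual decimation and upsampling steps are left out so that only the phase behaviour is visible.

```python
import numpy as np
from scipy import signal

# Assumed settings: cutoff at half the Nyquist frequency (2x downsampling)
# and 0.05 dB passband ripple. The text only fixes the order and filter type.
sos = signal.cheby1(8, 0.05, 0.5, btype="low", output="sos")

# A unit impulse in the middle of a long buffer exposes the impulse response.
n = 4001
x = np.zeros(n)
x[n // 2] = 1.0

# Forward pass: causal IIR filtering, which delays each frequency differently.
forward = signal.sosfilt(sos, x)

# Backward pass: filter the time-reversed signal, then reverse it back.
both = signal.sosfilt(sos, forward[::-1])[::-1]

# A response that is symmetric around the impulse has zero phase,
# i.e. its frequency response is real-valued.
print("one pass symmetric:", np.allclose(forward, forward[::-1]))
print("two passes symmetric:", np.allclose(both, both[::-1]))
```

The single pass should come out asymmetric and the forward-backward pair symmetric, which is the property that restores a real-valued frequency mask. SciPy's signal.sosfiltfilt performs the same forward-backward filtering in one call, with additional edge padding.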
