l2-arctic/cascade vocoder issue after stage 6
Hi @unilight, long time no see. Congratulations on your graduation! I should call you sensei now!
Issue
I finished --stage -1 through stage 5 and generated promising, non-accented, bdl-like voices.
However, during the non-parallel conversion, I don't think we have access to the original vocoder config:
vocoder:
checkpoint: /data/group1/z44476r/Experiments/ParallelWaveGAN/egs/l2-arctic/voc1/exp/train_nodev_TXHC_parallel_wavegan.v1/checkpoint-105000steps.pkl
Shall we change this to our local pwg_TXHC, as you mentioned in lsc/README.md?
Looking forward to your reply!
Also, I wonder if the --norm_name self option in stage 4 is necessary.
Although you mentioned it in README.md, the default norm_name before stage 3 is ljspeech, so there will only be dump/*/norm_ljspeech rather than dump/*/norm_self for stage 4 decoding.
Should I ignore this?
Hi @KevinGengGavo,
long time no see. Congratulations on your graduation! I should call you sensei now!
Sorry I am not quite sure who you are... but thank you :)
Shall we change this to our local pwg_TXHC? As you mentioned in lsc/README.md.
Yes! Please do so.
Also I wonder if the --norm_name self in stage 4 is necessary.
You are right, it is not needed... Sorry I didn't check carefully (and obviously you are probably the only person so far who has tried the implementation...)
Hi, I’ve tried the pwg_TXHC vocoder after stage 5, and it's somehow working now. However, there are more artifacts after stage 6 than I expected.
Here, I have attached several samples in egs/l2-arctic/cascade/exp/TXHC_bdl_1032_vtn.tts_pt.v1/results/checkpoint-50000steps/TXHC_eval and egs/l2-arctic/cascade/exp/TXHC_bdl_1032_vtn.tts_pt.v1/results/checkpoint-50000steps/stage2_ppg_sxliu_checkpoint-50000steps_TXHC_eval.
The spectrogram doesn't look the same between the stage 5 output and the stage 6 input; I wonder if it's due to a normalization error.
I would appreciate it if you could help.
Besides, a version conflict occurred during stage 6 at tools/venv/lib/python3.10/site-packages/s3prl_vc/upstream/ppg_sxliu/stft.py, line 95.
Here is my PyTorch env.
torch 2.0.1
torch-complex 0.4.3
torchaudio 2.0.2
Based on torch.stft, the return_complex parameter is now required, while the original implementation ignored it.
I set return_complex=False, and you can see the ComplexTensor output in nvpc_decode.log. I'm not sure if this is correct, but it is the only way I found to make the code run.
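For reference, in recent PyTorch the recommended pattern is torch.stft(..., return_complex=True) followed by torch.view_as_real(). The snippet below is only an illustration of what view_as_real does (using numpy in place of torch tensors, with made-up values): it stacks the real and imaginary parts along a new trailing axis.

```python
import numpy as np

def view_as_real(x: np.ndarray) -> np.ndarray:
    # Mimics torch.view_as_real: complex (...) -> real (..., 2)
    return np.stack([x.real, x.imag], axis=-1)

# A toy complex "STFT" output (made-up values)
spec = np.array([1.0 + 2.0j, -0.5 + 0.0j])

out = view_as_real(spec)
# out has shape (2, 2): [[1.0, 2.0], [-0.5, 0.0]]
```

This matches the (..., Freq, Frames, 2=real_imag) layout that the downstream code in stft.py expects.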
Hi @unilight, I would appreciate it if you could take the time to look at this. Thank you.
Besides, a version conflict occurred during stage 6 at tools/venv/lib/python3.10/site-packages/s3prl_vc/upstream/ppg_sxliu/stft.py, line 95. Here is my PyTorch env.
torch 2.0.1
torch-complex 0.4.3
torchaudio 2.0.2
Based on torch.stft, the return_complex parameter is required, while the original implementation ignored this. I set return_complex=False, and you can see the output of ComplexTensor in nvpc_decode.log. I'm not sure if this is correct, but it is the only method that allows the code to be executable.
I got the same issue. You can try adding this block at line 641:

if not return_complex:
    return torch.view_as_real(
        _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
                 normalized, onesided, return_complex=True)
    )

This works for me. Good luck!
Hi @Jasmijn888
Thanks for your reply.
I fixed this issue by adding torch.view_as_real at tools/venv/lib/python3.10/site-packages/s3prl_vc/upstream/ppg_sxliu/stft.py.
Here are my lines 69 to 96:
# or (Batch, Channel, Freq, Frames, 2=real_imag)
if not self.kaldi_padding_mode:
    output = torch.stft(
        input,
        n_fft=self.n_fft,
        win_length=self.win_length,
        hop_length=self.hop_length,
        center=self.center,
        pad_mode=self.pad_mode,
        normalized=self.normalized,
        onesided=self.onesided,
        return_complex=True,
    )
else:
    # NOTE(sx): Use Kaldi-fashion padding, maybe wrong
    num_pads = self.n_fft - self.win_length
    input = torch.nn.functional.pad(input, (num_pads, 0))
    output = torch.stft(
        input,
        n_fft=self.n_fft,
        win_length=self.win_length,
        hop_length=self.hop_length,
        center=False,
        pad_mode=self.pad_mode,
        normalized=self.normalized,
        onesided=self.onesided,
        return_complex=True,
    )
# Change complex output to real and imag parts
output = torch.view_as_real(output)
I don't recommend modifying the PyTorch source code anyway. However, thanks for your feedback!
@Jasmijn888 I'm more curious about your output after stage 6; what does it sound like?
Hi, I’ve tried the pwg_TXHC vocoder after stage 5, and it's somehow working now. However, the artifacts after stage 6 are more than I expected. Here, I have attached several samples in egs/l2-arctic/cascade/exp/TXHC_bdl_1032_vtn.tts_pt.v1/results/checkpoint-50000steps/TXHC_eval and egs/l2-arctic/cascade/exp/TXHC_bdl_1032_vtn.tts_pt.v1/results/checkpoint-50000steps/stage2_ppg_sxliu_checkpoint-50000steps_TXHC_eval. The spectrogram doesn't look the same between the stage 5 output and the stage 6 input; I wonder if it's due to a normalization error. I would appreciate it if you could help.
After stages 5 and 6, my mel output appears to be fine, but the wav output seems to be overflowing. Would you mind reviewing your output as well?
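Not part of the recipe, but a quick way to check for this kind of overflow is to look at the peak absolute amplitude of the float waveform before it is written as 16-bit PCM. A minimal sketch, with made-up sample values:

```python
import numpy as np

def peak_amplitude(wav: np.ndarray) -> float:
    """Peak absolute amplitude; values above 1.0 will clip or wrap
    when the float waveform is converted to 16-bit PCM."""
    return float(np.max(np.abs(wav)))

# Toy waveform standing in for the vocoder output (made-up values)
wav = np.array([0.1, -0.5, 1.8, -2.3, 0.4], dtype=np.float32)

peak = peak_amplitude(wav)      # 2.3 here, so this waveform would overflow
safe = np.clip(wav, -1.0, 1.0)  # clipping bounds the distortion, but fixing
                                # the feature scaling upstream is the real cure
```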
Hi @KevinGengGavo, I've tried to run the code on my local server again but I did not encounter the overflow issue. I can only suspect it's because of the new stft argument setting. Can you follow the official recommendation and try to use torch.view_as_real() to recover the real tensor?
@unilight Hi, Dr. Huang! In the cascade method, since mel spectrograms are used for feature extraction, I assume the feature extraction model is language-independent. If I want to train an accent conversion model on a different language, can I start here by using the provided model for feature extraction? Thanks!
Hi @Jasmijn888,
the mel spectrogram is indeed language-independent, so you can use it for any language. Though I don’t quite understand what you mean by “start here by using the provided model for feature extraction”. If you want to use your own dataset, you need to train, on your desired dataset, (1) a neural vocoder (e.g. ParallelWaveGAN) and (2) a non-parallel frame-based model provided by s3prl-vc.
Hi @unilight,
Hi @KevinGengGavo, I've tried to run the code on my local server again but I did not encounter the overflow issue. I can only suspect it's because of the new stft argument setting. Can you follow the official recommendation and try to use torch.view_as_real() to recover the real tensor?
Thanks, I've resolved the STFT problem with the modification mentioned earlier.
I also think I've pinpointed the cause of the audio overflow.
I denormalized the output feature in s3prl-vc-decode by modifying tools/venv/lib/python3.10/site-packages/s3prl_vc/bin/decode.py at line 257.
# model forward
out, _, _olens = model(hs, hlens, spk_embs=spemb, f0s=f0s)
if out.dim() != 2:
    out = out.squeeze(0)
# try denormalize
if "s3prl-vc-ppg_sxliu" in args.trg_stats:
    out = out * config["trg_stats"]["scale"] + config["trg_stats"]["mean"]
This adjustment delivered reasonable results for me. The mean CER and WER are now 30.2 and 52.5, respectively, similar to those reported in your paper. fac_cascade_denormalized.zip
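For anyone following along, that denormalization line is just the inverse of mean-variance normalization. A standalone numpy sketch of the round trip, with made-up statistics standing in for config["trg_stats"]:

```python
import numpy as np

# Hypothetical per-dimension target statistics (stand-ins for config["trg_stats"])
mean = np.array([2.0, -1.0, 0.5])
scale = np.array([0.5, 2.0, 1.0])  # per-dimension standard deviations

# A "model output" in normalized space (made-up values)
out_norm = np.array([[0.0, 1.0, -1.0],
                     [2.0, 0.0, 0.5]])

# Denormalize: undo (x - mean) / scale
out = out_norm * scale + mean

# Normalizing again recovers the original values, confirming the inverse
renorm = (out - mean) / scale
```

Feeding the still-normalized features to a vocoder trained on unnormalized mel statistics is exactly what would produce an overflowing waveform.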
I'm not sure if there was an issue during my data processing. I'll check whether stg also has the same problem.
@KevinGengGavo This is indeed a bug in the s3prl_vc package, and the solution is indeed to add the line to denormalize the feature. I have fixed it and published the latest s3prl_vc package. If anyone is still having this issue, make sure to update the s3prl_vc package to 0.3.1. Thanks!