
l2-arctic/cascade vocoder issue after stage 6

Open KevinGengGavo opened this issue 1 year ago • 12 comments

Hi @unilight, long time no see. Congratulations on your graduation! I should call you sensei now!

Issue

I finished --stage -1 through stage 5 and generated promising, non-accented, bdl-like voices. However, during the non-parallel conversion, we don't have access to the vocoder checkpoint referenced in the config:

vocoder:
  checkpoint: /data/group1/z44476r/Experiments/ParallelWaveGAN/egs/l2-arctic/voc1/exp/train_nodev_TXHC_parallel_wavegan.v1/checkpoint-105000steps.pkl

Shall we change this to our local pwg_TXHC, as you mentioned in lsc/README.md?
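For concreteness, I assume the change is just repointing the checkpoint path to a local copy, e.g. (the path below is hypothetical; substitute wherever your local pwg_TXHC checkpoint actually lives):

```yaml
vocoder:
  # hypothetical local path; point this at your trained/downloaded pwg_TXHC
  checkpoint: downloads/pwg_TXHC/checkpoint-400000steps.pkl
```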

Looking forward to your reply!

KevinGengGavo avatar Apr 19 '24 11:04 KevinGengGavo

Also, I wonder whether --norm_name self in stage 4 is necessary.

As you mentioned in README.md, the default norm_name before stage 3 is ljspeech, so only dump/*/norm_ljspeech will exist (not dump/*/norm_self) for stage 4 decoding.

Should I ignore this?

KevinGengGavo avatar Apr 19 '24 11:04 KevinGengGavo

Hi @KevinGengGavo,

long time no see. Congratulations on your graduation! I should call you sensei now!

Sorry I am not quite sure who you are... but thank you :)

Shall we change this to our local pwg_TXHC? As you mentioned in lsc/README.md.

Yes! Please do so.

Also I wonder if the --norm_name self in stage 4 is necessary.

You are right, it is not needed... Sorry I didn't check carefully (and obviously you are probably the only person so far who has tried the implementation).

unilight avatar Apr 20 '24 11:04 unilight

Hi, I’ve tried the pwg_TXHC vocoder after stage 5, and it's working now. However, there are more artifacts after stage 6 than I expected.

Here, I have attached several samples in egs/l2-arctic/cascade/exp/TXHC_bdl_1032_vtn.tts_pt.v1/results/checkpoint-50000steps/TXHC_eval and egs/l2-arctic/cascade/exp/TXHC_bdl_1032_vtn.tts_pt.v1/results/checkpoint-50000steps/stage2_ppg_sxliu_checkpoint-50000steps_TXHC_eval.

The spectrograms of the stage 5 output and the stage 6 input don't look the same; I wonder if it's due to a normalization error.

I would appreciate it if you could help.

seq2seq-vc_isuues.zip

KevinGengGavo avatar Apr 20 '24 23:04 KevinGengGavo

Besides, a version conflict occurred during stage 6 at tools/venv/lib/python3.10/site-packages/s3prl_vc/upstream/ppg_sxliu/stft.py, line 95.

Here is my PyTorch environment:

torch                    2.0.1
torch-complex            0.4.3
torchaudio               2.0.2

Based on the torch.stft documentation, the return_complex argument is now required, but the original implementation omitted it.

I set return_complex=False, and you can see the ComplexTensor output in nvpc_decode.log. I'm not sure if this is correct, but it was the only way I found to make the code run.

KevinGengGavo avatar Apr 20 '24 23:04 KevinGengGavo

Hi @unilight, I would appreciate it if you could take some time to look at this. Thank you.

KevinGengGavo avatar Apr 30 '24 23:04 KevinGengGavo

Besides, a version conflict occurred during stage 6 at tools/venv/lib/python3.10/site-packages/s3prl_vc/upstream/ppg_sxliu/stft.py, line 95.

Here is my PyTorch environment:

torch                    2.0.1
torch-complex            0.4.3
torchaudio               2.0.2

Based on the torch.stft documentation, the return_complex argument is now required, but the original implementation omitted it.

I set return_complex=False, and you can see the ComplexTensor output in nvpc_decode.log. I'm not sure if this is correct, but it was the only way I found to make the code run.

I got the same issue. You can try adding this block at line 641:

        if not return_complex:
            return torch.view_as_real(
                _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
                         normalized, onesided, return_complex=True)
            )

This works for me. Good luck!

Jasmijn888 avatar May 04 '24 23:05 Jasmijn888

Hi @Jasmijn888, thanks for your reply. I fixed this issue by adding torch.view_as_real in tools/venv/lib/python3.10/site-packages/s3prl_vc/upstream/ppg_sxliu/stft.py.

Here are my lines 69 to 96:

        # or (Batch, Channel, Freq, Frames, 2=real_imag)
        if not self.kaldi_padding_mode:
            output = torch.stft(
                input,
                n_fft=self.n_fft,
                win_length=self.win_length,
                hop_length=self.hop_length,
                center=self.center,
                pad_mode=self.pad_mode,
                normalized=self.normalized,
                onesided=self.onesided,
                return_complex=True,
            )
        else:
            # NOTE(sx): Use Kaldi-fasion padding, maybe wrong
            num_pads = self.n_fft - self.win_length
            input = torch.nn.functional.pad(input, (num_pads, 0))
            output = torch.stft(
                input,
                n_fft=self.n_fft,
                win_length=self.win_length,
                hop_length=self.hop_length,
                center=False,
                pad_mode=self.pad_mode,
                normalized=self.normalized,
                onesided=self.onesided,
                return_complex=True,
            )
        # Change complex output to real and imag parts
        output = torch.view_as_real(output)

I don't recommend modifying the PyTorch source code anyway. However, thanks for your feedback!
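A less invasive alternative is to wrap torch.stft once in your own code rather than patching files in site-packages; a sketch (stft_real is a hypothetical helper name, not from the repo):

```python
import torch

def stft_real(x, **stft_kwargs):
    """Hypothetical helper: call torch.stft the new way, then return the
    legacy (..., 2) real/imag layout expected by older downstream code."""
    stft_kwargs["return_complex"] = True  # force the new-style complex output
    return torch.view_as_real(torch.stft(x, **stft_kwargs))

# Example call with illustrative STFT settings
out = stft_real(
    torch.randn(16000),            # 1 s of fake mono audio at 16 kHz
    n_fft=512,
    hop_length=128,
    win_length=512,
    window=torch.hann_window(512),
)
# out has shape (freq, frames, 2): real and imaginary parts stacked last
```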

KevinGengGavo avatar May 05 '24 01:05 KevinGengGavo

@Jasmijn888 I'm more curious about your output after stage 6. How does it sound?

Hi, I’ve tried the pwg_TXHC vocoder after stage 5, and it's working now. However, there are more artifacts after stage 6 than I expected.

Here, I have attached several samples in egs/l2-arctic/cascade/exp/TXHC_bdl_1032_vtn.tts_pt.v1/results/checkpoint-50000steps/TXHC_eval and egs/l2-arctic/cascade/exp/TXHC_bdl_1032_vtn.tts_pt.v1/results/checkpoint-50000steps/stage2_ppg_sxliu_checkpoint-50000steps_TXHC_eval.

The spectrograms of the stage 5 output and the stage 6 input don't look the same; I wonder if it's due to a normalization error.

I would appreciate it if you could help.

seq2seq-vc_isuues.zip

After stages 5 and 6, my mel output appears to be fine, but the wav output seems to be overflowing. Would you mind reviewing your output as well?

KevinGengGavo avatar May 05 '24 01:05 KevinGengGavo

Hi @KevinGengGavo, I've tried running the code on my local server again, but I did not encounter the overflow issue. I can only suspect it's caused by the new stft argument setting. Can you follow the official recommendation and use torch.view_as_real() to recover the real tensor?
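Something like this (a minimal sketch of that pattern; the shapes and STFT settings are illustrative, not taken from s3prl_vc):

```python
import torch

x = torch.randn(1, 16000)          # dummy waveform batch
window = torch.hann_window(512)

# New-style call: request a complex tensor...
spec = torch.stft(x, n_fft=512, hop_length=128, win_length=512,
                  window=window, return_complex=True)

# ...then recover the old real/imag layout the rest of the code expects
real_imag = torch.view_as_real(spec)   # (batch, freq, frames, 2)
```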

unilight avatar May 06 '24 15:05 unilight

@unilight Hi, Dr. Huang! In the cascade method, since mel spectrograms are used for feature extraction, I assume the feature extraction model is language-independent. If I want to train an accent conversion model on a different language, can I start here by using the provided model for feature extraction? Thanks!

Jasmijn888 avatar May 07 '24 01:05 Jasmijn888

Hi @Jasmijn888,

The mel spectrogram is indeed language-independent, so you can use it for any language. Though I don't quite understand what you mean by "start here by using the provided model for feature extraction". If you want to use your own dataset, you need to train, on your desired dataset, (1) a neural vocoder (e.g., ParallelWaveGAN) and (2) a non-parallel frame-based model provided by s3prl-vc.

unilight avatar May 07 '24 07:05 unilight

Hi @unilight,

Hi @KevinGengGavo, I've tried to run the code on my local server again but I did not encounter the overflow issue. I can only suspect it's because of the new stft argument setting. Can you follow the official recommendation and try to use torch.view_as_real() to recover the real tensor?

Thanks, I've resolved the STFT problem with the modification mentioned earlier.

I also think I've pinpointed the cause of the audio overflow: the output feature was never denormalized in s3prl-vc-decode. I fixed it by modifying tools/venv/lib/python3.10/site-packages/s3prl_vc/bin/decode.py at line 257:

            # model forward
            out, _, _olens = model(hs, hlens, spk_embs=spemb, f0s=f0s)
            if out.dim() != 2:
                out = out.squeeze(0)
            
            # try denormalizing the output with the target stats
            if "s3prl-vc-ppg_sxliu" in args.trg_stats:
                out = out * config["trg_stats"]["scale"] + config["trg_stats"]["mean"]
            

This adjustment delivered reasonable results for me: the mean CER and WER are now 30.2 and 52.5, respectively, similar to those reported in your paper. fac_cascade_denormalized.zip

I'm not sure whether something went wrong during my data processing. I'll check whether stg has the same problem.
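For reference, the denormalization in the patch is just the inverse of standard mean/scale normalization; a quick sketch with made-up stats (mirroring the trg_stats mean/scale fields used above):

```python
import numpy as np

rng = np.random.default_rng(0)
mean, scale = 2.5, 0.8                  # hypothetical target-domain stats
feat = rng.standard_normal((10, 80))    # e.g. 10 frames of 80-dim mel

norm = (feat - mean) / scale            # the domain the model predicts in
denorm = norm * scale + mean            # the fix: map back to the mel domain
# denorm recovers feat, so the vocoder sees features in the right range
```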

KevinGengGavo avatar May 09 '24 06:05 KevinGengGavo

@KevinGengGavo This is indeed a bug in the s3prl_vc package, and the solution is indeed to add the line to denormalize the feature. I have fixed it and published the latest s3prl_vc package. If anyone is still having this issue, make sure to update the s3prl_vc package to 0.3.1. Thanks!

unilight avatar Jun 26 '24 05:06 unilight