melgan-neurips
about final loss?
Could you post your training loss curves for the LJSpeech dataset? After 3k epochs, the waves synthesized by my trained model are of poor quality compared to the released model "linda_johnson.pt". I wonder what your final losses are and whether there are any other training tricks. Thanks in advance. (Here are some synthesized samples: samples.zip)
@MorganCZY This model needs a lot of training steps. I've trained mine for a million steps and it sounds great.
But after 3k epochs my step count is already more than two million. Could you upload your TensorBoard graphs? I want to check whether the losses of my training run are on the right track.
I think it depends a lot on the quality of the mels from the taco (or whatever) model. Did you use your own model?
@m-toman I only tried to train a vocoder, not a whole TTS system, so ground-truth mel-spectrograms rather than the outputs of a taco are used to train this MelGAN.
What level of s_error is enough to get understandable audio?
Hello, how long does one training step take for you? With batch size 16 on an RTX 2080, one step takes me more than 3 seconds.
I trained the model on the Chinese corpus SLR38 for 1.2 million steps, and the generated result for an unseen speaker (from the same corpus but not in the training set) sounds really good.
It's worth noting that the generated audio still had some background noise at 0.9 million steps.
With batch size 2 on a single RTX 2080, the training speed is 100 steps per 17 seconds.
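For context, at that rate 1.2 million steps works out to roughly 1.2M × 0.17 s ≈ 204,000 s, i.e. about 2.4 days of training on that GPU.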
I changed the mel generation function a little to match other open-source implementations (e.g. https://github.com/fatchord/WaveRNN/blob/master/utils/dsp.py):
class Audio2Mel(nn.Module):
    def __init__(
        self,
        n_fft=1024,
        hop_length=256,
        win_length=1024,
        sampling_rate=16000,
        n_mel_channels=80,
        mel_fmin=0.0,
        mel_fmax=None,
        min_level_db=16,
    ):
        super().__init__()
        ##############################################
        # FFT Parameters                             #
        ##############################################
        window = torch.hann_window(win_length).float()
        mel_basis = librosa_mel_fn(
            sampling_rate, n_fft, n_mel_channels, mel_fmin, mel_fmax
        )
        mel_basis = torch.from_numpy(mel_basis).float()
        self.register_buffer("mel_basis", mel_basis)
        self.register_buffer("window", window)
        self.n_fft = n_fft
        self.hop_length = hop_length
        self.win_length = win_length
        self.sampling_rate = sampling_rate
        self.n_mel_channels = n_mel_channels
        self.min_level_db = min_level_db

    def forward(self, audio):
        p = (self.n_fft - self.hop_length) // 2
        audio = F.pad(audio, (p, p), "reflect").squeeze(1)
        fft = torch.stft(
            audio,
            n_fft=self.n_fft,
            hop_length=self.hop_length,
            win_length=self.win_length,
            window=self.window,
            center=False,
        )
        real_part, imag_part = fft.unbind(-1)
        magnitude = torch.sqrt(real_part ** 2 + imag_part ** 2)
        mel_output = torch.matmul(self.mel_basis, magnitude)
        log_mel_spec = 20. * torch.log10(torch.clamp(mel_output, min=1e-5))
        log_mel_spec = torch.clamp((log_mel_spec - self.min_level_db) / -self.min_level_db, 0, 1)
        return log_mel_spec
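For anyone wanting to sanity-check shapes, here is a minimal way to call the module (torch.randn is only a stand-in for a real 16 kHz clip, and this assumes a PyTorch version where torch.stft still returns a real tensor with a last dimension of 2, as the forward above expects):

import torch

audio2mel = Audio2Mel()              # defaults above; see the min_level_db discussion further down
audio = torch.randn(1, 1, 16000)     # (batch, channels=1, samples) -> 1 second at 16 kHz
mel = audio2mel(audio)               # (batch, n_mel_channels, frames)
print(mel.shape)                     # torch.Size([1, 80, 62]), roughly samples / hop_length frames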

@ye2020 How did you control the speaker embeddings? Besides, could you release some wav samples?
@MorganCZY Train the model on a multi-speaker corpus and it will generalize automatically. You can just listen to the audio samples released by the authors; the results in the "Samples along Training" section are very close to mine.
@ye2020 I've been training for 9 hours and it hasn't finished yet. How long did you train for those samples? Did you train on a CPU or a GPU?
Why min_level_db=16? I think it should be -100.
Yes, it's -100.
min_level_db = -100
sample_rate = 16000
n_fft = 1024
num_mels = 80
fmin = 90
fmax = 7600
hop_length = 256
win_length = 1024
ref_level_db = 16
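For what it's worth, here is the normalization I assume these two constants belong to, following the WaveRNN/Tacotron-style dsp code linked above; the ref_level_db subtraction is my assumption about how some implementations use it, not something taken verbatim from this repo:

import numpy as np

min_level_db = -100
ref_level_db = 16

def amp_to_db(x):
    # same 20 * log10 used in Audio2Mel.forward above
    return 20.0 * np.log10(np.maximum(1e-5, x))

def normalize(S):
    # maps the [min_level_db, 0] dB range onto [0, 1]
    return np.clip((S - min_level_db) / -min_level_db, 0, 1)

# Some implementations subtract ref_level_db before normalizing; others skip it.
mel_db = amp_to_db(np.random.rand(80, 100))   # stand-in for a mel magnitude spectrogram
mel_norm = normalize(mel_db - ref_level_db)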
Is ref_level_db actually used?
@ye2020 When you run inference with the vocoder, what do you feed it? Is it the mel output of your Tacotron?
@ye2020 Could you please send me the complete training code for AutoVC? I am struggling to get the same output. You can email me at [email protected].
With the same Chinese corpus and the same hyperparameters, after training for 1.2 million steps the waveforms I get still sound very noisy. I don't know why.