melgan-neurips
about final loss?
Could you post your training loss curves for the LJSpeech dataset? After 3k epochs, the waves synthesized by my trained model are of poor quality compared to the released model "linda_johnson.pt". I wonder what your final losses are and whether there are any other training tricks. Thanks in advance. (Here are some synthesized samples: samples.zip)
@MorganCZY This model needs a lot of training steps. I've trained mine for a million steps and it sounds great.
But after 3k epochs my step count is already more than two million. Could you upload your TensorBoard graphs? I want to check whether the losses of my training run are on the right track.
I think it depends a lot on the quality of the mels from the taco (or whatever) model. Did you use your own model?
@m-toman I only tried to train a vocoder, not a whole TTS system, so ground-truth mel-spectrograms rather than the outputs of a taco are used to train this MelGAN.
What level of s_error is enough to get understandable audio?
Hello, how long does one training step take for you? With batch size 16 on an RTX 2080, one step takes me more than 3 seconds.
I trained the model on the Chinese corpus SLR38 for 1.2 million steps, and the generated result for an unseen speaker (from the same corpus but not in the training set) sounds really good.
It's worth noting that the generated audio still had some background noise at 0.9 million steps.
With batch size 2 on a single RTX 2080, the training speed is 100 steps per 17 seconds.
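For context, at that rate 1.2 million steps works out to roughly 1.2M × 0.17 s ≈ 204,000 s, i.e. about 2.4 days of training on that GPU.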
I changed the mel generation function a little to match other open-source implementations (e.g. https://github.com/fatchord/WaveRNN/blob/master/utils/dsp.py):
class Audio2Mel(nn.Module):
    def __init__(
        self,
        n_fft=1024,
        hop_length=256,
        win_length=1024,
        sampling_rate=16000,
        n_mel_channels=80,
        mel_fmin=0.0,
        mel_fmax=None,
        min_level_db=16,
    ):
        super().__init__()
        ##############################################
        # FFT Parameters                             #
        ##############################################
        window = torch.hann_window(win_length).float()
        mel_basis = librosa_mel_fn(
            sampling_rate, n_fft, n_mel_channels, mel_fmin, mel_fmax
        )
        mel_basis = torch.from_numpy(mel_basis).float()
        self.register_buffer("mel_basis", mel_basis)
        self.register_buffer("window", window)
        self.n_fft = n_fft
        self.hop_length = hop_length
        self.win_length = win_length
        self.sampling_rate = sampling_rate
        self.n_mel_channels = n_mel_channels
        self.min_level_db = min_level_db

    def forward(self, audio):
        p = (self.n_fft - self.hop_length) // 2
        audio = F.pad(audio, (p, p), "reflect").squeeze(1)
        fft = torch.stft(
            audio,
            n_fft=self.n_fft,
            hop_length=self.hop_length,
            win_length=self.win_length,
            window=self.window,
            center=False,
        )
        real_part, imag_part = fft.unbind(-1)
        magnitude = torch.sqrt(real_part ** 2 + imag_part ** 2)
        mel_output = torch.matmul(self.mel_basis, magnitude)
        log_mel_spec = 20. * torch.log10(torch.clamp(mel_output, min=1e-5))
        log_mel_spec = torch.clamp((log_mel_spec - self.min_level_db) / -self.min_level_db, 0, 1)
        return log_mel_spec
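For anyone wanting to sanity-check shapes, here is a minimal way to call the module (torch.randn is only a stand-in for a real 16 kHz clip, and this assumes a PyTorch version where torch.stft still returns a real tensor with a last dimension of 2, as the forward above expects):

import torch

audio2mel = Audio2Mel()              # defaults above; see the min_level_db discussion further down
audio = torch.randn(1, 1, 16000)     # (batch, channels=1, samples) -> 1 second at 16 kHz
mel = audio2mel(audio)               # (batch, n_mel_channels, frames)
print(mel.shape)                     # torch.Size([1, 80, 62]), roughly samples / hop_length frames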

@ye2020 How did you control the speaker embeddings? Besides, could you release some wav samples?
@MorganCZY Train the model on a multi-speaker corpus and it will generalize automatically. You can just listen to the audio samples released by the authors; the results in the "Samples along Training" section are very close to mine.
@ye2020 I've been training for 9 hours and it hasn't finished yet. How long did you train for those samples? Did you train on a CPU or a GPU?
Why min_level_db=16? I think it should be -100.
Yes, it's -100.
min_level_db = -100
sample_rate = 16000
n_fft = 1024
num_mels = 80
fmin = 90
fmax = 7600
hop_length = 256
win_length = 1024
ref_level_db = 16
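For what it's worth, here is the normalization I assume these two constants belong to, following the WaveRNN/Tacotron-style dsp code linked above; the ref_level_db subtraction is my assumption about how some implementations use it, not something taken verbatim from this repo:

import numpy as np

min_level_db = -100
ref_level_db = 16

def amp_to_db(x):
    # same 20 * log10 used in Audio2Mel.forward above
    return 20.0 * np.log10(np.maximum(1e-5, x))

def normalize(S):
    # maps the [min_level_db, 0] dB range onto [0, 1]
    return np.clip((S - min_level_db) / -min_level_db, 0, 1)

# Some implementations subtract ref_level_db before normalizing; others skip it.
mel_db = amp_to_db(np.random.rand(80, 100))   # stand-in for a mel magnitude spectrogram
mel_norm = normalize(mel_db - ref_level_db)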
Is ref_level_db actually used?
@ye2020 When you run inference with the vocoder, what do you feed it? Is it the mel output of your Tacotron?
@ye2020 Could you please send me the complete training code for AutoVC? I am struggling to get the same output. You can email me at [email protected].
With the same Chinese corpus and the same hyperparameters, after training for 1.2 million steps the waveforms I get still sound very noisy. I don't know why.