DiffSinger
Why use target energy to generate energy embeddings during training?
Hi guys,
Thank you very much for your great work!
I have a question about the `get_energy_embedding` function in modules.py.
I see that during training, if the target energy values are not None, the model uses the targets to generate the energy embeddings instead of the predicted values. Why is that?
```python
def get_energy_embedding(self, x, target, mask, control):
    x = x.detach() + self.predictor_grad * (x - x.detach())
    prediction = self.energy_predictor(x, squeeze=True)
    if target is not None:
        embedding = self.energy_embedding(torch.bucketize(target, self.energy_bins))
    else:
        prediction = prediction * control
        embedding = self.energy_embedding(
            torch.bucketize(prediction, self.energy_bins)
        )
    return prediction, embedding
```
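To make the two branches concrete, here is a minimal, self-contained sketch of the target-vs-prediction logic, with hypothetical bin boundaries and embedding sizes (the real values come from the repo's config, not from here):

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only.
n_bins = 256
energy_bins = torch.linspace(-2.0, 2.0, n_bins - 1)  # bucket boundaries
energy_embedding = nn.Embedding(n_bins, 8)

target = torch.tensor([[-1.5, 0.0, 1.7]])      # ground-truth energy (training)
prediction = torch.tensor([[-1.2, 0.3, 1.9]])  # predictor output (inference)

# Training branch: embed the *target* energy (teacher forcing), so the
# downstream decoder is conditioned on clean values even while the
# energy predictor is still inaccurate.
train_emb = energy_embedding(torch.bucketize(target, energy_bins))

# Inference branch: no target exists, so embed the prediction instead.
infer_emb = energy_embedding(torch.bucketize(prediction, energy_bins))

print(train_emb.shape, infer_emb.shape)  # both torch.Size([1, 3, 8])
```

The predictor itself is still trained against the target via a separate loss on `prediction`; only the conditioning signal fed forward is teacher-forced.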
Conversely, the model uses the predicted pitch to generate pitch embeddings.
```python
def get_pitch_embedding(self, decoder_inp, f0, uv, mel2ph, control, encoder_out=None):
    pitch_pred = f0_denorm = cwt = f0_mean = f0_std = None
    if self.pitch_type == "ph":
        pitch_pred_inp = encoder_out.detach() + self.predictor_grad * (encoder_out - encoder_out.detach())
        pitch_padding = encoder_out.sum().abs() == 0
        pitch_pred = self.pitch_predictor(pitch_pred_inp) * control
        if f0 is None:
            f0 = pitch_pred[:, :, 0]
        f0_denorm = denorm_f0(f0, None, self.preprocess_config["preprocessing"]["pitch"], pitch_padding=pitch_padding)
        pitch = f0_to_coarse(f0_denorm)  # starts from 0, [B, T_txt]
        pitch = F.pad(pitch, [1, 0])
        pitch = torch.gather(pitch, 1, mel2ph)  # [B, T_mel]
        pitch_embed = self.pitch_embed(pitch)
```
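As an aside, the `F.pad` + `torch.gather` pair at the end expands phoneme-level pitch to mel-frame level via `mel2ph`. A toy sketch (invented values, not from the repo) of just that step:

```python
import torch
import torch.nn.functional as F

# Toy example: 3 phonemes, 6 mel frames.
pitch = torch.tensor([[10, 20, 30]])         # coarse pitch per phoneme [B, T_txt]
mel2ph = torch.tensor([[1, 1, 2, 2, 3, 0]])  # 1-based phoneme index per mel frame; 0 = padding

# Prepend a zero so that padding frames (mel2ph == 0) gather a zero pitch id.
pitch = F.pad(pitch, [1, 0])                  # [B, T_txt + 1]
frame_pitch = torch.gather(pitch, 1, mel2ph)  # [B, T_mel]
print(frame_pitch)  # tensor([[10, 10, 20, 20, 30, 0]])
```

Note that in this function, too, the target `f0` is used whenever it is provided; the prediction is only substituted when `f0 is None`.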
Could you please help answer this?
Thank you!