vits2_pytorch

AlignerNet instead of MAS

Open codeghees opened this issue 1 year ago • 18 comments

Is it possible to use AlignerNet (aligner.py in pflow-tts repo) instead of MAS in VITS2?

What should be changed in the code? I am a bit confused on what the inputs should be.

codeghees avatar Mar 07 '24 00:03 codeghees

Check out pflow repo for guidance.

p0p4k avatar Mar 07 '24 00:03 p0p4k

Hey, thanks for replying!

I took most of the code from that repo. I am trying to debug why my alignment curve looks like this: [alignment plot image]

align_loss is being added to duration loss. Inputs:

aln_hard, aln_soft, aln_log, aln_mask = self.aligner(
    m_p.transpose(1, 2), x_mask, y, y_mask
)
attn = aln_mask.transpose(1, 2).unsqueeze(1)
align_loss = self.aligner_loss(aln_log, x_lengths, y_lengths)
m_p is returned by TextEncoder.

Appreciate any insights!

codeghees avatar Mar 07 '24 01:03 codeghees
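For reference, the mask/shape plumbing around an aligner call like the one above can be sketched as follows (a minimal numpy sketch; the `sequence_mask` helper and the `[B, T_text, C]` text-input convention are assumptions modeled on the pflow-tts / naturalspeech2-pytorch style, not the exact VITS2 code):

```python
import numpy as np

def sequence_mask(lengths, max_len):
    # Build a [B, 1, T] binary mask from per-item lengths (1 = real timestep).
    steps = np.arange(max_len)[None, :]            # [1, T]
    mask = steps < np.asarray(lengths)[:, None]    # [B, T]
    return mask.astype(np.float32)[:, None, :]     # [B, 1, T]

# Hypothetical batch matching the shapes printed later in this thread.
B, C, T_text, n_mel, T_mel = 32, 192, 74, 80, 293
rng = np.random.default_rng(0)
x_lengths = rng.integers(40, T_text + 1, size=B)
y_lengths = rng.integers(200, T_mel + 1, size=B)

x_mask = sequence_mask(x_lengths, T_text)          # [32, 1, 74]
y_mask = sequence_mask(y_lengths, T_mel)           # [32, 1, 293]

# Assumed aligner convention: text features passed as [B, T_text, C],
# which is what m_p.transpose(1, 2) produces from the TextEncoder output.
m_p = np.zeros((B, C, T_text), dtype=np.float32)
text_in = np.swapaxes(m_p, 1, 2)                   # [32, 74, 192]
```

If the aligner actually expects `[B, C, T_text]` instead, the transpose would silently scramble the time axis, which is one of the first things worth ruling out here.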

It might be that it is still learning the durations. I think MAS is good enough; what is important is the duration_predictor module.

p0p4k avatar Mar 07 '24 01:03 p0p4k

The graph above is after training for several days and thousands of steps. It seems like some bug, maybe a shape/size mismatch or something similar.

In the output, the first word is legible but the rest is basically gibberish.

codeghees avatar Mar 07 '24 01:03 codeghees

I see. Maybe it needs some fixing; if you can start a PR we can debug this together. I am busy with other stuff.

p0p4k avatar Mar 07 '24 01:03 p0p4k

Probably the same problem as we have in pflow.

Tera2Space avatar Mar 07 '24 14:03 Tera2Space

@Tera2Space What was the problem?

codeghees avatar Mar 07 '24 14:03 codeghees

> @Tera2Space What was the problem?

Code Geass, nice. The problem was that it generated a wrong alignment; I just thought of a possible reason: https://github.com/p0p4k/pflowtts_pytorch/issues/24#issuecomment-1983667975

in your model, what shape is input to alignernet?

Tera2Space avatar Mar 07 '24 15:03 Tera2Space

@Tera2Space @p0p4k I added a basic PR of the changes I have so far: https://github.com/p0p4k/vits2_pytorch/pull/82

codeghees avatar Mar 08 '24 20:03 codeghees

For a single batch - shapes look something like this:

m_p torch.Size([32, 192, 74])
x_mask torch.Size([32, 1, 74])
y_mask torch.Size([32, 1, 293])
m_p torch.Size([32, 192, 74])
x torch.Size([32, 192, 74])
y torch.Size([32, 80, 293])
m_p.transpose(1,2) torch.Size([32, 74, 192])

codeghees avatar Mar 08 '24 20:03 codeghees
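Given these shapes, one cheap consistency check is that a hard alignment must assign every real mel frame to exactly one text token, so the durations recovered from `attn` have to sum to `y_lengths`. A numpy sketch (the `[B, 1, T_text, T_mel]` layout for `attn` after `aln_mask.transpose(1, 2).unsqueeze(1)` is my assumption about the convention in use):

```python
import numpy as np

def durations_from_attn(attn):
    # attn: [B, 1, T_text, T_mel] hard (0/1) alignment.
    # Summing over the mel axis yields per-token durations in frames.
    return attn.sum(axis=3).squeeze(1)              # [B, T_text]

# Toy hard alignment: one item, 3 tokens, 5 mel frames, durations [2, 2, 1].
aln_mask = np.zeros((1, 5, 3), dtype=np.float32)    # [B, T_mel, T_text]
for frame, token in enumerate([0, 0, 1, 1, 2]):
    aln_mask[0, frame, token] = 1.0

attn = aln_mask.transpose(0, 2, 1)[:, None]         # [1, 1, 3, 5]
durs = durations_from_attn(attn)                    # [[2., 2., 1.]]
```

If the same check on the real `aln_mask` gives durations that do not sum to `y_lengths`, the mask (or the axis order being transposed) is the bug.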

Hm, then my guess was wrong, or there are more problems. I tested my idea on pflow and it didn't work either, so we probably need to check for other problems. I think we should use something like https://github.com/lucidrains/naturalspeech2-pytorch as a reference.

Tera2Space avatar Mar 08 '24 21:03 Tera2Space

IIRC, I had yanked the AlignerNet from there.

p0p4k avatar Mar 08 '24 21:03 p0p4k

Yeah, I'm currently trying out various inputs to the Aligner; it's possibly an input issue, maybe with the masks.

codeghees avatar Mar 08 '24 22:03 codeghees

I'm still not convinced about putting effort into AlignerNet; we should focus on a better TextEncoder instead.

p0p4k avatar Mar 09 '24 23:03 p0p4k

Do you think that's a bottleneck right now?

codeghees avatar Mar 09 '24 23:03 codeghees

For VITS2 it should be the duration predictor, and for pflow both the TextEncoder and the duration predictor. MAS gives good alignments during training; it is during inference that these models perform worse.

p0p4k avatar Mar 10 '24 20:03 p0p4k

> pflow it should be both textencoder

How do you think we can improve pflow's encoder?

Tera2Space avatar Mar 12 '24 21:03 Tera2Space

Is padding included in the encoder timesteps (it seems to me that it is)? You can remove the padded part from the plot.

nicemanis avatar Jul 01 '24 06:07 nicemanis
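The suggestion above can be sketched like this (numpy; the `[B, T_mel, T_text]` layout for `aln` and the function name are illustrative assumptions): crop each batch item's alignment matrix to its true lengths before plotting, so the padded region does not distort the curve.

```python
import numpy as np

def crop_alignment(aln, x_lengths, y_lengths, idx=0):
    # aln: [B, T_mel, T_text] soft or hard alignment, padding included.
    # Returns only the valid (unpadded) region for batch item `idx`.
    return aln[idx, : int(y_lengths[idx]), : int(x_lengths[idx])]

# Toy batch: padded to [1, 6, 4], but the real item is 5 frames x 3 tokens.
aln = np.ones((1, 6, 4), dtype=np.float32)
cropped = crop_alignment(aln, x_lengths=[3], y_lengths=[5])  # shape (5, 3)
```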