Results 10 comments of vishalbhavani

Thanks for the clarification. I was more concerned about the identity mismatch than the accent difference. The voice texture of Boman and Prosenjit is not preserved in the generated audio....

Thanks for the confirmation. I can send a PR with TSA support. I am facing an issue: when I try to use the predicted mels from the code in...

Directly optimizing for the L1 loss using the code in inference.py (with the mel length fix) results in further deterioration. I can see that the forward passes in training and inference differ....

Actually, I kept the speaker encoding tensor trainable (initialized to `tgt_attributes["a_s"]`) and froze all parameters of `Tester`. Based on your suggestion, I used `inference_exact_pitch`, which improved the results. Trained on...
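The setup described above can be sketched as follows. This is a minimal NumPy illustration, not the actual UUVC code: a fixed random linear map stands in for the frozen `Tester`, and only the speaker vector receives L1-subgradient updates.

```python
import numpy as np

# Hypothetical stand-in for the frozen synthesizer: a fixed linear map from
# the speaker vector a_s to a mel "frame". This is NOT the real Tester model,
# just a proxy to illustrate the TSA-style optimization loop.
rng = np.random.default_rng(0)
W = rng.normal(size=(80, 16))      # frozen "model" weights (never updated)
target_mel = rng.normal(size=80)   # mel features of the target audio

a_s = np.zeros(16)                 # trainable speaker vector
lr = 1e-3
losses = []
for _ in range(5000):
    pred = W @ a_s
    losses.append(np.abs(pred - target_mel).mean())
    grad = W.T @ np.sign(pred - target_mel)  # subgradient of the L1 loss
    a_s -= lr * grad                         # only a_s is updated; W stays frozen
```

In the real setting the gradient would come from autograd through the frozen network rather than this closed-form subgradient.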

1. You mean plugging in the adversarial loss, right? I haven't used it yet. 2. The TSA proposed in NANSY does exactly this: it keeps the speaker representation trainable and backprops...

I agree with both. Sharing the code. Let me know if you want to look at the audio samples as well. [code.zip](https://github.com/b04901014/UUVC/files/10087631/code.zip) inference_exact_pitch.py contains the exact original code which is...

It is indeed high. I deliberately increased it during initial experiments and forgot to revert it. Yes, the loss is going down. I tried `5e-5`, but the improvement felt slow....

Also, why do we need https://github.com/b04901014/UUVC/blob/master/inference_exact_pitch.py#L162? I had to comment that line out to make the shapes match.

Also, the reconstructed audios have different volumes compared to the original audios. Any idea why that is?
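One common culprit for a volume mismatch is a plain energy difference between reconstruction and source. As a quick check (a hypothetical post-processing step, not part of the repo), the reconstruction can be rescaled so its RMS energy matches the original:

```python
import numpy as np

def match_rms(recon: np.ndarray, ref: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Rescale `recon` so its RMS energy matches `ref`'s."""
    rms_ref = np.sqrt(np.mean(ref ** 2))
    rms_rec = np.sqrt(np.mean(recon ** 2))
    return recon * (rms_ref / (rms_rec + eps))

# Example: a quiet copy of a signal is brought back to the reference level.
t = np.linspace(0.0, 1.0, 16000)
ref = np.sin(2 * np.pi * 220.0 * t)
quiet = 0.3 * ref
fixed = match_rms(quiet, ref)
```

If the levels still differ after this, the mismatch is more likely spectral (e.g. in the predicted mel energies) than a simple gain issue.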

So my current code has the following changes:
1. TSA to reconstruct the target audio
2. Speaker identity as a parameter, with the model frozen completely
3. Use exact duration and...
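Items 1-2 above can be sketched in PyTorch. The wrapper name, the model's call signature, and the `init_speaker` argument are all assumptions for illustration; the actual UUVC interfaces may differ.

```python
import torch

class SpeakerTSA(torch.nn.Module):
    """Freeze a pretrained model and expose only the speaker vector to training."""

    def __init__(self, model: torch.nn.Module, init_speaker: torch.Tensor):
        super().__init__()
        self.model = model
        for p in self.model.parameters():
            p.requires_grad_(False)                          # freeze the model completely
        self.a_s = torch.nn.Parameter(init_speaker.clone())  # trainable speaker identity

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Assumed call signature: the wrapped model consumes the content
        # features x together with the speaker vector.
        return self.model(x, self.a_s)

# Toy stand-in model with the assumed (x, a_s) signature.
class ToyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(16, 80)

    def forward(self, x, a_s):
        return x + self.proj(a_s)

tsa = SpeakerTSA(ToyModel(), torch.zeros(16))
trainable = [name for name, p in tsa.named_parameters() if p.requires_grad]
```

With this wrapping, an optimizer constructed over `tsa.parameters()` effectively updates only `a_s`, which matches the TSA recipe described in the comments above.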