Kaizhi Qian
Sounds correct, but you don't need to remove the validation part.
Thanks. The code is correct. 2 seconds.
Neither affects the performance nor the training speed.
Either way works. We did not exceed 3 seconds, so that it fits into memory.
You can use one-hot embeddings if you are not doing zero-shot conversion. I implemented my own speaker encoder, which has not been released. Resemblyzer is just a similar implementation...
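A minimal sketch of the two options (the speaker count, speaker ID, and file path are placeholders, and the released models may expect a different embedding size):

```python
import torch
import torch.nn.functional as F
from resemblyzer import VoiceEncoder, preprocess_wav  # pip install resemblyzer

# Option 1: one-hot speaker embedding (seen speakers only, no zero-shot).
# N_SPEAKERS and speaker_id are illustrative values, not from the repo.
N_SPEAKERS = 4
speaker_id = 2
spk_emb_onehot = F.one_hot(torch.tensor(speaker_id), num_classes=N_SPEAKERS).float()

# Option 2: a d-vector style speaker encoder such as Resemblyzer (256-dim),
# which also works for speakers unseen during training (zero-shot).
wav = preprocess_wav("target_speaker.wav")  # hypothetical file path
spk_emb_dvec = torch.from_numpy(VoiceEncoder().embed_utterance(wav)).float()
```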
No fundamental difference
By conditioning on the speaker embedding, the model changes the rhythm and timbre at the same time.
```python
with torch.no_grad():
    spect_output, len_spect = P.infer_onmt(cep_real_A.transpose(2, 1)[:, :14, :], real_mask_A, len_real_A, spk_emb_B)
```
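Reading the argument names, `cep_real_A`, `real_mask_A`, and `len_real_A` appear to be the source utterance's cepstral features, padding mask, and length, while `spk_emb_B` is the target speaker's embedding, so the output spectrogram should follow speaker B's rhythm and timbre.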
So that the encoder does not need to learn that information from the spectrogram.
It is the content encoder. Without the speaker embedding, it is harder for the encoder to learn that information from the spectrogram. Since you already have that info, just give it to the...
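A minimal sketch of this idea, with illustrative module names and dimensions rather than the released code: the target speaker embedding is broadcast along time and concatenated with the content codes, so the decoder gets speaker identity directly and the content encoder does not have to carry it.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Toy decoder consuming content codes plus a broadcast speaker embedding."""
    def __init__(self, content_dim=8, spk_dim=256, out_dim=80):
        super().__init__()
        self.rnn = nn.LSTM(content_dim + spk_dim, 512, batch_first=True)
        self.proj = nn.Linear(512, out_dim)

    def forward(self, content_codes, spk_emb):
        # content_codes: (B, T, content_dim); spk_emb: (B, spk_dim)
        spk = spk_emb.unsqueeze(1).expand(-1, content_codes.size(1), -1)
        x = torch.cat([content_codes, spk], dim=-1)  # speaker info fed straight to the decoder
        h, _ = self.rnn(x)
        return self.proj(h)  # predicted spectrogram frames

# usage: 2 utterances, 100 frames of content codes, 256-dim speaker embeddings
mel = Decoder()(torch.randn(2, 100, 8), torch.randn(2, 256))  # -> (2, 100, 80)
```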