nkcdy comments

Results: 21 comments of nkcdy

The output encoder

What I mean is the content encoder for the output signal, not the decoder.

How to generate mel spectrogram

> num_mels: 80
> fmin: 90
> fmax: 7600
> fft_size: 1024
> hop_size: 256
> min_level_db: -100
> ref_level_db: 16

Thanks a lot. The quality is improved with the...
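As a rough illustration of how the quoted framing and level parameters are typically used, here is a minimal sketch in plain Python. The helper names `num_frames` and `normalize_db` are hypothetical, and the centered-STFT frame count and the dB-to-[0, 1] mapping are assumptions based on common practice, not something this excerpt confirms about the project's own code.

```python
import math

# Values quoted in the comment above.
FFT_SIZE = 1024
HOP_SIZE = 256
MIN_LEVEL_DB = -100
REF_LEVEL_DB = 16


def num_frames(n_samples: int) -> int:
    """Frame count for a centered STFT (assumption: the signal is padded
    so that frames are centered on multiples of HOP_SIZE)."""
    return 1 + n_samples // HOP_SIZE


def normalize_db(magnitude: float) -> float:
    """Map a linear mel magnitude to [0, 1] on a dB scale.

    Assumed convention: floor the magnitude at min_level_db, subtract
    ref_level_db, then rescale [min_level_db, 0] dB to [0, 1] and clip.
    """
    floor = 10 ** (MIN_LEVEL_DB / 20)  # linear amplitude at min_level_db
    level_db = 20 * math.log10(max(magnitude, floor)) - REF_LEVEL_DB
    span = -MIN_LEVEL_DB               # width of the dB range, here 100
    scaled = (level_db - MIN_LEVEL_DB) / span
    return min(max(scaled, 0.0), 1.0)
```

Under these assumptions, silence maps to 0 and a magnitude at the reference level maps to 1, so a mismatch in either level parameter shifts or squashes the whole spectrogram range.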

How to generate mel spectrogram

Another question is about the speaker embeddings. The speaker embedding in metadata.pkl is a vector with 256 dimensions, but I got a matrix of size N*256 when I use...

How to generate mel spectrogram

Yes, the embedding in metadata.pkl is a vector of length 256. But I got several d-vectors of length 256 even when I use a single wave file (p225_001.wav). I did...
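One common way to reduce an N*256 set of d-vectors to a single 256-dimensional embedding is to average them element-wise. The excerpt does not say how the thread resolved this, so the sketch below is only an assumption, and `mean_dvector` is a hypothetical helper name. It uses plain Python lists to stay dependency-free; with NumPy the same reduction is a mean over the first axis.

```python
from typing import List, Sequence


def mean_dvector(dvectors: Sequence[Sequence[float]]) -> List[float]:
    """Average N d-vectors of equal length into one vector.

    Assumption: each row is one partial-utterance embedding and a plain
    arithmetic mean is an acceptable summary of the utterance.
    """
    if not dvectors:
        raise ValueError("need at least one d-vector")
    dim = len(dvectors[0])
    if any(len(v) != dim for v in dvectors):
        raise ValueError("all d-vectors must have the same length")
    count = len(dvectors)
    # Sum each dimension across all rows, then divide by the row count.
    return [sum(v[i] for v in dvectors) / count for i in range(dim)]
```

For example, three d-vectors of length 256 from one wave file would collapse to one list of 256 floats.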

How to generate mel spectrogram

It didn't work... :( I noticed that the sampling rate of the TIMIT corpus used in https://github.com/HarryVolek/PyTorch_Speaker_Verification is 16 kHz, while the sampling rate of the VCTK corpus is 48 kHz. Should I re-train...

How to generate mel spectrogram

> The details are described in the paper.

I still cannot reproduce your results as shown in the demo. What I got was babble. The sampling rate of all...

Why need original speaker embeddings concatenated with original speaker spectrogram?

@auspicious3000 I don't quite understand what you mean. Is the "encoder" you mentioned the content encoder or the speaker encoder? Can you please explain it in more detail? I can't...

Why need original speaker embeddings concatenated with original speaker spectrogram?

@auspicious3000 It's still not intuitive for me to understand what you said. Maybe I should read some of the related papers behind this theory. Do you have any recommended papers?

Why need original speaker embeddings concatenated with original speaker spectrogram?

@auspicious3000 Yes, from the view of the whole network, you are right. Suppose B is the emb_trg and A is the original content; then it is intuitive. My question is why emb_org is...

Why need original speaker embeddings concatenated with original speaker spectrogram?

@auspicious3000 Here is where I got confused. The content encoder is just an encoder, or just a network. A network can generate any possible output unless it gets some guidance....