autovc
F0-consistent many-to-many non-parallel voice conversion via conditional autoencoder
I'm trying to improve the model by implementing the pitch conditioning introduced in https://arxiv.org/abs/2004.07370. However, the process of producing the normalized quantized log-F0 seems a bit confusing, as there is more than one way to compute the mean µ and standard deviation σ.
A sample's pitch vector is a 1d array whose size is n, where n is the number of frames (which seems to be fixed at 128 according to https://github.com/auspicious3000/autovc/issues/6#issuecomment-509202251). So there are three ways of computing µ and σ:
Suppose f0 is extracted from a sample audio of speaker A.
- Compute µ and σ of each individual sample on the fly: `f0_norm = (f0 - f0.mean()) / f0.std() / 4`.
- Compute µ and σ for each speaker: `f0_norm = (f0 - f0s.mean()) / f0s.std() / 4`, where f0s is an A x 128 array with A being the total number of samples from speaker A.
- Compute universal µ and σ over all samples: `f0_norm = (f0 - f0s.mean()) / f0s.std() / 4`, where f0s is an N x 128 array with N being the total number of samples across all speakers (that is, A < N).
And assuming the answer is 2 or 3: for unseen-to-seen or unseen-to-unseen conversion, am I correct that µ and σ should be stored somewhere safe so I can reuse those values at inference time? (I guess option 2 doesn't really make sense there, since you can't compute those statistics for unseen speakers.)
The answer is 2. You will need µ and σ for inference. However, for unseen speakers, you can normalize using their own µ and σ, which is not a bad approximation.
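For reference, option 2 could be sketched roughly as below. The log scale, the voiced-frame masking, and the `/ 4` scaling follow the normalized quantized log-F0 description in the linked paper; `speaker_f0_stats` and `normalize_f0` are hypothetical helper names, and the repo's actual preprocessing may differ:

```python
import numpy as np

def speaker_f0_stats(f0_list):
    # Pool the voiced frames (f0 > 0) from all of one speaker's
    # utterances and compute stats in the log domain.
    voiced = np.concatenate([f0[f0 > 0] for f0 in f0_list])
    log_f0 = np.log(voiced)
    return log_f0.mean(), log_f0.std()

def normalize_f0(f0, mu, sigma):
    # Map voiced frames to roughly [-1, 1] via (log f0 - mu) / (4 * sigma);
    # unvoiced frames (f0 == 0) stay at 0.
    out = np.zeros_like(f0, dtype=np.float64)
    voiced = f0 > 0
    out[voiced] = (np.log(f0[voiced]) - mu) / (4 * sigma)
    return np.clip(out, -1.0, 1.0)
```

At inference time, the per-speaker (µ, σ) computed during preprocessing would be loaded and reused; for unseen speakers, the stats of the input utterance itself can substitute, as suggested above.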
@auspicious3000 Thanks for the response! As a followup question, could you confirm whether the following pipeline for data augmentation is correct?
Since we are now using randomly cropped audio segments, I suppose the previous requirement of a fixed length of 128 frames no longer holds as long as segment lengths are multiples of freq=32, so we instead zero-pad segments to match the length of the longest segment in the batch.
My concern is mostly about the order in which augmentation steps are performed.
- Each segment has different factors. For each audio in the batch:
  1. Draw a random number L ~ U(1, 3).
  2. Split the audio into (audio length / L) segments (not sure what to do if the final segment is shorter than L?).
  3. For each segment: compress or stretch it using a factor between 0.7 and 1.35, then change the signal power to between 10% and 100%.
- Segments from the same audio share the same factors. For each audio in the batch:
  1. Compress or stretch the audio using a factor between 0.7 and 1.35.
  2. Change the signal power to between 10% and 100%.
  3. Draw a random number L ~ U(1, 3).
  4. Split the audio into (audio length / L) segments (not sure what to do if the final segment is shorter than L?).
There is no need to split the audio. The post-processing length is the same within the batch. Just index from the spectrogram. For example, [0, 0.5,1, 1.5] and [0, 2, 4, 6] are two instances in the same batch with length=4, where the former is stretched and the latter is compressed.
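The indexing trick could be sketched as follows. `retime` is a hypothetical name, and fractional indices are simply rounded down here, whereas an actual implementation might interpolate between frames:

```python
import numpy as np

def retime(spec, rate, out_len):
    # spec: (T, n_mels) spectrogram. rate < 1 stretches and rate > 1
    # compresses, matching the [0, 0.5, 1, 1.5] vs. [0, 2, 4, 6] example:
    # both index patterns produce an output of the same batch length.
    idx = np.floor(np.arange(out_len) * rate).astype(int)
    idx = np.clip(idx, 0, spec.shape[0] - 1)  # guard against overrunning T
    return spec[idx]
```

The same index array can also be applied to the frame-level F0 contour, so that it stays aligned with the retimed mel spectrogram.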
Recently, I have been trying to improve the original autovc using F0 information. Using 256-dimensional one-hot vectors in the original autovc seems to perform well, but in the process of this improvement I found that using a 256-dimensional one-hot vector gets a very low MOS score for the speech. I want to know whether a one-hot vector can still be used in the F0-based improvement if zero-shot conversion is not needed.
@Miralan Yes. If you have N speakers, just use N-dimensional one-hot embedding.
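A minimal sketch of that embedding, replacing the d-vector speaker encoder when zero-shot conversion is not needed:

```python
import numpy as np

def one_hot_speaker(speaker_id, num_speakers):
    # One entry per seen speaker; no pretrained speaker encoder required.
    v = np.zeros(num_speakers, dtype=np.float32)
    v[speaker_id] = 1.0
    return v
```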
So if I time-stretch or compress the mel spectrogram, should I apply the same time-stretching or compression to the fundamental frequency sequence?
Yes
Miralan, very impressive to hear you're doing MOS experiments with F0 information applied. Is there anywhere I can hear your generated samples? Would love to talk more about this!
OK, I did that experiment a long time ago, so I can't find the results anymore. But I have tried concatenating normalized F0s, and it didn't work well; for example, some content was missing from the output wav. Maybe you can try CREPE to extract F0s.