autovc
F0-consistent many-to-many non-parallel voice conversion via conditional autoencoder
I'm trying to improve the model by implementing the pitch conditioning introduced in https://arxiv.org/abs/2004.07370. However, the process of producing the normalized quantized log-F0 seems a bit confusing, as there is more than one way to compute the mean µ and standard deviation σ.
A sample's pitch vector is a 1d array whose size is n, where n is the number of frames (which seems to be fixed at 128 according to https://github.com/auspicious3000/autovc/issues/6#issuecomment-509202251). So there are three ways of computing µ and σ:
Suppose f0 is extracted from a sample audio of speaker A.
- Compute µ and σ of each individual sample on the fly: `f0_norm = (f0 - f0.mean()) / f0.std() / 4`.
- Compute µ and σ for each speaker: `f0_norm = (f0 - f0s.mean()) / f0s.std() / 4`, where f0s is an A x 128 array with A being the total number of samples from speaker A.
- Compute universal µ and σ over all samples: `f0_norm = (f0 - f0s.mean()) / f0s.std() / 4`, where f0s is an N x 128 array with N being the total number of samples across all speakers (that is, A < N).
And assuming the answer is 2 or 3: for unseen-to-seen or unseen-to-unseen conversion, am I correct that µ and σ should be stored somewhere safe so I can reuse those values at inference time? (I guess option 2 doesn't really make sense there, since you can't compute those statistics for unseen speakers.)
The answer is 2. You will need µ and σ for inference. However, for unseen speakers, you can normalize using their own µ and σ, which is not a bad approximation.
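For reference, option 2 could be sketched roughly as below. The log scale, the voiced-frame masking, and the `/ 4` scaling follow the normalized quantized log-F0 description in the linked paper; `speaker_f0_stats` and `normalize_f0` are hypothetical helper names, and the repo's actual preprocessing may differ:

```python
import numpy as np

def speaker_f0_stats(f0_list):
    # Pool the voiced frames (f0 > 0) from all of one speaker's
    # utterances and compute stats in the log domain.
    voiced = np.concatenate([f0[f0 > 0] for f0 in f0_list])
    log_f0 = np.log(voiced)
    return log_f0.mean(), log_f0.std()

def normalize_f0(f0, mu, sigma):
    # Map voiced frames to roughly [-1, 1] via (log f0 - mu) / (4 * sigma);
    # unvoiced frames (f0 == 0) stay at 0.
    out = np.zeros_like(f0, dtype=np.float64)
    voiced = f0 > 0
    out[voiced] = (np.log(f0[voiced]) - mu) / (4 * sigma)
    return np.clip(out, -1.0, 1.0)
```

At inference time, the per-speaker (µ, σ) computed during preprocessing would be loaded and reused; for unseen speakers, the stats of the input utterance itself can substitute, as suggested above.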
@auspicious3000 Thanks for the response! As a followup question, could you confirm whether the following pipeline for data augmentation is correct?
Since we are now using randomly cropped audio segments, I suppose the previous requirement of a fixed length of 128 frames no longer holds as long as segment lengths are multiples of freq=32, so we instead zero-pad segments to match the length of the longest segment in the batch.
My concern is mostly about the order in which augmentation steps are performed.
- Each segment has different factors. For each audio in the batch:
  1. Draw a random number L ~ U(1, 3).
  2. Split the audio into (audio length / L) segments (not sure what to do if the final segment is shorter than L?).
  3. For each segment: compress or stretch it using a factor between 0.7 and 1.35, then change the signal power to between 10% and 100%.
- Segments from the same audio share the same factors. For each audio in the batch:
  1. Compress or stretch the audio using a factor between 0.7 and 1.35.
  2. Change the signal power to between 10% and 100%.
  3. Draw a random number L ~ U(1, 3).
  4. Split the audio into (audio length / L) segments (not sure what to do if the final segment is shorter than L?).
There is no need to split the audio. The post-processing length is the same within the batch. Just index from the spectrogram. For example, [0, 0.5,1, 1.5] and [0, 2, 4, 6] are two instances in the same batch with length=4, where the former is stretched and the latter is compressed.
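The indexing trick could be sketched as follows. `retime` is a hypothetical name, and fractional indices are simply rounded down here, whereas an actual implementation might interpolate between frames:

```python
import numpy as np

def retime(spec, rate, out_len):
    # spec: (T, n_mels) spectrogram. rate < 1 stretches and rate > 1
    # compresses, matching the [0, 0.5, 1, 1.5] vs. [0, 2, 4, 6] example:
    # both index patterns produce an output of the same batch length.
    idx = np.floor(np.arange(out_len) * rate).astype(int)
    idx = np.clip(idx, 0, spec.shape[0] - 1)  # guard against overrunning T
    return spec[idx]
```

The same index array can also be applied to the frame-level F0 contour, so that it stays aligned with the retimed mel spectrogram.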
Recently, I have been trying to improve the original autovc using F0 information. Using 256-dimensional one-hot vectors in the original autovc seems to perform well, but in the process of this improvement I found that using a 256-dimensional one-hot vector gets a very low MOS score for the speech. I want to know whether a one-hot vector can still be used in the F0-based improvement if zero-shot conversion is not needed.
@Miralan Yes. If you have N speakers, just use N-dimensional one-hot embedding.
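A minimal sketch of that embedding, replacing the d-vector speaker encoder when zero-shot conversion is not needed:

```python
import numpy as np

def one_hot_speaker(speaker_id, num_speakers):
    # One entry per seen speaker; no pretrained speaker encoder required.
    v = np.zeros(num_speakers, dtype=np.float32)
    v[speaker_id] = 1.0
    return v
```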
So if I time-stretch or compress the mel spectrogram, should I apply the same time-stretching or compression to the fundamental frequency sequence?
Yes
Miralan, very impressive to hear you're doing MOS experiments with F0 information applied. Is there anywhere I can hear your generated samples? Would love to talk more about this!
OK, I did that experiment a long time ago, so I can't find the results anymore. But I have tried concatenating normalized F0s, and it didn't work well; for example, some content was missing from the output wav. Maybe you can try CREPE to extract F0s.