autovc
How to get the same mel feature in "metadata.pkl"?
I used your default parameters and code to compute the mel feature of "p225_001.wav" in the VCTK corpus. However, the mel feature I get has shape (385, 80), not the (90, 80) stored in "metadata.pkl". Do you have extra processing steps?
No, I didn't.
So why is the first dimension not the same? I used the (385, 80) mel feature together with your model and your speaker embeddings from "metadata.pkl" to generate the audio "p225xp228", but it only produced about 6 s of garbled voice; I cannot hear the words "please call stella". So how did you reduce the dimension from 385 to 90?
The first dim is the number of frames. There is no dimension reduction. It should be around 90 for that utterance. Please double-check your code.
I used your code and the parameters from issue #4 to generate the mel feature; the hop_size is 256 and the resulting shape is (385, 80). The code is below. If there is a bug, please point it out, thanks!
```python
import os
import numpy as np
from math import ceil
import soundfile as sf
from scipy import signal
from scipy.signal import get_window
from librosa.filters import mel


def butter_highpass(cutoff, fs, order=5):
    # Butterworth high-pass filter coefficients
    nyq = 0.5 * fs
    normal_cutoff = cutoff / nyq
    b, a = signal.butter(order, normal_cutoff, btype='high', analog=False)
    return b, a


def pySTFT(x, fft_length=1024, hop_length=256):
    # Magnitude STFT with a Hann window and a hop of 256 samples
    x = np.pad(x, int(fft_length // 2), mode='reflect')
    noverlap = fft_length - hop_length
    shape = x.shape[:-1] + ((x.shape[-1] - noverlap) // hop_length, fft_length)
    strides = x.strides[:-1] + (hop_length * x.strides[-1], x.strides[-1])
    result = np.lib.stride_tricks.as_strided(x, shape=shape, strides=strides)
    fft_window = get_window('hann', fft_length, fftbins=True)
    result = np.fft.rfft(fft_window * result, n=fft_length).T
    return np.abs(result)


mel_basis = mel(16000, 1024, fmin=90, fmax=7600, n_mels=80).T
min_level = np.exp(-100 / 20 * np.log(10))
b, a = butter_highpass(30, 16000, order=5)

dirName = '../dataset/VCTK-Corpus/wav48'
subdir = 'p225'
fileName = 'p225_001.wav'

x, fs = sf.read(os.path.join(dirName, subdir, fileName))
y = signal.filtfilt(b, a, x)           # remove low-frequency rumble below 30 Hz
wav = y
D = pySTFT(wav).T                      # (frames, 513) magnitude spectrogram
D_mel = np.dot(D, mel_basis)           # project onto 80 mel bands
D_db = 20 * np.log10(np.maximum(min_level, D_mel)) - 16
S = np.clip((D_db + 100) / 100, 0, 1)  # normalize to [0, 1]

print(S.shape)
```
The sampling rate should be 16k instead of 48k
Thank you!
I have another question: I replaced the soundfile read with the following line to load the data at 16 kHz:
x, fs = librosa.load(os.path.join(dirName, subdir, fileName), sr=16000)
However, the resulting shape is (129, 80), still not (90, 80).
Thank you!
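For what it's worth, the first dimension is just the number of STFT frames, so it scales with the number of samples. Below is a rough sketch (my own helper, not part of the repo) using the same fft_length and hop_length as the code above; the sample counts are only back-calculated from the shapes reported in this thread:

```python
# pySTFT pads the signal by fft_length // 2 on each side and then slides a Hann
# window with a stride of hop_length, so the frame count grows linearly with
# the number of samples.
def expected_frames(num_samples, fft_length=1024, hop_length=256):
    padded = num_samples + 2 * (fft_length // 2)   # reflect padding on both sides
    noverlap = fft_length - hop_length
    return (padded - noverlap) // hop_length

# Sample counts are rough figures inferred from the shapes reported above
# (the utterance is roughly 2 s long):
print(expected_frames(98_500))   # ~385 frames when the file is read at 48 kHz
print(expected_frames(32_800))   # ~129 frames after resampling to 16 kHz
```

So the 385 → 129 drop is exactly the 48 kHz → 16 kHz ratio; the remaining gap down to ~90 is addressed later in this thread.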
Hello, I ran into the same issue as you. I'd like to generate my own "metadata.pkl" file to convert voices from the small training example provided by the author (e.g. "\wavs\p225\p225_003.wav"), so I tried to use "make_spect.py" to generate the speech mel-spectrograms myself. However, my result is (376, 80), not (90, 80).
I noticed that you asked the author the same question and got the answer "change the sampling rate from 48k to 16k". However, your code already uses 16k as sr, not 48k, which confuses me; I'd like to know how you solved this issue.
Thank you very much!
This is because I use the VCTK corpus downloaded from https://datashare.ed.ac.uk/handle/10283/2950. I could not find 16 kHz audio there, so I use the 48 kHz audio and read it with the code I posted above.
Do you mean that you used the 48 kHz VCTK dataset and downsampled it to 16 kHz, and that is how you got the (90, 80) mel-spectrogram to convert? That is strange: the author's provided training samples (in "\wav\p225\p225_003.wav") are already at 16 kHz, but give (376, 80).
Furthermore, have you ever tried converting voices with your own "metadata.pkl"? If so, could you please give me some advice? I'm new to voice conversion and don't know much about how to build the model. Thank you again!
I don't get shape (90, 80); I get (129, 80) instead. The only change to the code is replacing "x, fs = sf.read(os.path.join(dirName, subdir, fileName))" with "x, fs = librosa.load(os.path.join(dirName, subdir, fileName), sr=16000)". You will get the same result if you use the dataset downloaded from the link I gave.
I get shape (129, 80) as well. Any update on this?
The length does not have to be 90. As long as the sampling frequency is correct, it should be fine.
Many thanks for your prompt reply. Unfortunately, I noticed that the audio quality is not as good. Is there any chance you used a particular procedure for downsampling to 16kHz? Or maybe you performed some preprocessing while downsampling?
Thanks
No, and the downsampling procedure should not make a big difference.
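For reference, here is a minimal sketch of two common routes from the 48 kHz VCTK files to a 16 kHz signal, reusing the path from the code earlier in the thread; neither is confirmed to be what the author used:

```python
import librosa
import soundfile as sf
from scipy.signal import resample_poly

path = '../dataset/VCTK-Corpus/wav48/p225/p225_001.wav'  # path from the code above

# Option 1: let librosa resample while loading (what was used earlier in this thread)
x_a, sr_a = librosa.load(path, sr=16000)

# Option 2: read at the native rate and use polyphase resampling (48k -> 16k is exactly 3:1)
x_48k, sr_native = sf.read(path)
x_b = resample_poly(x_48k, up=1, down=3)
```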
The reason I suspected some additional preprocessing is that, when analysing the spectrograms, I noticed differences between the original dataset and your version.
Below is the spectrogram that I computed starting from the original dataset, downsampling to 16kHz, and applying make_spect.py (shape 119*80)
Below is the spectrogram for p225_001 that you included in metadata.pkl (shape 90*80)
Below is the spectrogram that I computed starting from the file that you host on the demo page (https://auspicious3000.github.io/autovc-demo/audios/ground_truth1/p225_001.wav), downsampling to 16kHz (originally at 22050Hz), and applying make_spect.py (shape 90*80)
I don't understand why your files produce almost identical spectrograms, while if we start from the original dataset we get significantly different results.
The audio quality is affected as well:
"p225xp225 (8).wav" is the audio generated by the original dataset "p225xp225 (7).wav" is the audio generated by the metadata.pkl in this repository
Do you have any idea of what could be the difference between your files and the files in the original dataset?
I finally found that the difference is the trimming at the head and tail of the audio. I reproduced an almost identical file by trimming it by hand, but I couldn't find the exact silence-trimming procedure that you used.
OK. That explains it. I trimmed the silence off by hand.
@auspicious3000 You mean you trimmed the silence off the whole VCTK dataset by hand to generate your training data and train the model?
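For anyone else trying to reproduce the ~(90, 80) shape: since the silence was trimmed by hand, there is no exact procedure to copy. A rough programmatic stand-in is librosa.effects.trim with a hand-picked top_db threshold; the threshold below is a guess, not the author's setting:

```python
import librosa

# Load at 16 kHz and trim leading/trailing silence before computing the mel-spectrogram.
x, sr = librosa.load('../dataset/VCTK-Corpus/wav48/p225/p225_001.wav', sr=16000)
x_trimmed, _ = librosa.effects.trim(x, top_db=25)   # top_db is an assumption
# Feeding x_trimmed into the mel-spectrogram code earlier in this thread should
# bring the frame count much closer to the ~90 frames stored in metadata.pkl.
```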