autovc
How to get the same mel feature in "metadata.pkl"?
I used your default parameters and code to compute the mel feature of "p225_001.wav" in the VCTK corpus. However, the mel feature I get has shape (385, 80), not the (90, 80) stored in "metadata.pkl". Do you have extra processing steps?
No, I didn't.
So why is the first dimension not the same? I used the (385, 80) mel feature together with your model and your speaker embeddings from "metadata.pkl" to generate the audio "p225xp228", but it only produced about 6 s of garbled voice; I cannot hear the words "please call stella". So how did you reduce the dimension from 385 to 90?
The first dim is the number of frames. There is no dimension reduction. It should be around 90 for that utterance. Please double-check your code.
I used your code and the parameters from issue #4 to generate the mel feature; the hop_size is 256 and the resulting shape is (385, 80). The code is below. If there is a bug, please point it out, thanks!
```python
import os
import numpy as np
from math import ceil
import soundfile as sf
from scipy import signal
from scipy.signal import get_window
from librosa.filters import mel


def butter_highpass(cutoff, fs, order=5):
    # Butterworth high-pass filter coefficients
    nyq = 0.5 * fs
    normal_cutoff = cutoff / nyq
    b, a = signal.butter(order, normal_cutoff, btype='high', analog=False)
    return b, a


def pySTFT(x, fft_length=1024, hop_length=256):
    # Magnitude STFT with a Hann window and a hop of 256 samples
    x = np.pad(x, int(fft_length // 2), mode='reflect')
    noverlap = fft_length - hop_length
    shape = x.shape[:-1] + ((x.shape[-1] - noverlap) // hop_length, fft_length)
    strides = x.strides[:-1] + (hop_length * x.strides[-1], x.strides[-1])
    result = np.lib.stride_tricks.as_strided(x, shape=shape, strides=strides)
    fft_window = get_window('hann', fft_length, fftbins=True)
    result = np.fft.rfft(fft_window * result, n=fft_length).T
    return np.abs(result)


mel_basis = mel(16000, 1024, fmin=90, fmax=7600, n_mels=80).T
min_level = np.exp(-100 / 20 * np.log(10))
b, a = butter_highpass(30, 16000, order=5)

dirName = '../dataset/VCTK-Corpus/wav48'
subdir = 'p225'
fileName = 'p225_001.wav'

x, fs = sf.read(os.path.join(dirName, subdir, fileName))
y = signal.filtfilt(b, a, x)           # remove low-frequency rumble below 30 Hz
wav = y
D = pySTFT(wav).T                      # (frames, 513) magnitude spectrogram
D_mel = np.dot(D, mel_basis)           # project onto 80 mel bands
D_db = 20 * np.log10(np.maximum(min_level, D_mel)) - 16
S = np.clip((D_db + 100) / 100, 0, 1)  # normalize to [0, 1]

print(S.shape)
```
The sampling rate should be 16k instead of 48k
Thank you!
I have another question: I replaced the soundfile read with the following line to load the data at 16 kHz:
x, fs = librosa.load(os.path.join(dirName, subdir, fileName), sr=16000)
However, the resulting shape is (129, 80), still not (90, 80).
Thank you!
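For what it's worth, the first dimension is just the number of STFT frames, so it scales with the number of samples. Below is a rough sketch (my own helper, not part of the repo) using the same fft_length and hop_length as the code above; the sample counts are only back-calculated from the shapes reported in this thread:

```python
# pySTFT pads the signal by fft_length // 2 on each side and then slides a Hann
# window with a stride of hop_length, so the frame count grows linearly with
# the number of samples.
def expected_frames(num_samples, fft_length=1024, hop_length=256):
    padded = num_samples + 2 * (fft_length // 2)   # reflect padding on both sides
    noverlap = fft_length - hop_length
    return (padded - noverlap) // hop_length

# Sample counts are rough figures inferred from the shapes reported above
# (the utterance is roughly 2 s long):
print(expected_frames(98_500))   # ~385 frames when the file is read at 48 kHz
print(expected_frames(32_800))   # ~129 frames after resampling to 16 kHz
```

So the 385 → 129 drop is exactly the 48 kHz → 16 kHz ratio; the remaining gap down to ~90 is addressed later in this thread.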
Hello, I ran into the same issue as you. I'd like to generate my own "metadata.pkl" file to convert voices from the small training example provided by the author (e.g. "\wavs\p225\p225_003.wav"), so I tried to use "make_spect.py" to generate the speech mel-spectrograms myself. However, my result is (376, 80), not (90, 80).
I noticed that you asked the author the same question and got the answer "change the sampling rate from 48k to 16k". However, your code already uses 16k as sr, not 48k, which confuses me; I'd like to know how you solved this issue.
Thank you very much!
This is because I use the VCTK corpus downloaded from https://datashare.ed.ac.uk/handle/10283/2950. I could not find 16 kHz audio there, so I use the 48 kHz audio and read it with the code I posted above.
Do you mean that you used the 48 kHz VCTK dataset and downsampled it to 16 kHz, and that is how you got the (90, 80) mel-spectrogram to convert? That is strange: the author's provided training samples (in "\wav\p225\p225_003.wav") are already at 16 kHz, but give (376, 80).
Furthermore, have you ever tried converting voices with your own "metadata.pkl"? If so, could you please give me some advice? I'm new to voice conversion and don't know much about how to build the model. Thank you again!
I don't get shape (90, 80); I get (129, 80) instead. The only change to the code is replacing "x, fs = sf.read(os.path.join(dirName, subdir, fileName))" with "x, fs = librosa.load(os.path.join(dirName, subdir, fileName), sr=16000)". You will get the same result if you use the dataset downloaded from the link I gave.
I get shape (129, 80) as well. Any update on this?
The length does not have to be 90. As long as the sampling frequency is correct, it should be fine.
Many thanks for your prompt reply. Unfortunately, I noticed that the audio quality is not as good. Is there any chance you used a particular procedure for downsampling to 16kHz? Or maybe you performed some preprocessing while downsampling?
Thanks
No, and the downsampling procedure should not make a big difference.
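For reference, here is a minimal sketch of two common routes from the 48 kHz VCTK files to a 16 kHz signal, reusing the path from the code earlier in the thread; neither is confirmed to be what the author used:

```python
import librosa
import soundfile as sf
from scipy.signal import resample_poly

path = '../dataset/VCTK-Corpus/wav48/p225/p225_001.wav'  # path from the code above

# Option 1: let librosa resample while loading (what was used earlier in this thread)
x_a, sr_a = librosa.load(path, sr=16000)

# Option 2: read at the native rate and use polyphase resampling (48k -> 16k is exactly 3:1)
x_48k, sr_native = sf.read(path)
x_b = resample_poly(x_48k, up=1, down=3)
```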
The reason I suspected some additional preprocessing is that, when analysing the spectrograms, I noticed differences between the original dataset and your version.
Below is the spectrogram that I computed starting from the original dataset, downsampling to 16kHz, and applying make_spect.py (shape 119*80)
Below is the spectrogram for p225_001 that you included in metadata.pkl (shape 90*80)
Below is the spectrogram that I computed starting from the file that you host on the demo page (https://auspicious3000.github.io/autovc-demo/audios/ground_truth1/p225_001.wav), downsampling to 16kHz (originally at 22050Hz), and applying make_spect.py (shape 90*80)
I don't understand why your files produce almost identical spectrograms, while if we start from the original dataset we get significantly different results.
The audio quality is affected as well:
"p225xp225 (8).wav" is the audio generated by the original dataset "p225xp225 (7).wav" is the audio generated by the metadata.pkl in this repository
Do you have any idea of what could be the difference between your files and the files in the original dataset?
I finally found that the difference is the trimming at the head and tail of the audio. I reproduced an almost identical file by trimming it by hand, but I couldn't find the exact silence-trimming procedure that you used.
OK. That explains it. I trimmed the silence off by hand.
@auspicious3000 You mean you trimmed the silence off the whole VCTK dataset by hand to generate your training data and train the model?
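For anyone else trying to reproduce the ~(90, 80) shape: since the silence was trimmed by hand, there is no exact procedure to copy. A rough programmatic stand-in is librosa.effects.trim with a hand-picked top_db threshold; the threshold below is a guess, not the author's setting:

```python
import librosa

# Load at 16 kHz and trim leading/trailing silence before computing the mel-spectrogram.
x, sr = librosa.load('../dataset/VCTK-Corpus/wav48/p225/p225_001.wav', sr=16000)
x_trimmed, _ = librosa.effects.trim(x, top_db=25)   # top_db is an assumption
# Feeding x_trimmed into the mel-spectrogram code earlier in this thread should
# bring the frame count much closer to the ~90 frames stored in metadata.pkl.
```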