
Inference with new input audio

shoegazerstella opened this issue 4 years ago · 4 comments

Hi and thank you for this amazing project!

I was trying to create a Colab notebook that takes an input audio file, lets me select a target speaker, and produces the converted output accordingly.

Here is the code; it works, but I am missing the part on how to change the speaker timbre. Do you have any tips on that?

Thanks a lot in advance!

shoegazerstella avatar Sep 25 '21 21:09 shoegazerstella

Conditioning on the speaker embedding changes the rhythm and the timbre at the same time.
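In other words, conversion pairs the source utterance's content with the *target* speaker's embedding. The sketch below is illustrative only (the 82-way one-hot encoding and the speaker indices are my assumptions, not AutoPST's actual API); it just shows what "swapping the speaker embedding" means mechanically:

```python
import numpy as np

NUM_SPEAKERS = 82  # hypothetical speaker count; depends on the training set

def one_hot_speaker(idx, n=NUM_SPEAKERS):
    # One-hot speaker code; a learned d-vector would play the same role
    emb = np.zeros(n, dtype=np.float32)
    emb[idx] = 1.0
    return emb

spk_emb_A = one_hot_speaker(3)  # source speaker (index is made up)
spk_emb_B = one_hot_speaker(7)  # target speaker: condition on this one to convert
```

Feeding the model `spk_emb_B` instead of `spk_emb_A` at inference time is what makes the output sound like speaker B.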

auspicious3000 avatar Sep 25 '21 22:09 auspicious3000

Hi @auspicious3000, I am computing the spectrogram like this:

import os

import numpy as np
import librosa
from librosa.filters import mel
from scipy import signal
from scipy.signal import get_window


def butter_highpass(cutoff, fs, order=5):
    nyq = 0.5 * fs
    normal_cutoff = cutoff / nyq
    b, a = signal.butter(order, normal_cutoff, btype='high', analog=False)
    return b, a
    
    
def pySTFT(x, fft_length=1024, hop_length=256):
    
    x = np.pad(x, int(fft_length//2), mode='reflect')
    
    noverlap = fft_length - hop_length
    shape = x.shape[:-1]+((x.shape[-1]-noverlap)//hop_length, fft_length)
    strides = x.strides[:-1]+(hop_length*x.strides[-1], x.strides[-1])
    result = np.lib.stride_tricks.as_strided(x, shape=shape,
                                             strides=strides)
    
    fft_window = get_window('hann', fft_length, fftbins=True)
    result = np.fft.rfft(fft_window * result, n=fft_length).T
    
    return np.abs(result)    
    
    
mel_basis = mel(sr=16000, n_fft=1024, fmin=90, fmax=7600, n_mels=80).T
min_level = np.exp(-100 / 20 * np.log(10))
b, a = butter_highpass(30, 16000, order=5)

AUDIO_DIR = "/content/drive/MyDrive/CODE/VoiceAE/audio/"
filename = os.path.join(AUDIO_DIR, "test.wav")

# Read audio file; resample to 16 kHz to match mel_basis and the filter design
# (librosa.load defaults to sr=22050, which would shift all frequencies)
#x, fs = sf.read(filename)
x, fs = librosa.load(filename, sr=16000, duration=4)

# Remove drifting noise
#y = signal.filtfilt(b, a, x)
y = x

# Add a little random noise for model robustness
#wav = y * 0.96 + (prng.rand(y.shape[0])-0.5)*1e-06
wav = y

# Compute spect
D = pySTFT(wav).T

# Convert to mel and normalize
D_mel = np.dot(D, mel_basis)
D_db = 20 * np.log10(np.maximum(min_level, D_mel)) - 16
S = np.clip((D_db + 100) / 100, 0, 1)    

# save spect (filename already includes AUDIO_DIR, so don't join it again)
np.save(filename[:-4], S.astype(np.float32), allow_pickle=False)
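As a standalone sanity check of the pipeline above (a sketch: a synthetic 440 Hz tone stands in for test.wav), pySTFT's magnitude peak for a pure tone should land within one bin of 440 Hz, and the frame count follows from the strided framing:

```python
import numpy as np
from scipy.signal import get_window

def pySTFT(x, fft_length=1024, hop_length=256):
    # Same implementation as above: strided framing + windowed rFFT
    x = np.pad(x, int(fft_length // 2), mode='reflect')
    noverlap = fft_length - hop_length
    shape = x.shape[:-1] + ((x.shape[-1] - noverlap) // hop_length, fft_length)
    strides = x.strides[:-1] + (hop_length * x.strides[-1], x.strides[-1])
    frames = np.lib.stride_tricks.as_strided(x, shape=shape, strides=strides)
    fft_window = get_window('hann', fft_length, fftbins=True)
    return np.abs(np.fft.rfft(fft_window * frames, n=fft_length).T)

sr = 16000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 440 * t)  # 1-second 440 Hz test tone

D = pySTFT(tone).T                  # (n_frames, 1 + fft_length // 2)
peak_hz = D.mean(axis=0).argmax() * sr / 1024  # bin width = sr / fft_length = 15.625 Hz
print(D.shape, peak_hz)
```

For a 1-second clip at 16 kHz this gives 63 frames of 513 frequency bins, with the peak at 437.5 Hz (the bin closest to 440 Hz).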

Then I do the inference like this:

waveform = wavegen(model, c=S) 

But of course something is missing: how can I condition on a specific speaker's embedding? Could you point me to some code, please?

shoegazerstella avatar Sep 25 '21 22:09 shoegazerstella

with torch.no_grad():
    spect_output, len_spect = P.infer_onmt(cep_real_A.transpose(2,1)[:,:14,:],
                                           real_mask_A, len_real_A, spk_emb_B)
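My reading of that call (the shapes below are guesses for illustration, not AutoPST's documented interface): `cep_real_A` carries the source utterance's spectral features, of which only the first 14 coefficients are fed in; `real_mask_A` and `len_real_A` describe the padding; and `spk_emb_B` is the target speaker's embedding, which is what selects the output voice. A dummy numpy shape sketch:

```python
import numpy as np

# All sizes here are hypothetical: batch, frames, channels, speaker-embedding dim
B, T, C, D_spk = 1, 63, 80, 82

cep_real_A = np.zeros((B, T, C), dtype=np.float32)   # source features (batch, time, chan)
cep_in = cep_real_A.transpose(0, 2, 1)[:, :14, :]    # numpy analogue of torch .transpose(2, 1)
len_real_A = np.array([T])                           # true length before padding
real_mask_A = np.ones((B, T), dtype=bool)            # True where frames are valid
spk_emb_B = np.random.default_rng(0).standard_normal(D_spk).astype(np.float32)  # target speaker

print(cep_in.shape)
```

Swapping `spk_emb_B` for a different speaker's embedding, while keeping the source features fixed, is the conversion step the thread is asking about.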

auspicious3000 avatar Sep 26 '21 00:09 auspicious3000

@shoegazerstella did you figure it out?

charan223 avatar Oct 11 '21 21:10 charan223