learning2listen
RE. Unable to reproduce audio 128-D mel spectrogram feature from raw video
Problem statement
I am trying to reproduce the audio feature pre-processing for a longer time-window sequence experiment, but the only detailed instructions available were from #2. However, the script in that answer extracts MFCC features from the (presumably pre-extracted) audio, and it returns an output with a different shape (1 x 4T x 20) from the audio feature in the dataset (1 x 4T x 128).
Issue reproduction
My snippet on Google Colab can be found HERE
Quoting the answer from #2:

Yes, we chose 4*T to allow for temporal alignment with the 30fps framerate of the videos, making it easier to process both the audio and the video frames in a unified way. T here refers to the number of frames in the video clip, so for the purposes of this paper, T=64. The exact code used to calculate the melspecs is as follows:
import numpy as np
import librosa
from PIL import Image

def load_mfcc(audio_path, num_frames):
    waveform, sample_rate = librosa.load('{}'.format(audio_path), sr=16000)
    win_len = int(0.025 * sample_rate)  # 25 ms analysis window
    hop_len = int(0.010 * sample_rate)  # 10 ms hop
    # fft_len is computed but never passed below, so librosa's default n_fft is used
    fft_len = 2 ** int(np.ceil(np.log(win_len) / np.log(2.0)))
    # By default this extracts only 20 MFCCs (n_mfcc=20)
    S_dB = librosa.feature.mfcc(y=waveform, sr=sample_rate, hop_length=hop_len)
    ## do some resizing to match frame rate
    im = Image.fromarray(S_dB)
    _, feature_dim = im.size
    scale_four = num_frames * 4
    im = im.resize((scale_four, feature_dim), Image.ANTIALIAS)  # Image.LANCZOS in newer Pillow
    S_dB = np.array(im)
    return S_dB

Hope this helps!
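For reference, my own check of what this script returns ('clip.wav' is a placeholder path): 64 frames at 30fps is about 2.13 s of audio, i.e. roughly 213 hops of 10 ms, which the resize stretches to 4*64 = 256 steps:

feats = load_mfcc('clip.wav', num_frames=64)
print(feats.shape)  # (20, 256): 20 MFCCs x 4*T steps, i.e. (1, 4T, 20) after transposing and adding a batch dimension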
I also tried extracting the Mel spectrogram directly, and even combined it with librosa's power_to_db, but the scale of my output still did not match the original dataset.
S_dB = librosa.feature.melspectrogram(y=waveform, sr=sample_rate)
# optional
# S_dB = librosa.power_to_db(S_dB)
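One knob I am unsure about (this is my assumption, not something the authors confirmed): power_to_db reports decibels relative to its ref argument, which defaults to 1.0, so choosing a different reference shifts the whole output:

import numpy as np
import librosa

waveform, sample_rate = librosa.load('clip.wav', sr=16000)  # placeholder path
S = librosa.feature.melspectrogram(y=waveform, sr=sample_rate)
S_dB_default = librosa.power_to_db(S)            # dB relative to ref=1.0 (the default)
S_dB_max = librosa.power_to_db(S, ref=np.max)    # dB relative to the loudest bin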
Below are the expected output and my outputs from the Mel spectrogram function before and after power_to_db. I extracted them from the same video file, done_conan_videos0/021LukeWilsonIsStartingToLookLikeChrisGainesCONANonTBS, in the raw dataset, based on the metadata provided by *_files_clean*. I assume that correct preprocessing should produce the same output as your original dataset.
My output (before power_to_db)
array([3.49407317e-04, 1.72899290e-05, 9.88764441e-06, 9.31489740e-06, 2.19979029e-05, 4.02382248e-05, 5.83300316e-05, 1.78770599e-04,
My output (after power_to_db)
array([-34.508053, -46.10779 , -48.621204, -49.872578, -46.910652, -42.93151 , -41.84772 , -37.57675 , -38.1189 , -38.486935,
The corresponding values from the dataset
array([[-50.593018, -47.35103 , -45.426086, -41.643738, -42.111137, -41.75349 , -41.146526, -38.722565, -39.55792 , -39.344612,
My question
May I ask for your advice on how, in detail, to extract the audio features so that the dataset can be reproduced? I believe other readers share the same question; see THIS
Hi, I would like to know if you have figured this out by now. I've tried to produce audio features of shape (1, 4T, 128) as a Mel spectrogram, using specific hop_length and n_mels parameters in librosa.feature.melspectrogram.
Running the provided script to produce the MFCC features, the output should have shape (1, 4T, 20) rather than 128. If you have any insight into why MFCCs were used and why the feature is 20-dimensional, please let me know :)
Hello! Thank you for your interest in our work! I've updated the comment in #2 with the exact audio extraction method. Please let me know if that helps. In general, I applied the same type of scaling as in the MFCC code, but swapped that line for the melspec code.
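Concretely, the swap would look roughly like the sketch below. This is a reconstruction rather than the exact script, and the audio path is a placeholder; note that librosa.feature.mfcc defaults to n_mfcc=20 while librosa.feature.melspectrogram defaults to n_mels=128, which is where the 20- vs. 128-dimension difference comes from:

import numpy as np
import librosa
from PIL import Image

def load_melspec(audio_path, num_frames):
    # Same loading and scaling as load_mfcc above, with the feature line swapped.
    waveform, sample_rate = librosa.load(audio_path, sr=16000)
    hop_len = int(0.010 * sample_rate)  # 10 ms hop, as in load_mfcc
    S = librosa.feature.melspectrogram(y=waveform, sr=sample_rate,
                                       hop_length=hop_len, n_mels=128)
    S_dB = librosa.power_to_db(S)
    # Resize the time axis to 4*num_frames so 4 audio steps align with each 30fps video frame.
    im = Image.fromarray(S_dB)
    _, feature_dim = im.size  # PIL size is (width, height) = (time, n_mels)
    im = im.resize((num_frames * 4, feature_dim), Image.LANCZOS)  # LANCZOS == the deprecated ANTIALIAS
    return np.array(im)  # (128, 4*T); transpose and add a batch dim for (1, 4T, 128)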