Size of feature
@cvondrick @yusufaytar Thank you very much for sharing this code.
I am new to audio. I was trying to extract features from my audio files. The feature size varies depending on the length of the input — what exactly does it depend on? I want to use the features for classification. How do I get a constant-dimension feature vector for all of my audio files? For example, an audio file with 1476864 samples produces a feature of dimension [1x1024x46x1], while another file with 2199168 samples produces a feature of dimension [1x1024x68x1]. How do I get a constant-dimension feature vector for both files?
How would I have to modify the sound signal to apply the net in a sliding-window fashion along the temporal direction?
You can apply global average pooling.
@gurkirt Each frame of the feature only describes a short segment of the audio. If you want to do recording-level classification, you need to compute recording-level statistics of the frame-level features. Averaging is one such statistic; others include standard deviation, min, max, etc.
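A minimal NumPy sketch of the pooling idea above (the feature shapes are taken from the question; mean and standard deviation are used here as the recording-level statistics, but any of the statistics mentioned would work the same way):

```python
import numpy as np

def pool_features(feat):
    """Collapse the variable-length time axis of a SoundNet feature map
    (shape [1, 1024, T, 1]) into a fixed-size recording-level vector."""
    frames = feat.squeeze()              # -> [1024, T]
    mean = frames.mean(axis=1)           # [1024], average over time
    std = frames.std(axis=1)             # [1024], spread over time
    return np.concatenate([mean, std])   # [2048], same size for any T

# Two clips of different length now yield same-size vectors:
a = pool_features(np.random.rand(1, 1024, 46, 1))
b = pool_features(np.random.rand(1, 1024, 68, 1))
assert a.shape == b.shape == (2048,)
```

These fixed-size vectors can then be fed to any standard classifier (SVM, logistic regression, etc.).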
Thanks, guys, for replying; I agree with you both. I figured it out later after supplying a few audio files with different temporal durations. Cheers, Gurkirt
Hi Guys! First of all, thank you for releasing the model and codes, they have been very useful for research we're pursuing now.
My question also has to do with the size of the features. We have two datasets of eight-second audio clips. I used the provided script to extract features from the clips in the first dataset and the output dimensions were 1024x5. However, when running the script on the second database, I got 1024x6. All the clips are eight seconds in duration.
Clips in the second database have a slightly higher bitrate (352 kb/s) and a sampling frequency of 22000 Hz. For clips in the first dataset, the bitrate is 256 kb/s and the sampling frequency is 16000 Hz.
Might this be the cause?
Thanks in advance for any guidance you can provide.
Whether you get 5 or 6 depends on the number of samples in the input, irrespective of duration. A higher sampling rate means more samples per second and hence more output features. I would suggest fixing the sampling rate of the inputs using ffmpeg or another resampling method.
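For illustration, a naive linear-interpolation resampler in NumPy showing how fixing the sample rate equalizes the sample count (in practice a tool such as ffmpeg, e.g. `ffmpeg -i in.wav -ar 16000 out.wav`, should be preferred since it applies proper anti-aliasing filtering):

```python
import numpy as np

def resample(audio, sr_in, sr_out):
    """Naive linear-interpolation resampler (illustration only;
    no anti-aliasing filter is applied)."""
    n_out = int(round(len(audio) * sr_out / sr_in))
    x_old = np.linspace(0, 1, num=len(audio), endpoint=False)
    x_new = np.linspace(0, 1, num=n_out, endpoint=False)
    return np.interp(x_new, x_old, audio)

clip = np.random.rand(22000 * 8)           # 8 s at 22 kHz
fixed = resample(clip, 22000, 16000)       # 8 s at 16 kHz
assert len(fixed) == 16000 * 8             # same count as the 16 kHz dataset
```

Once both datasets are at the same rate, eight-second clips contain the same number of samples and yield the same feature dimensions.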
@gurkirt Thanks a lot, your answer helped a lot. We had some inconsistencies with the sampling frequency.
The configuration of the first layer in the paper is given as { Dim : 220050 , # of filters : 16 , filter size : 64 , stride : 2 , padding : 32 }, and in the next pool layer the dim becomes 27,506. Can someone explain this transition mathematically?
@pranavgupta1234 27,506 appears to be 220,050 / 8. But I agree that the configuration of the first layer doesn't seem to yield this number. I have also tabulated the structure of SoundNet in this paper (top of Page 2): https://maigoakisame.github.io/papers/interspeech17.pdf And indeed the number 27,506 doesn't appear...
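A quick check of the arithmetic, assuming the standard output-length formula for 1-D convolution/pooling; the conv1 parameters are the ones quoted above, while the pool size/stride of 8 is an assumption about the architecture:

```python
def conv_out(n, k, s, p):
    """Output length of a 1-D conv/pool layer: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

n = 220050
after_conv1 = conv_out(n, k=64, s=2, p=32)          # = 110026
after_pool1 = conv_out(after_conv1, k=8, s=8, p=0)  # = 13753

assert after_conv1 == 110026
assert after_pool1 == 13753
# Neither equals the 27,506 reported in the paper, consistent with the
# observation above that the listed configuration doesn't yield that number.
```

So under these assumptions conv1 halves the input to 110,026 and an 8/8 pool gives 13,753, not 27,506.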
Hi, is there a minimum length for the audio input file? I am getting errors when extracting features from a 3-second audio file (mp3). However, there is no error when using audio of length >= 10 seconds.
I am getting the following error :
warnings.warn('PySoundFile failed. Trying audioread instead.')
Traceback (most recent call last):
File "extract_feat.py", line 86, in
I am getting the following error:
warnings.warn('PySoundFile failed. Trying audioread instead.')
Traceback (most recent call last):
  File "extract_feat.py", line 86, in <module>
    sound_samples = load_from_txt(args.audio_txt, config=local_config)
  File "/home/pratibha/SoundNet-tensorflow/util.py", line 31, in load_from_txt
    audios.append(preprocess(sound_sample, config))
  File "/home/pratibha/SoundNet-tensorflow/util.py", line 59, in preprocess
    raw_audio = np.tile(raw_audio, length/raw_audio.shape[0] + 1)
  File "/home/pratibha/anaconda3/envs/soundNet_env/lib/python3.5/site-packages/numpy/lib/shape_base.py", line 1157, in tile
    return c.reshape(shape_out)
TypeError: 'float' object cannot be interpreted as an integer
I have encountered this problem as well. Have you found the reason? Based on the description in the paper, it can also be used on the ESC datasets, whose clips are shorter than 10 s. I wonder if only the TensorFlow version of the code has this problem.
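Judging from the traceback, a likely cause: in Python 3 the `/` operator always returns a float, and `np.tile` requires an integer repetition count, so short clips (which trigger the tiling branch in `preprocess`) fail. A sketch of the fix using integer division — the function below mirrors the padding logic in `util.py` but is reconstructed here, so names and details are assumptions:

```python
import numpy as np

def preprocess_pad(raw_audio, target_length):
    """Tile a short clip until it reaches target_length samples.
    Using // (integer division) keeps the repetition count an int,
    avoiding the TypeError raised by np.tile under Python 3."""
    reps = target_length // raw_audio.shape[0] + 1  # was: length / shape[0] + 1
    tiled = np.tile(raw_audio, reps)
    return tiled[:target_length]                    # trim the overshoot

short = np.random.rand(3 * 22050)           # 3-second clip at 22050 Hz
padded = preprocess_pad(short, 10 * 22050)  # pad up to 10 seconds
assert padded.shape[0] == 10 * 22050
```

Equivalently, wrapping the original expression in `int(...)` should also work; clips of >= 10 s never hit this branch, which matches the behavior described above.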