soundnet_keras

issue with short audios

Open zmahoor opened this issue 7 years ago • 15 comments

The program crashes for short audio clips (less than 4 seconds). Any thoughts on what could be wrong? Are there any requirements on the input length?

W tensorflow/core/framework/op_kernel.cc:1202] OP_REQUIRES failed at conv_ops.cc:384 : Invalid argument: computed output size would be negative
[[Node: conv1d_8/convolution/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", dilations=[1, 1, 1, 1], padding="VALID", strides=[1, 1, 2, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](conv1d_8/convolution/ExpandDims, conv1d_8/convolution/ExpandDims_1)]]

zmahoor avatar Mar 04 '18 13:03 zmahoor

Yeah, that's a limitation of the original paper because of the convolution receptive fields. You could fix it by padding your input?
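
To illustrate the receptive-field limitation, here is a minimal sketch of how the time dimension shrinks through a stack of "VALID" (unpadded) layers. The `LAYERS` list is a hypothetical (kernel, stride) stack chosen for illustration; it is not SoundNet's actual configuration. The helper names are also made up for this example. Once the computed length reaches zero or below, the input is too short for the stack.

```python
def valid_conv_out_len(n_in, kernel, stride):
    """Output length of a 1-D conv/pool layer with 'VALID' (no) padding."""
    return (n_in - kernel) // stride + 1


def trace_lengths(n_in, layers):
    """Return the time-axis length after each (kernel, stride) layer."""
    lengths = [n_in]
    for kernel, stride in layers:
        n_in = valid_conv_out_len(n_in, kernel, stride)
        lengths.append(n_in)
    return lengths


# Hypothetical stack for illustration only, NOT SoundNet's real layers.
LAYERS = [(64, 2), (8, 8), (32, 2), (8, 8), (16, 2)]

for n_samples in (100_000, 1_000):
    lengths = trace_lengths(n_samples, LAYERS)
    status = "ok" if lengths[-1] > 0 else "too short"
    print(n_samples, lengths, status)
```

With the real layer parameters substituted in, the same arithmetic would give the smallest input length the network can accept.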

pseeth avatar Mar 04 '18 20:03 pseeth

I found a TensorFlow version of SoundNet that still works with short audio clips. I am sure the code does not pad the audio input, because I commented out the part that did the padding! The only difference I found is that your code uses Conv1D and the other one uses Conv2D. Not sure if it matters.

zmahoor avatar Mar 04 '18 21:03 zmahoor

Hi! I have found a TensorFlow version, but it does not seem to work with short audio clips (https://github.com/eborboihuc/SoundNet-tensorflow)... Can you post the link to the one that works with short audio? Thanks in advance.

janaal1 avatar Oct 23 '18 11:10 janaal1

I am facing a similar issue with short videos (less than 2 seconds). So is it a good idea to zero-pad the audio clip in order to extract the features? Or is there another version that I can use for extracting the features?
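
For reference, zero padding could look roughly like the sketch below. The function name `pad_to_min_length`, the 22050 Hz sample rate, and the 5-second floor are assumptions made for illustration; the thread does not state the exact minimum length the model needs.

```python
import numpy as np


def pad_to_min_length(audio, min_samples):
    """Right-pad a 1-D waveform with zeros up to min_samples.

    Clips that are already long enough are returned unchanged in length.
    """
    missing = max(0, min_samples - audio.shape[0])
    return np.pad(audio, (0, missing), mode="constant")


# Assumed values for illustration only.
SAMPLE_RATE = 22050
MIN_SECONDS = 5

# Stand-in for a 2-second clip loaded from disk.
clip = np.random.uniform(-1.0, 1.0, size=2 * SAMPLE_RATE).astype(np.float32)
padded = pad_to_min_length(clip, MIN_SECONDS * SAMPLE_RATE)
print(clip.shape, padded.shape)
```

Whether features extracted from a mostly silent padded clip are useful is a separate question that this sketch does not answer.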

kristosh avatar Oct 31 '18 14:10 kristosh

I did not add zeros in my solution. I just changed the window parameters from the repo I linked before. I do not know the scores because I have not run the final application yet.

I cannot give you a final answer. What do you think about my approach?

janaal1 avatar Oct 31 '18 14:10 janaal1

Then why not change the sample_rate when loading the audio clips? Could that be a solution too?

kristosh avatar Oct 31 '18 14:10 kristosh

I also changed the sample_rate. Sorry, I forgot to mention it. I changed both parameters, window_size and sample_rate, in order to extract features from my audio files.

janaal1 avatar Oct 31 '18 14:10 janaal1

But why not with the Keras version? I am trying to read my wav files, and the size returned after prediction (with the sample rate made 3 times bigger) is (1, 0, 401). Any idea why that is happening? Have you come across a similar issue?

kristosh avatar Oct 31 '18 14:10 kristosh

I found the Keras version to be less stable... Can you paste your code so I can see what you have modified, please?

janaal1 avatar Oct 31 '18 14:10 janaal1

I modified line 29 and the sample rate (to be 3*sample_rate). I have a feeling, though, that my issue is in the raw file itself. When I use the sample file, it works properly. When I use my own file, which is short (around 2 sec), I get the error you mentioned; if I instead change the sample rate (or, for example, concatenate the input audio vector three times), the result is a vector with size (1, 0, 401).

kristosh avatar Oct 31 '18 14:10 kristosh

That is very weird. Let's keep in touch about this issue. I think next week I will be able to work on the Keras repo on GitHub to check whether I can solve the problem in the same way as in TensorFlow.

janaal1 avatar Oct 31 '18 14:10 janaal1

So you recommend trying the TensorFlow version using sound8.npy with my file, right? In that case, do I need to change the way I am reading the wav files (is that version meant for mp3)?

kristosh avatar Oct 31 '18 14:10 kristosh

Yes, that is right. I could extract features from both formats: mp3 and wav.

janaal1 avatar Oct 31 '18 15:10 janaal1

With the TensorFlow code I have another issue: when I add my files to the text file, load_from_txt complains and returns the following error: *** TypeError: 'float' object cannot be interpreted as an integer

kristosh avatar Oct 31 '18 16:10 kristosh

@kristosh you could change this line in util.py from

raw_audio = np.tile(raw_audio, length/raw_audio.shape[0] + 1)

to

raw_audio = np.tile(raw_audio, length//raw_audio.shape[0] + 1)
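
The reason this helps: in Python 3, `/` always returns a float, while `np.tile` needs an integer repeat count, and `//` keeps the count an integer. A minimal standalone sketch follows; the function name `tile_to_length` and the final trim to `length` are assumptions added for illustration and are not necessarily what util.py does.

```python
import numpy as np


def tile_to_length(raw_audio, length):
    """Repeat a 1-D clip until it covers `length` samples, then trim."""
    # Floor division keeps the repeat count an int under Python 3.
    reps = length // raw_audio.shape[0] + 1
    tiled = np.tile(raw_audio, reps)
    # Trimming is an assumption for this sketch, so the output has a fixed size.
    return tiled[:length]


clip = np.arange(5, dtype=np.float32)  # stand-in for a short waveform
out = tile_to_length(clip, 12)
print(out.shape, out)
```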

sudonto avatar Nov 13 '18 05:11 sudonto