
We did some similar work on audio recognition using spectrograms as input


By using spectrograms, we can treat audio recognition as image recognition, so all the techniques used for image recognition can be applied to audio recognition as well.

This example uses the same samples as speech_commands: WAV audio files of people saying ten different words ('yes', 'no', 'up', 'down', 'left', 'right', 'on', 'off', 'stop', 'go') from the Speech Commands dataset.
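
In the standard Speech Commands layout there is one folder per word, each holding short one-second WAV clips. A minimal sketch for gathering file paths and integer labels (the data_dir argument and the collect_samples helper name are illustrative, not part of the original post) could look like this:

import os

WORDS = ['yes', 'no', 'up', 'down', 'left', 'right', 'on', 'off', 'stop', 'go']

def collect_samples(data_dir):
	# Assumes the standard Speech Commands layout: one sub-folder per word,
	# each containing short WAV clips
	filepaths, labels = [], []
	for label, word in enumerate(WORDS):
		word_dir = os.path.join(data_dir, word)
		for name in os.listdir(word_dir):
			if name.endswith('.wav'):
				filepaths.append(os.path.join(word_dir, name))
				labels.append(label)
	return filepaths, labels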

First, we transform the original audio files into spectrograms. We use the following Python code for this task:

import numpy as np
from scipy.io import wavfile

X_SIZE = 16000    # one second of audio at 16 kHz
IMG_SIZE = 128    # output spectrogram is IMG_SIZE x IMG_SIZE

def spectrogram(filepath):
	framerate, wav_data = wavfile.read(filepath)

	window_length = 512    # samples per FFT window
	window_shift = 121     # hop size, chosen so 128 windows fit into 16000 samples

	# Truncate clips longer than one second
	if len(wav_data) > X_SIZE:
		wav_data = wav_data[:X_SIZE]

	# Zero-pad clips shorter than one second
	X = np.zeros(X_SIZE).astype('float32')
	X[:len(wav_data)] += wav_data
	spec = np.zeros((IMG_SIZE, IMG_SIZE)).astype('float32')

	# Short-time Fourier transform: one Hann-windowed FFT per column
	for i in range(IMG_SIZE):
		start = i * window_shift
		end = start + window_length
		sig = np.abs(np.fft.rfft(X[start:end] * np.hanning(window_length)))
		# Keep the first 128 non-DC bins, flipped so low frequencies end up
		# at the bottom of the image
		spec[:,i] = (sig[1:IMG_SIZE + 1])[::-1]

	# Normalize, compress the dynamic range with log10, then rescale to [-0.5, 0.5]
	spec = (spec-spec.min())/(spec.max()-spec.min())
	spec = np.log10((spec * 100 + 0.01))
	spec = (spec-spec.min())/(spec.max()-spec.min()) - 0.5

	return spec
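
As a quick usage sketch, the function above can be applied to a single clip and inspected with matplotlib (the plotting code is only an illustration; the test.wav filename matches the example below):

import matplotlib.pyplot as plt

# 'spectrogram' is the function defined above
spec = spectrogram('test.wav')    # 128x128 array scaled to roughly [-0.5, 0.5]
plt.imshow(spec, cmap='gray')
plt.xlabel('time frame')
plt.ylabel('frequency bin (high frequencies at the top)')
plt.show()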

For example, the spectrogram produced for test.wav is shown in the image attached to the original issue.

To reduce the amount of computation, we limit the size of the spectrogram to 128×128, so the model can be defined as follows:
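
As a rough placeholder, a minimal Keras sketch of a small CNN over 128×128 single-channel spectrograms with ten output classes (the framework choice, layer sizes, and hyperparameters are assumptions, not necessarily the author's exact architecture) might look like this:

from tensorflow.keras import layers, models

# Small CNN that treats each spectrogram as a 1-channel 128x128 image
model = models.Sequential([
	layers.Conv2D(32, (3, 3), activation='relu', input_shape=(128, 128, 1)),
	layers.MaxPooling2D((2, 2)),
	layers.Conv2D(64, (3, 3), activation='relu'),
	layers.MaxPooling2D((2, 2)),
	layers.Conv2D(64, (3, 3), activation='relu'),
	layers.MaxPooling2D((2, 2)),
	layers.Flatten(),
	layers.Dense(128, activation='relu'),
	layers.Dropout(0.5),
	layers.Dense(10, activation='softmax'),  # one output per target word
])
model.compile(optimizer='adam',
	loss='sparse_categorical_crossentropy',
	metrics=['accuracy'])

Fitting such a model with model.fit(train_specs[..., None], train_labels, epochs=15, ...) — where train_specs and train_labels are hypothetical arrays built with the spectrogram code above — would correspond to the training run described next.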

Running this model, you will get a test accuracy of around 93% (while the train accuracy is about 98%) after 15 epochs.

Reference

Github Link
