java-speech-api
bytesToDoubleArray() sizing & FFT
As per the recommendations of Moattar and Homayounpour, I'm trying to detect voice activity using a 10ms sliding window.
For 10ms of 16kHz, 16-bit mono audio, getNumOfBytes(.01) returns 320 (the expression evaluates to 320.5, but it is cast to an int).
...why add the .5?
    public int getNumOfBytes(double seconds) {
        AudioFormat format = getAudioFormat();
        return (int)(seconds * format.getSampleRate() * format.getFrameSize() + .5);
    }
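For reference, a minimal sketch of what the `+ .5` does. The helper names and explicit parameters are hypothetical (the real method reads them from the AudioFormat). For non-negative values, adding 0.5 before the cast rounds to the nearest integer, whereas a bare cast truncates toward zero. With 0.01 s at 16 kHz and a 2-byte frame the product is exactly 320, so both variants agree here.

```java
final class WindowSizing {
    // Hypothetical stand-in for getNumOfBytes(): same arithmetic, but the
    // sample rate and frame size are passed in instead of read from AudioFormat.
    static int numBytesRounded(double seconds, float sampleRate, int frameSize) {
        // Adding 0.5 before the cast rounds to the nearest integer
        // (valid for non-negative values only).
        return (int) (seconds * sampleRate * frameSize + 0.5);
    }

    // Same computation without the 0.5: the cast truncates toward zero.
    static int numBytesTruncated(double seconds, float sampleRate, int frameSize) {
        return (int) (seconds * sampleRate * frameSize);
    }
}
```

The two only differ when the product has a fractional part of 0.5 or more, for example 0.0011 s of 8 kHz, 1-byte-per-frame audio (8.8 bytes).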
Then getFrequency() calls bytesToDoubleArray(), passing the 320 bytes. Another point of confusion is the calculation of the size of micBufferData:
    double[] micBufferData = new double[bytesRecorded - bytesPerSample +1];
    for (int index = 0, floatIndex = 0; index < bytesRecorded - bytesPerSample + 1; index += bytesPerSample, floatIndex++) {
        ...
        micBufferData[floatIndex] = sample32;
    }
With bytesPerSample = 2, the code allocates space for 319 doubles (320 - 2 + 1), but the loop only writes 160 of them, so when it's done everything after micBufferData[159] is 0.0.
Back in getFrequency() I end up with an array of 319 Complex values, but again, everything after index 159 is 0.0, 0.0.
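A sketch of the sizing I would have expected, with one array slot per sample (bytesRecorded / bytesPerSample). This is not the library's code: the class and method names are hypothetical, and it assumes signed 16-bit little-endian mono PCM, which may not match the actual AudioFormat.

```java
final class PcmConversion {
    // Hypothetical helper. Assumes signed 16-bit little-endian mono PCM.
    static double[] toSamples(byte[] bytes, int bytesRecorded) {
        final int bytesPerSample = 2;
        // One slot per sample: 320 bytes -> 160 doubles, no trailing zeros.
        int sampleCount = bytesRecorded / bytesPerSample;
        double[] samples = new double[sampleCount];
        for (int i = 0; i < sampleCount; i++) {
            int lo = bytes[2 * i] & 0xFF; // low byte, treated as unsigned
            int hi = bytes[2 * i + 1];    // high byte, carries the sign
            samples[i] = (short) ((hi << 8) | lo);
        }
        return samples;
    }
}
```

With this sizing a 320-byte window yields exactly 160 samples, so the array length and the number of meaningful values agree.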
In FFT() you check:
// radix 2 Cooley-Tukey FFT
if (N % 2 != 0) { throw new RuntimeException("N is not a power of 2"); }
...At first I thought "that's not checking if it is a power of 2", but since you call it recursively, it does eventually become a valid test. As it happens, the exception is thrown the first time through, because I've got 160 values in an array with capacity for 319, and 319 is odd.
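A sketch of checking the length up front and zero-padding to the next power of two, rather than relying on the recursive N % 2 test. The class and method names are hypothetical and not part of the library. Note that neither 160 nor 319 is a power of two, so a 10ms window at 16kHz would need padding (to 256) even with correct sizing.

```java
import java.util.Arrays;

final class FftSizing {
    // True only for 1, 2, 4, 8, ...: a power of two has a single bit set.
    static boolean isPowerOfTwo(int n) {
        return n > 0 && (n & (n - 1)) == 0;
    }

    // Smallest power of two that is >= n.
    static int nextPowerOfTwo(int n) {
        if (n < 1 || n > (1 << 30)) {
            throw new IllegalArgumentException("n out of range: " + n);
        }
        int p = 1;
        while (p < n) {
            p <<= 1;
        }
        return p;
    }

    // Copies the samples into a longer array; the extra slots stay 0.0.
    static double[] zeroPad(double[] samples) {
        return Arrays.copyOf(samples, nextPowerOfTwo(samples.length));
    }
}
```

If the input is padded, bin k corresponds to k * sampleRate / N, where N is the padded length rather than the original sample count.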
I've changed my window size to 8ms and removed the "+1" mentioned above, but now when FFT returns, the first element always has a 0.0 imaginary component. As a result, findMaxMagnitude() finds a huge value at index 0 and picks it as the top result, so the frequency is always 0 and my VAD never detects any speech.
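One possible explanation, offered as an assumption rather than a diagnosis: for real-valued input, bin 0 of a DFT is simply the sum of the samples (the DC term), so its imaginary part is always 0, and any constant offset in the samples makes it large. The sketch below subtracts the mean and searches only bins 1..N/2. All names are hypothetical, and the naive O(N^2) DFT is there only to keep the example self-contained; it is not the library's FFT.

```java
final class DcAwarePeak {
    // Subtract the mean so that bin 0 (the DC term) ends up close to zero.
    static double[] removeMean(double[] samples) {
        double sum = 0.0;
        for (double s : samples) {
            sum += s;
        }
        double mean = samples.length == 0 ? 0.0 : sum / samples.length;
        double[] out = new double[samples.length];
        for (int i = 0; i < samples.length; i++) {
            out[i] = samples[i] - mean;
        }
        return out;
    }

    // Naive DFT magnitudes for real input (illustration only, O(N^2)).
    static double[] magnitudes(double[] x) {
        int n = x.length;
        double[] mag = new double[n];
        for (int k = 0; k < n; k++) {
            double re = 0.0;
            double im = 0.0;
            for (int t = 0; t < n; t++) {
                double angle = -2.0 * Math.PI * k * t / n;
                re += x[t] * Math.cos(angle);
                im += x[t] * Math.sin(angle);
            }
            mag[k] = Math.hypot(re, im);
        }
        return mag;
    }

    // Largest bin in 1..N/2: skips DC and the mirrored upper half.
    static int peakBin(double[] mag) {
        if (mag.length < 2) {
            throw new IllegalArgumentException("need at least 2 bins");
        }
        int best = 1;
        for (int k = 2; k <= mag.length / 2; k++) {
            if (mag[k] > mag[best]) {
                best = k;
            }
        }
        return best;
    }

    // Convert a bin index to a frequency in Hz for an N-point transform.
    static double binToHz(int bin, int n, double sampleRate) {
        return bin * sampleRate / n;
    }
}
```

For an 8ms window at 16kHz (128 samples), a 1 kHz tone falls in bin 8, since 8 * 16000 / 128 = 1000. Either step alone (removing the mean, or skipping bin 0 in the peak search) should stop the DC term from winning.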
Open issues here https://github.com/goxr3plus/java-google-speech-api