pystoi Weird STOI Output

Hi,

Recently I was trying to evaluate some signals by calculating the STOI of each signal with this package, using the pystoi.stoi.stoi function. When I pass two identical signals as ref_signal and processed_signal, it outputs exactly 1, as expected. However, when I replaced the processed signal with microphone signals recorded with and without background music playing, the STOI of the signal with background music present was always higher, which makes no sense. I'm wondering whether I'm using the function the wrong way, or whether something is wrong with my audio files or my understanding of STOI.

I've uploaded my audio files and the code I use to evaluate STOI here: https://github.com/nanaChang/stoiCheckFile

Thank you!

Jul 24 '20 06:07 nanaChang

I didn't check your files, but are you sure they are completely synced with each other? If they are aligned, this result does sound weird.

Jul 24 '20 09:07 mpariente

Hi,

Thanks for replying! I'm pretty sure they are all aligned correctly. There should be a tiny delay between the reference signal and the signal received by the microphone, given the travel time from the speaker to my microphone array, but I placed the speaker and the microphone array close to each other (0.3 meters apart), so I think this hardly affects the result. Still, to account for possible misalignment, I recalculated the STOI with the reference signal delayed by 14 frames to minimize the effect of the travel time, and the results were similar to the original ones.
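
One way to check the alignment rather than assume it is to estimate the lag by cross-correlation. The following is a minimal sketch in plain Python, assuming both signals are lists of floats at the same sample rate. The names `estimate_lag` and `max_lag` are hypothetical and not part of pystoi, and the toy signals stand in for real recordings.

```python
def estimate_lag(ref, rec, max_lag):
    """Return the lag (in samples) at which rec best matches ref.

    A positive result means rec is delayed relative to ref.
    Brute-force cross-correlation: fine for short excerpts, slow for long files.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, r in enumerate(ref):
            j = i + lag
            if 0 <= j < len(rec):
                score += r * rec[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Toy check: the "recording" is the reference delayed by 14 samples.
ref = [0.0] * 50
ref[10], ref[11], ref[12] = 1.0, -0.5, 0.25
rec = [0.0] * 14 + ref[:-14]
lag = estimate_lag(ref, rec, max_lag=30)
print(lag)
```

The acoustic path is only one contribution to the delay. Playback and capture buffering may add more, so it may be worth measuring the lag on a short excerpt instead of deriving it from the speaker-to-microphone distance alone.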

Thank you again for reviewing my issues!

Jul 27 '20 01:07 nanaChang

That's counter-intuitive. Do you have Matlab by any chance? The code is unit tested, but maybe something weird is happening, IDK.

Jul 27 '20 07:07 mpariente

I just ran the Matlab test code and got 0.1973 for the signal with background music and 0.1105 for the signal without background music. I wonder whether all my signals are so noisy that STOI cannot distinguish the recordings with background music from those without. Still, I wanted to bring this up because I have about 100 signals with and without background music, and almost all of them have a higher STOI when BGM is present.

Thank you!

Jul 29 '20 01:07 nanaChang

Thanks a lot for running the tests in Matlab! This is indeed a very interesting observation; @chtaal might have an explanation for it.

I don't have any intuition as to why this would be the case, sorry.

Jul 29 '20 19:07 mpariente

Your scores are below 0.4, which basically means STOI says the speech is not intelligible. Have a look at Fig. 4 in http://cas.et.tudelft.nl/pubs/Taal2011_1.pdf, where you can see real listening test scores vs. STOI predictions.

You have to call STOI with a clean signal and a distorted (less intelligible) version of the SAME speech signal. The signals have to be time-aligned. Based on the file size, it seems that 'refSpeech.wav' might not be the same time-aligned speech signal as the one used in audio_withBGM.wav? I think it would make more sense to use audio_withoutBGM.wav as the reference signal (assuming it's 100% intelligible) and audio_withBGM.wav as the distorted version.
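
The requirement above (same speech, time-aligned, same length) can be sketched as follows. This is an illustration under stated assumptions, not the evaluation script from the linked repository: `align_and_trim` and `stoi_score` are hypothetical helpers, the lag is assumed to have been measured separately, and the short lists stand in for real audio. The call `stoi(clean, degraded, fs, extended=False)` follows pystoi's documented usage.

```python
def align_and_trim(ref, rec, lag):
    """Compensate a known lag, then cut both signals to a common length.

    lag >= 0: rec is delayed, so drop its first `lag` samples.
    lag <  0: ref is delayed, so drop its first `-lag` samples.
    """
    if lag >= 0:
        rec = rec[lag:]
    else:
        ref = ref[-lag:]
    n = min(len(ref), len(rec))
    return ref[:n], rec[:n]


def stoi_score(clean, degraded, fs):
    """Score a degraded signal against its time-aligned clean reference.

    Needs NumPy and pystoi; they are imported lazily so that
    align_and_trim can be used without them.
    """
    import numpy as np
    from pystoi import stoi
    return stoi(np.asarray(clean, dtype=float),
                np.asarray(degraded, dtype=float),
                fs, extended=False)


# Toy data: rec is ref delayed by 2 samples and cut short.
ref = [0.1, 0.2, 0.3, 0.4, 0.5]
rec = [0.0, 0.0, 0.1, 0.2, 0.3, 0.4]
clean, degraded = align_and_trim(ref, rec, lag=2)
print(clean, degraded)
```

On real recordings, `stoi_score(clean, degraded, fs)` would then be called on the aligned pair, with `fs` set to the files' actual sample rate.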

Jul 30 '20 07:07 chtaal

Hi @chtaal !

Thank you for replying! The reason the files audio_withBGM.wav and audio_withoutBGM.wav are much larger than refSpeech.wav is that they were recorded with a 4-channel microphone array, so they are about 4 times the size of the 1-channel refSpeech.wav. However, I used only channel 2 when calculating STOI, so I suppose this shouldn't be a problem.
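
For reference, picking one channel out of a multichannel recording looks roughly like the sketch below. It assumes interleaved samples (the layout PCM WAV files use) and a hypothetical helper `pick_channel`. Note that "channel 2" is index 1 when counting from zero, which is an easy place for an off-by-one.

```python
def pick_channel(interleaved, n_channels, channel_index):
    """Return the samples of one channel from an interleaved stream.

    channel_index is zero-based, so the second channel is index 1.
    """
    if not 0 <= channel_index < n_channels:
        raise ValueError("channel_index out of range")
    return interleaved[channel_index::n_channels]


# Two frames of a 4-channel recording: frame 0 is 10..13, frame 1 is 20..23.
frames = [10, 11, 12, 13, 20, 21, 22, 23]
second_channel = pick_channel(frames, n_channels=4, channel_index=1)
print(second_channel)
```

With an array of shape (samples, channels), as NumPy-based WAV readers typically return, the equivalent selection is `data[:, 1]`.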

I evaluated STOI with a couple of my processed audio files as the processed signal and audio_withoutBGM.wav as the reference, as you suggested, and the trend is now more similar to the PESQ results (thank you again for this helpful suggestion!). Yet it still feels odd to me, since audio_withoutBGM.wav is the raw signal I want to test my beamforming algorithm on. If this file, rather than refSpeech.wav, is taken as the reference signal, how can I assess the audio quality of my beamforming algorithm when the reference is itself a raw microphone signal?

Thank you again!

Aug 03 '20 01:08 nanaChang
