Input features for MarbleNet-VAD
I am performing VAD using examples/asr/vad_infer.py. From the code, it looks like MarbleNet operates directly on raw audio data using convolution filters, with no separate feature extraction step.
However, the published MarbleNet paper from NVIDIA describes first extracting MFCCs from the audio, with the model then operating on those MFCCs.
Am I correct with this observation?
Hi, can somebody clarify?
@bchinnari Sorry I missed the issue and thanks for the question.
No. MarbleNet does use MFCCs in vad_infer.py. Please have a look at the model yaml file.
vad_multilingual_marblenet, on the other hand, uses a Mel spectrogram (the PR for its inference postprocessing yaml will be submitted this week).
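If you want to verify this yourself, a minimal sketch along these lines should work (assuming the pretrained checkpoint name `vad_marblenet` is available via `from_pretrained`; swap in the checkpoint you actually use):

```python
import nemo.collections.asr as nemo_asr

# Load the pretrained VAD checkpoint and inspect which preprocessor its yaml configures.
vad_model = nemo_asr.models.EncDecClassificationModel.from_pretrained("vad_marblenet")

# The preprocessor class (e.g. an MFCC preprocessor) and its parameters come from
# the model's yaml config, not from the inference script.
print(type(vad_model.preprocessor).__name__)
print(vad_model.cfg.preprocessor)
```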
Thanks for the response. I asked the question for the following reason. I am running VAD on a test file as follows:
python examples/asr/speech_classification/vad_infer.py --config-path="../conf/vad" --config-name="vad_inference_postprocessing.yaml" dataset=test.json
-bash-4.2$ cat test.json
{"audio_filepath": "out.wav", "offset": 0, "duration": null, "label": "infer", "text": "-"}
out.wav is 2.9 s long. I am using a 0.3 s window (0.3 x 16000 = 4800 samples) with a shift of 0.05 s (i.e., 20 windows per second, so the full audio has 2.9 x 20 = 58 windows). When I check the shape of the data in https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/asr/parts/utils/vad_utils.py at line 1046,
log_probs = vad_model(input_signal=test_batch[0], input_signal_length=test_batch[1])
printing the shape of test_batch[0] gives [58, 4800], which is (numWindows_in_audio, rawAudioSamples_per_window). So I thought raw audio is the input to the model. Let me know where the MFCC extraction happens on the audio data.
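For reference, a quick sanity check of the arithmetic above (assuming a 16 kHz sample rate and the window/shift values from the config):

```python
# Sanity check of the shapes described above (values assumed from the config).
sample_rate = 16000
window_sec, shift_sec, audio_sec = 0.30, 0.05, 2.9

samples_per_window = round(window_sec * sample_rate)  # 0.3 s * 16000 Hz = 4800 samples
num_windows = round(audio_sec / shift_sec)            # 2.9 s / 0.05 s = 58 windows
print(samples_per_window, num_windows)                # -> 4800 58
```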
Hi @bchinnari, you can find the preprocessor here.
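In other words, the raw [58, 4800] windows are what gets passed into the model object, and the MFCC extraction happens inside the model's forward pass via its preprocessor. A rough, hedged illustration of that step (again assuming the pretrained name `vad_marblenet`; the shapes below mirror the batch described above) would be:

```python
import torch
import nemo.collections.asr as nemo_asr

# Load the model whose forward pass vad_infer.py ultimately calls.
vad_model = nemo_asr.models.EncDecClassificationModel.from_pretrained("vad_marblenet")

# A dummy batch shaped like the one above: 58 windows of 4800 raw audio samples each.
input_signal = torch.zeros(58, 4800)
input_signal_length = torch.full((58,), 4800, dtype=torch.long)

# The preprocessor (configured in the model yaml) converts raw audio into MFCC features;
# this is the step that runs inside vad_model(...) before the encoder sees the data.
features, feature_lengths = vad_model.preprocessor(
    input_signal=input_signal, length=input_signal_length
)
print(features.shape)  # roughly [58, n_mfcc, num_frames]
```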