Input features for MarbleNet-VAD
I am performing VAD using examples/asr/vad_infer.py. From the code, it looks like MarbleNet operates directly on raw audio data using convolution filters, with no separate feature extraction step.
However, the published MarbleNet paper from NVIDIA describes first extracting MFCCs from the audio, with the model then operating on those MFCCs.
Am I correct with this observation?
Hi, can somebody clarify?
@bchinnari Sorry I missed the issue and thanks for the question.
No. MarbleNet does use MFCCs in vad_infer.py. Please have a look at the model yaml file.
vad_multilingual_marblenet, on the other hand, uses a Mel spectrogram (the PR for its inference postprocessing yaml will be submitted this week).
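If you want to verify this yourself, a minimal sketch along these lines should work (assuming the pretrained checkpoint name `vad_marblenet` is available via `from_pretrained`; swap in the checkpoint you actually use):

```python
import nemo.collections.asr as nemo_asr

# Load the pretrained VAD checkpoint and inspect which preprocessor its yaml configures.
vad_model = nemo_asr.models.EncDecClassificationModel.from_pretrained("vad_marblenet")

# The preprocessor class (e.g. an MFCC preprocessor) and its parameters come from
# the model's yaml config, not from the inference script.
print(type(vad_model.preprocessor).__name__)
print(vad_model.cfg.preprocessor)
```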
Thanks for the response. I asked the question for the following reason. I am running VAD on a test file as follows:
python examples/asr/speech_classification/vad_infer.py --config-path="../conf/vad" --config-name="vad_inference_postprocessing.yaml" dataset=test.json
-bash-4.2$ cat test.json
{"audio_filepath": "out.wav", "offset": 0, "duration": null, "label": "infer", "text": "-"}
out.wav is 2.9 s long. I am using a 0.3 s window (0.3 x 16000 = 4800 samples) with a shift of 0.05 s (i.e., 20 windows per second, so the full audio has 2.9 x 20 = 58 windows). When I check the shape of the data in https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/asr/parts/utils/vad_utils.py at line 1046,
log_probs = vad_model(input_signal=test_batch[0], input_signal_length=test_batch[1])
printing the shape of test_batch[0] gives [58, 4800], which is (numWindows_in_audio, rawAudioSamples_per_window). So I thought raw audio is the input to the model. Let me know where the MFCC extraction happens on the audio data.
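For reference, a quick sanity check of the arithmetic above (assuming a 16 kHz sample rate and the window/shift values from the config):

```python
# Sanity check of the shapes described above (values assumed from the config).
sample_rate = 16000
window_sec, shift_sec, audio_sec = 0.30, 0.05, 2.9

samples_per_window = round(window_sec * sample_rate)  # 0.3 s * 16000 Hz = 4800 samples
num_windows = round(audio_sec / shift_sec)            # 2.9 s / 0.05 s = 58 windows
print(samples_per_window, num_windows)                # -> 4800 58
```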
Hi @bchinnari, you can find the preprocessor here.
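In other words, the raw [58, 4800] windows are what gets passed into the model object, and the MFCC extraction happens inside the model's forward pass via its preprocessor. A rough, hedged illustration of that step (again assuming the pretrained name `vad_marblenet`; the shapes below mirror the batch described above) would be:

```python
import torch
import nemo.collections.asr as nemo_asr

# Load the model whose forward pass vad_infer.py ultimately calls.
vad_model = nemo_asr.models.EncDecClassificationModel.from_pretrained("vad_marblenet")

# A dummy batch shaped like the one above: 58 windows of 4800 raw audio samples each.
input_signal = torch.zeros(58, 4800)
input_signal_length = torch.full((58,), 4800, dtype=torch.long)

# The preprocessor (configured in the model yaml) converts raw audio into MFCC features;
# this is the step that runs inside vad_model(...) before the encoder sees the data.
features, feature_lengths = vad_model.preprocessor(
    input_signal=input_signal, length=input_signal_length
)
print(features.shape)  # roughly [58, n_mfcc, num_frames]
```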