
[MBT] Input Data Format for AudioSet

Open nku-zhichengzhang opened this issue 2 years ago • 2 comments

Hello,

I'm trying to reproduce the results of your paper "Attention Bottlenecks for Multimodal Fusion" and to apply MBT to other audiovisual video classification tasks.

However, preprocessing the datasets (e.g. AudioSet, Kinetics-Sounds) is non-trivial, even with the examples provided for ViViT. The most confusing part is extracting the audio (i.e. the spectrogram). The recommended DMVR script ("DMVR/examples/generate_from_file.py") extracts all-zero audio signals. In addition, the extracted audio is not in spectrogram form. Are there some details I missed?
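To narrow down where the all-zero audio comes from, it can help to check the intermediate WAV file before it goes into any record-writing step. The following is only a minimal sketch using the Python standard library. It assumes the audio has already been extracted to a 16-bit PCM WAV file. The helper names (`read_pcm16`, `is_all_zero`, `make_wav`) are hypothetical and are not part of DMVR or Scenic.

```python
import io
import struct
import wave


def read_pcm16(wav_file):
    """Read a 16-bit PCM WAV (path or file object) into a list of ints.

    Multi-channel audio is returned interleaved, which is fine for a
    simple silence check.
    """
    with wave.open(wav_file, "rb") as w:
        if w.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        raw = w.readframes(w.getnframes())
    count = len(raw) // 2
    # WAV stores samples little-endian, hence the "<" prefix.
    return list(struct.unpack("<%dh" % count, raw))


def is_all_zero(samples):
    """True if every sample is exactly zero (an empty clip counts as silent)."""
    return all(s == 0 for s in samples)


def make_wav(samples, rate=16000):
    """Build an in-memory mono WAV for demonstration purposes only."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(struct.pack("<%dh" % len(samples), *samples))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(is_all_zero(read_pcm16(make_wav([0] * 160))))
    print(is_all_zero(read_pcm16(make_wav([0, 1000, -1000, 0]))))
```

If the WAV file itself already contains only zeros, the problem is in the audio extraction step rather than in the serialization step.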


Could you kindly share a preprocessing example for audio-visual datasets? Thanks.
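For context, the waveform-to-spectrogram step being asked about can be sketched as below. This is not the authors' pipeline. It computes a plain log-magnitude spectrogram with a Hamming window and a naive DFT in pure Python, and it omits the mel filterbank that a log-mel front end would apply. The function names and the default `frame_len=400`, `hop=160` (25 ms and 10 ms at an assumed 16 kHz sample rate) are illustrative assumptions, not confirmed settings of this repository.

```python
import cmath
import math


def hamming(n):
    """Symmetric Hamming window of length n (n must be > 1)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def log_spectrogram(samples, frame_len=400, hop=160, eps=1e-6):
    """Return a list of frames; each frame holds log-magnitudes for
    frequency bins 0..frame_len // 2.

    Uses a naive O(N^2) DFT per frame, so it is only suitable for
    illustration on short clips.
    """
    window = hamming(frame_len)
    n_bins = frame_len // 2 + 1
    frames = []
    # Slide over the signal, dropping the incomplete frame at the end.
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + i] * window[i] for i in range(frame_len)]
        row = []
        for k in range(n_bins):
            acc = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            # eps keeps the log finite for silent frames.
            row.append(math.log(abs(acc) + eps))
        frames.append(row)
    return frames


if __name__ == "__main__":
    # A sine that completes 8 cycles per 64-sample frame.
    tone = [math.sin(2 * math.pi * 8 * n / 64) for n in range(128)]
    spec = log_spectrogram(tone, frame_len=64, hop=32)
    print(len(spec), len(spec[0]))
```

Note that an all-zero input produces a constant spectrogram (every value equals `log(eps)`), so zeroed audio would carry no information after this step either.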

nku-zhichengzhang • Sep 18 '22 07:09

Good question!

yangjiangeyjg • Sep 21 '22 13:09

I ran into the same problem. Did you manage to solve it? Also, could you provide some reference code for preprocessing the AudioSet dataset?

huangfei00 • Jan 22 '24 14:01