scenic
[MBT] Input Data Format for AudioSet
Hello,
I'm working on reproducing the results in your paper "Attention Bottlenecks for Multimodal Fusion" and trying to implement MBT for other audio-visual video classification tasks.
However, preprocessing the datasets (e.g. AudioSet, Kinetics-Sounds) is non-trivial, even with the examples provided in ViViT. The main point of confusion is extracting the audio (i.e. the spectrogram). The recommended DMVR script ("DMVR/examples/generate_from_file.py") extracts all-zero audio signals. Moreover, the extracted audio is a raw waveform, not a spectrogram. Are there any details I missed?
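For reference, here is a minimal NumPy sketch of the waveform-to-log-mel conversion I assume is needed (the MBT paper describes 128 mel bins from 16 kHz audio with a 25 ms window and 10 ms hop; the function names and exact parameters below are my own assumptions, not code from scenic or DMVR):

```python
import numpy as np

def mel_filterbank(sr=16000, n_fft=400, n_mels=128, fmin=0.0, fmax=8000.0):
    """Triangular mel filterbank matrix of shape (n_mels, n_fft // 2 + 1)."""
    # HTK-style Hz <-> mel conversions.
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):        # rising edge of triangle
            if center > left:
                fb[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right):       # falling edge of triangle
            if right > center:
                fb[i - 1, k] = (right - k) / (right - center)
    return fb

def log_mel_spectrogram(audio, sr=16000, win=400, hop=160, n_mels=128):
    """Log-mel spectrogram: 25 ms Hann window, 10 ms hop at 16 kHz."""
    n_frames = 1 + (len(audio) - win) // hop
    window = np.hanning(win)
    frames = np.stack(
        [audio[i * hop : i * hop + win] * window for i in range(n_frames)])
    # Magnitude spectrum per frame, projected onto the mel filterbank.
    spec = np.abs(np.fft.rfft(frames, n=win, axis=1))
    mel = spec @ mel_filterbank(sr, win, n_mels).T
    return np.log(mel + 1e-6)  # small offset avoids log(0)
```

For one second of 16 kHz audio this yields a (98, 128) array, one row per 10 ms frame. Note that if the waveform coming out of DMVR is all zeros, any spectrogram computed from it will be flat, so the decoding step has to be fixed first.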
Could you kindly share a preprocessing example for audio-visual datasets? Thx
Good question!
I encountered the same problem. Did you manage to solve it? Also, could you provide some reference code for preprocessing the AudioSet dataset?