scenic
[MBT] Input Data Format for AudioSet
Hello,
I'm working on reproducing the results in your paper "Attention Bottlenecks for Multimodal Fusion" and trying to implement MBT for other audio-visual video classification tasks.
However, preprocessing the datasets (e.g. AudioSet, Kinetics-Sounds) is non-trivial, even with the examples provided in ViViT. The main point of confusion is extracting the audio (i.e. the spectrogram). The recommended DMVR script ("DMVR/examples/generate_from_file.py") extracts all-zero audio signals. Moreover, the extracted audio is a raw waveform, not a spectrogram. Are there any details I missed?
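For reference, here is a minimal NumPy sketch of the waveform-to-log-mel conversion I assume is needed (the MBT paper describes 128 mel bins from 16 kHz audio with a 25 ms window and 10 ms hop; the function names and exact parameters below are my own assumptions, not code from scenic or DMVR):

```python
import numpy as np

def mel_filterbank(sr=16000, n_fft=400, n_mels=128, fmin=0.0, fmax=8000.0):
    """Triangular mel filterbank matrix of shape (n_mels, n_fft // 2 + 1)."""
    # HTK-style Hz <-> mel conversions.
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):        # rising edge of triangle
            if center > left:
                fb[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right):       # falling edge of triangle
            if right > center:
                fb[i - 1, k] = (right - k) / (right - center)
    return fb

def log_mel_spectrogram(audio, sr=16000, win=400, hop=160, n_mels=128):
    """Log-mel spectrogram: 25 ms Hann window, 10 ms hop at 16 kHz."""
    n_frames = 1 + (len(audio) - win) // hop
    window = np.hanning(win)
    frames = np.stack(
        [audio[i * hop : i * hop + win] * window for i in range(n_frames)])
    # Magnitude spectrum per frame, projected onto the mel filterbank.
    spec = np.abs(np.fft.rfft(frames, n=win, axis=1))
    mel = spec @ mel_filterbank(sr, win, n_mels).T
    return np.log(mel + 1e-6)  # small offset avoids log(0)
```

For one second of 16 kHz audio this yields a (98, 128) array, one row per 10 ms frame. Note that if the waveform coming out of DMVR is all zeros, any spectrogram computed from it will be flat, so the decoding step has to be fixed first.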
Could you kindly share a preprocessing example for audio-visual datasets? Thx
Good question!
I encountered the same problem. Did you manage to solve it? Also, could you provide some reference code for preprocessing the AudioSet dataset?