video and audio corruption reference

Open Curry30Messi opened this issue 9 months ago • 1 comment

The paper states: “For comprehensive studies, following Hendrycks & Dietterich (2019), we introduce 15 types of corruptions for the video modality and 6 for the audio modality. Each type of corruption has five levels of severity.” However, the cited reference deals with image corruption only. Were the video and audio corruption operations based on modality-specific literature, or were they simply chosen by common sense?

Mar 09 '25 12:03 Curry30Messi

Hi, we follow CAV-MAE (Gong et al.) and first extract 10 frames from each video. We then corrupt those frames as images, following the ImageNet-C benchmark. For the audio, we add noise of six types taken from daily scenes.
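
For readers who want a concrete picture of the pipeline described above, here is a minimal sketch. It is not the authors' code. The even frame spacing, the per-severity sigma values, the SNR-based mixing, and all function names (`sample_frame_indices`, `gaussian_noise`, `corrupt_video`, `mix_noise`) are assumptions made for illustration. Only one ImageNet-C style corruption (Gaussian noise) is shown.

```python
import numpy as np

# Illustrative noise strength for severity levels 1..5 (assumed values).
GAUSSIAN_SIGMA = [0.08, 0.12, 0.18, 0.26, 0.38]


def sample_frame_indices(num_frames, num_samples=10):
    """Pick num_samples evenly spaced frame indices (spacing is an assumption)."""
    return np.linspace(0, num_frames - 1, num_samples).round().astype(int)


def gaussian_noise(frame, severity=1, rng=None):
    """Add Gaussian noise to one uint8 frame, in the style of ImageNet-C."""
    if rng is None:
        rng = np.random.default_rng()
    sigma = GAUSSIAN_SIGMA[severity - 1]
    x = frame.astype(np.float32) / 255.0
    noisy = np.clip(x + rng.normal(scale=sigma, size=x.shape), 0.0, 1.0)
    return (noisy * 255).astype(np.uint8)


def corrupt_video(video, severity=1, rng=None):
    """Extract 10 frames from a (T, H, W, C) video and corrupt each as an image."""
    idx = sample_frame_indices(len(video))
    return np.stack([gaussian_noise(video[i], severity, rng) for i in idx])


def mix_noise(audio, noise, snr_db):
    """Add a noise clip to a waveform at a target SNR in dB (assumed scheme)."""
    noise = np.resize(noise, audio.shape)  # loop or trim noise to match length
    p_audio = np.mean(audio ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12  # avoid division by zero
    scale = np.sqrt(p_audio / (p_noise * 10 ** (snr_db / 10)))
    return audio + scale * noise
```

A call such as `corrupt_video(video, severity=3)` returns the 10 corrupted frames. How the five audio severity levels map to noise strength is not stated in the thread, so `snr_db` is left as a free parameter here.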

Mar 17 '25 07:03 mouxingyang