video and audio corruption reference
The paper states: “For comprehensive studies, following Hendrycks & Dietterich (2019), we introduce 15 types of corruptions for the video modality and 6 for the audio modality. Each type of corruption has five levels of severity.” However, the cited reference (Hendrycks & Dietterich, 2019) deals with image corruption. Were the corruption operations for video and audio based on modality-specific literature, or were they chosen based on common sense?
Hi, we follow CAV-MAE (Gong et al.) and first extract 10 frames from each video. We then apply image corruptions to the extracted frames, following the ImageNet-C benchmark. For the audio, we add six types of noise from daily scenes.
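
For readers who want a concrete picture, here is a minimal sketch of the three steps described above: sampling 10 frames, corrupting each frame as an image, and adding noise to the audio. It is not the authors' code. The function names, the uniform frame-sampling rule, the per-severity sigma values (modeled on ImageNet-C-style Gaussian noise), and the SNR-based audio mixing are all assumptions made for illustration; the thread does not say how audio severity is defined.

```python
import math
import random

# Assumed per-severity standard deviations (severity 1..5) for pixels in [0, 1].
# Illustrative values only; not confirmed by the thread.
GAUSSIAN_SIGMA = [0.08, 0.12, 0.18, 0.26, 0.38]


def sample_frame_indices(num_total, num_frames=10):
    """Pick num_frames indices spread evenly over a video with num_total frames."""
    if num_total <= 0:
        raise ValueError("video has no frames")
    step = num_total / num_frames
    # Take the middle of each of the num_frames equal-length segments.
    return [min(num_total - 1, int(step * i + step / 2)) for i in range(num_frames)]


def gaussian_noise(frame, severity, rng):
    """Add zero-mean Gaussian noise to one frame (nested lists of floats in [0, 1])."""
    if not 1 <= severity <= 5:
        raise ValueError("severity must be in 1..5")
    sigma = GAUSSIAN_SIGMA[severity - 1]
    # Clip back to the valid pixel range after adding noise.
    return [[min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row] for row in frame]


def mix_at_snr(clean, noise, snr_db):
    """Scale noise so the clean-to-noise power ratio is snr_db (in dB), then add it."""
    p_clean = sum(x * x for x in clean) / len(clean)
    p_noise = sum(x * x for x in noise) / len(noise)
    if p_noise == 0:
        return list(clean)
    scale = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    # Loop the noise clip if it is shorter than the clean signal.
    return [c + scale * noise[i % len(noise)] for i, c in enumerate(clean)]
```

A corrupted sample would then be built by calling `gaussian_noise` on each of the frames chosen by `sample_frame_indices`, and `mix_at_snr` on the waveform with a recorded noise clip. The other image corruption types would slot in the same way as `gaussian_noise`, one function per type with a severity argument.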