
Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline (CVPR 2023)


[Project page] [ArXiv] [Dataset(Google drive)] [Dataset(Baidu drive)] [Benchmark]

This repository contains code for the CVPR 2023 paper "Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline". The paper introduces the first untrimmed audio-visual dataset, UnAV-100, and proposes to solve the audio-visual event localization problem in more realistic and challenging scenarios.

Requirements

The implementation is based on PyTorch. Follow INSTALL.md to install the required dependencies.

Data preparation

The proposed UnAV-100 dataset can be downloaded from the [Project Page]; the release includes YouTube links to the raw videos, annotations, and extracted features.

If you want to use your own choice of video features, you can download the raw videos from this link (Baidu Drive, pwd: qslx). A download script for the raw videos is also provided at scripts/video_download.py.

Note: after downloading the data, unpack the files under data/unav100. The folder structure should look like:

This folder
│   README.md
│   ...
└───data/
│   └───unav100/
│       └───annotations/
│       │       └───unav100_annotations.json
│       └───av_features/    # all features stored together
│               └───__2MwJ2uHu0_flow.npy
│               └───__2MwJ2uHu0_rgb.npy
│               └───__2MwJ2uHu0_vggish.npy
│               └───...
└───libs/
│   ...
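Each video in av_features/ has three per-stream .npy files following the naming scheme above. As a rough illustration, the snippet below sketches a loader for that layout and exercises it on synthetic arrays; the feature dimensions (1024-d I3D, 128-d VGGish) and sequence lengths are assumptions for the demo, not values taken from the repository.

```python
import os
import tempfile
import numpy as np

def load_av_features(feat_dir, video_id):
    """Load the three feature streams for one video, named as in av_features/.
    (Hypothetical helper; the repo's own dataloader lives under libs/.)"""
    rgb = np.load(os.path.join(feat_dir, f"{video_id}_rgb.npy"))
    flow = np.load(os.path.join(feat_dir, f"{video_id}_flow.npy"))
    audio = np.load(os.path.join(feat_dir, f"{video_id}_vggish.npy"))
    return rgb, flow, audio

# Smoke test with synthetic features (shapes are illustrative only).
feat_dir = tempfile.mkdtemp()
vid = "__2MwJ2uHu0"
np.save(os.path.join(feat_dir, f"{vid}_rgb.npy"), np.zeros((50, 1024), np.float32))
np.save(os.path.join(feat_dir, f"{vid}_flow.npy"), np.zeros((50, 1024), np.float32))
np.save(os.path.join(feat_dir, f"{vid}_vggish.npy"), np.zeros((50, 128), np.float32))

rgb, flow, audio = load_av_features(feat_dir, vid)
print(rgb.shape, flow.shape, audio.shape)
```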

Training

Run train.py to train the model on the UnAV-100 dataset. This will create an experiment folder under ./ckpt that stores the training config, logs, and checkpoints.

python ./train.py ./configs/avel_unav100.yaml --output reproduce
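The training script takes a YAML config as its first positional argument. As a minimal sketch of how such a config can be read and a field overridden before launching, the snippet below parses an illustrative snippet with PyYAML; the keys shown are assumptions, not the actual schema of configs/avel_unav100.yaml.

```python
import yaml  # PyYAML

# Illustrative config fragment; the real schema lives in configs/avel_unav100.yaml.
cfg_text = """
dataset_name: unav100
output_folder: ./ckpt
"""

cfg = yaml.safe_load(cfg_text)
# Analogous in spirit to passing --output reproduce on the command line.
cfg["output_folder"] = "./ckpt/reproduce"
print(cfg)
```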

Evaluation

Run eval.py to evaluate the trained model.

python ./eval.py ./configs/avel_unav100.yaml ./ckpt/avel_unav100_reproduce

[Optional] We also provide a pretrained model for UnAV-100, which can be downloaded from this link.

Citation

If you find our dataset and code useful for your research, please cite our paper:

@inproceedings{geng2023dense,
  title={Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline},
  author={Geng, Tiantian and Wang, Teng and Duan, Jinming and Cong, Runmin and Zheng, Feng},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={22942--22951},
  year={2023}
}

Acknowledgement

The I3D RGB & flow and VGGish audio features were extracted using video_features. Our baseline model is implemented on top of ActionFormer. We thank the authors for sharing their code. If you use our code, please also consider citing their works.