
AVSA

This repository contains the code and dataset for our NeurIPS'20 paper:

Learning Representations from Audio-Visual Spatial Alignment. Pedro Morgado*, Yi Li*, Nuno Vasconcelos. Advances in Neural Information Processing Systems (NeurIPS), 2020.

Prerequisites

Requirements are listed in environment.yml.
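
For example, assuming a conda installation, the environment can be created directly from that file and then activated under whatever name environment.yml declares:

  conda env create -f environment.yml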

Data preparation

YT-360 dataset

YouTube IDs of videos in the YT-360 dataset are provided in datasets/assets/yt360/[train|test].txt, and segment timestamps in datasets/assets/yt360/segments.txt. Please use your favorite YouTube dataset downloader to download the videos (e.g., link), and split them into 10s clips. The dataset should be stored in data/yt360/video and data/yt360/audio with filenames {YOUTUBE_ID}-{SEGMENT_START_TIME}.{EXTENSION}.
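
If you need a starting point for the splitting step, below is a minimal sketch. It is not the authors' preprocessing pipeline; it assumes full videos were downloaded to a hypothetical data/yt360/raw/{YOUTUBE_ID}.mp4 layout, that each line of segments.txt begins with a YouTube ID and a segment start time (verify against the actual file), and that ffmpeg is installed.

  # Sketch: cut downloaded YT-360 videos into 10s clips named
  # {YOUTUBE_ID}-{SEGMENT_START_TIME}.{EXTENSION}.
  import os
  import subprocess

  RAW_DIR = "data/yt360/raw"      # hypothetical location of full downloads
  VIDEO_DIR = "data/yt360/video"
  AUDIO_DIR = "data/yt360/audio"
  CLIP_SECS = 10

  os.makedirs(VIDEO_DIR, exist_ok=True)
  os.makedirs(AUDIO_DIR, exist_ok=True)

  with open("datasets/assets/yt360/segments.txt") as f:
      for line in f:
          # Assumed format: "<youtube_id> <segment_start_time> ..."
          yt_id, start = line.split()[:2]
          src = os.path.join(RAW_DIR, f"{yt_id}.mp4")
          if not os.path.exists(src):
              continue
          clip = os.path.join(VIDEO_DIR, f"{yt_id}-{start}.mp4")
          wav = os.path.join(AUDIO_DIR, f"{yt_id}-{start}.wav")
          # Cut a 10s video clip starting at the segment timestamp.
          # (-c copy cuts at keyframes; drop it and re-encode if
          # frame-accurate 10s clips are required.)
          subprocess.run(["ffmpeg", "-y", "-ss", start, "-t", str(CLIP_SECS),
                          "-i", src, "-c", "copy", clip], check=True)
          # Extract the corresponding audio track as WAV.
          subprocess.run(["ffmpeg", "-y", "-ss", start, "-t", str(CLIP_SECS),
                          "-i", src, "-vn", wav], check=True)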

The pre-extracted segmentation maps can be downloaded from here and extracted to data/yt360/segmentation/.

If you experience issues downloading or processing the dataset, please email the authors at {pmaravil, yil898}@eng.ucsd.edu for assistance.

Pre-trained model

The AVSA model that yields the top performance (trained with configs/main/avsa/Cur-Loc4-TransfD2.yaml) is available here.

Self-supervised training

python main-video-ssl.py [--quiet] cfg

Training configs cfg for the following models are provided:

  • AVC training (instance discrimination): configs/main/avsa/InstDisc.yaml
  • AVSA training: configs/main/avsa/Cur-Loc4-TransfD2.yaml
  • AVSA training w/o curriculum: configs/main/avsa/NoCur-Loc4-TransfD2.yaml
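
For example, to run AVSA pre-training with the curriculum:

python main-video-ssl.py configs/main/avsa/Cur-Loc4-TransfD2.yaml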

Evaluation

Four downstream tasks are supported: binary audio-visual correspondence (AVC-Bin), binary audio-visual spatial alignment (AVSA-Bin), video action recognition (on UCF/HMDB), and audio-visual semantic segmentation.

Action recognition

python eval-action-recg.py [--quiet] cfg model_cfg

Evaluation configs cfg for the UCF and HMDB datasets are provided:

  • UCF: configs/benchmark/ucf/ucf-8at16-fold[1|2|3].yaml
  • HMDB: configs/benchmark/hmdb/hmdb-8at16-fold[1|2|3].yaml

model_cfg is the training config of the model to evaluate, e.g. configs/main/avsa/Cur-Loc4-TransfD2.yaml for AVSA pre-training.
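
For example, to evaluate the AVSA pre-trained model on UCF fold 1:

python eval-action-recg.py configs/benchmark/ucf/ucf-8at16-fold1.yaml configs/main/avsa/Cur-Loc4-TransfD2.yaml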

Semantic segmentation

python eval-audiovisual-segm.py [--quiet] cfg model_cfg

Evaluation configs cfg for three settings are provided:

  • Visual segmentation: configs/benchmark/segmentation/yt360-fpn-4crop-head-vonly.yaml
  • Visual+audio segmentation: configs/benchmark/segmentation/yt360-fpn-4crop-head-audio.yaml
  • Visual+audio segmentation with context: configs/benchmark/segmentation/yt360-fpn-4crop-head-audio-ctx.yaml
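
For example, visual-only segmentation with the AVSA pre-trained model:

python eval-audiovisual-segm.py configs/benchmark/segmentation/yt360-fpn-4crop-head-vonly.yaml configs/main/avsa/Cur-Loc4-TransfD2.yaml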

Binary audio-visual correspondence

python eval-avc.py [--quiet] cfg model_cfg

Evaluation configs cfg for two settings are provided:

  • With transformer: configs/benchmark/avc/avc-transf-[1|4]crop.yaml
  • Without transformer: configs/benchmark/avc/avc-notransf-[1|4]crop.yaml
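
For example, AVC evaluation with the transformer head and 4 crops:

python eval-avc.py configs/benchmark/avc/avc-transf-4crop.yaml configs/main/avsa/Cur-Loc4-TransfD2.yaml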

Binary audio-visual spatial alignment

python eval-avsa.py [--quiet] cfg model_cfg

Evaluation configs cfg for two settings are provided:

  • With transformer: configs/benchmark/avsa/avsa-transf-[1|4]crop.yaml
  • Without transformer: configs/benchmark/avsa/avsa-notransf-[1|4]crop.yaml
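
For example, AVSA evaluation with the transformer head and 4 crops:

python eval-avsa.py configs/benchmark/avsa/avsa-transf-4crop.yaml configs/main/avsa/Cur-Loc4-TransfD2.yaml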

Citations

Please cite our work if you find it helpful for your research:

@article{morgado2020learning,
  title={Learning Representations from Audio-Visual Spatial Alignment},
  author={Morgado, Pedro and Li, Yi and Vasconcelos, Nuno},
  journal={Advances in Neural Information Processing Systems},
  volume={33},
  year={2020}
}

Acknowledgements

This work was partially funded by NSF award IIS-1924937 and NVIDIA GPU donations. We also acknowledge and thank the Nautilus platform, which was used for some of the experiments in the paper.