video_summarization
A deep learning-based solution for efficiently generating keyshot-style spotlights from videos.
Video Summarization of Sports Videos based on MSVA
Based on https://github.com/TIBHannover/MSVA
This repository contains a PyTorch implementation of the MSVA model with different feature vectors. I compared GoogleNet, ResNext, InceptionV3, I3D RGB, I3D FLOW, and ResNet3D to see which one contributes most to the video summarization task. Additionally, I propose a method to process differently annotated sets of videos. Finally, because there is no official split, I demonstrated that the metric highly depends on it.
To get the datasets and weights used in this repository, log into your Google account and run:
- `pip3 install gshell==5.5.2`
- `gshell init` to log into your account

More about gshell: https://pypi.org/project/gshell/
System requirements
- I strongly recommend Linux for performance and compatibility.
- Python 3.8.11 is recommended due to library versions.
- I suggest creating a new virtual environment: `conda create -y -n vsm python=3.8.11`
- To install the libraries, I used pip: `pip3 install -r requirements.txt` (use `pip` instead of `pip3` if your `pip` already points to Python 3)
- A GPU is not mandatory, but it speeds up training.
- The `N_CUDA` environment variable selects which GPU to use when more than one is available.

Note: if the PyTorch installation does not work, run `pip3 install torch==1.8.0+cu111 torchvision==0.9.0+cu111 torchaudio==0.8.0 -f https://download.pytorch.org/whl/torch_stable.html`
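For reference, here is a minimal sketch of how a script could honor `N_CUDA` when picking a device (an illustration, not the repository's exact code):

```python
import os
import torch

# Illustrative only: select the GPU index given by N_CUDA (defaults to 0),
# falling back to CPU when CUDA is unavailable.
gpu_id = int(os.environ.get("N_CUDA", 0))
device = torch.device(f"cuda:{gpu_id}" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
```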
Pretrained models and transformations
To experiment with pretrained models, you first need to download them. The download contains:
- Pretrained MSVA models taken from here: tvsum_random_non_overlap_0.6271.tar.pth and summe_random_non_overlap_0.5359.tar.pth
- Pretrained FLOW and RGB imagenet models taken from here: flow_imagenet.pt and rgb_imagenet.pt
- Pretrained resnet 3D taken from here: r3d101_KM_200ep.pth
- Transformations to fuse all the feature vectors: transformations.pk
./scripts/downloadPTModels.sh #remember to log into your google account
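To sanity-check a download, you can load one of the checkpoints with PyTorch (a quick sketch; whether the file stores a raw state_dict or a wrapper dictionary depends on how it was saved):

```python
import torch

# Load one of the pretrained MSVA checkpoints listed above and inspect its contents
checkpoint = torch.load("tvsum_random_non_overlap_0.6271.tar.pth", map_location="cpu")
if isinstance(checkpoint, dict):
    print(list(checkpoint.keys())[:10])
else:
    print(type(checkpoint))
```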
Dataset
To download the raw datasets (SumMe, TVSum, VSumm, CoSum, and Visiocity), I used the gshell library, which can be installed from pip. Since the download is about 16 GB, I recommend using the script below. Run the following command to get the dataset:
./scripts/downloadDataset.sh
To download it manually, use this link
To generate the h5 files that contain the processed datasets, run the following command for each dataset:
python3 generate_dataset.py --videospath <VIDEO_PATH> --groundtruthpath <GROUND_TRUTH_PATH> --dataset <DATASET> --pathweightsflow <PATH_WEIGHTS_FLOW> --pathweightsrgb <PATH_WEIGHTS_RGB> --pahtweightsr3d101KM <PATH_WEIGHTS_R3D>
- `VIDEO_PATH`: path where the videos are located
- `GROUND_TRUTH_PATH`: path where the ground truth annotations are located
- `DATASET`: dataset name -> summe, tvsum, youtube, ovp or cosum
- `PATH_WEIGHTS_FLOW`: path where the I3D FLOW weights are located
- `PATH_WEIGHTS_RGB`: path where the I3D RGB weights are located
- `PATH_WEIGHTS_R3D`: path where the r3d101KM (ResNet3D) weights are located
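For example, assuming the pretrained weights were downloaded into a local models/ folder and the SumMe videos and annotations into datasets/summe/ (all paths here are placeholders):
python3 generate_dataset.py --videospath datasets/summe/videos --groundtruthpath datasets/summe/GT --dataset summe --pathweightsflow models/flow_imagenet.pt --pathweightsrgb models/rgb_imagenet.pt --pahtweightsr3d101KM models/r3d101_KM_200ep.pth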
The processed-dataset folder contains 9 h5 files for video summarization: 5 generated by the generate_dataset.py code and the rest taken from previous works:
- dataset_cosum_processed.h5
- dataset_ovp_processed.h5
- dataset_summe_processed.h5
- dataset_tvsum_processed.h5
- dataset_youtube_processed.h5
- eccv16_dataset_ovp_google_pool5.h5
- eccv16_dataset_summe_google_pool5.h5
- eccv16_dataset_tvsum_google_pool5.h5
- eccv16_dataset_youtube_google_pool5.h5
To download the processed datasets:
./scripts/downloadPDataset.sh
Each h5 file follows the same data structure:
key | Description |
---|---|
features | 2D-array with shape (n_steps, 1024) containing feature vectors that represent video frames. Each video frame is represented by a feature vector (carrying some semantic meaning) extracted by a pretrained convolutional neural network (e.g. GoogLeNet) trained for the image classification task. It is used in training, test and inference time. |
features_rn | 2D-array with shape (n_steps, 2048) contains feature vectors representing video frames just like features key. In this case, extracted by ResNext 101 32x8d pretrained convolutional neural network. Trained for the image classification task. |
features_iv3 | 2D-array with shape (n_steps, 2048) contains feature vectors representing video frames just like features key. In this case, extracted by Inception V3 pretrained convolutional neural network. Trained for the image classification task. |
features_rgb | 2D-array with shape (n_steps * rate, 1024) contains feature vectors representing video frames just like features key. In this case, extracted by Two-Stream Inflated 3D ConvNets (I3D) pretrained convolutional neural network (RGB features). Trained for the action recognition task. |
features_flow | 2D-array with shape (n_steps * rate, 1024) contains feature vectors representing video frames just like features key. In this case, extracted by Two-Stream Inflated 3D ConvNets (I3D) pretrained convolutional neural network (FLOW features). Trained for the action recognition task. |
features_3D | 2D-array with shape (n_steps * rate, 2048) contains feature vectors representing video frames just like features key. In this case, extracted by ResNet3D pretrained convolutional neural network. Trained for the action recognition task. |
gtscore | 1D-array with shape (n_steps); stores the ground truth importance scores, computed as the average of the multiple annotators' importance scores (used for training, e.g. regression loss). It is used in training and test time. |
user_summary | 2D-array with shape (num_users, n_frames); each row is a binary vector encoding the key-clips selected by one human annotator. The machine summary is compared against each of the user summaries. It is used in test time. |
change_points | 2D-array with shape (num_segments, 2); each row stores the start and end indices of a segment. Segments correspond to shot transitions obtained by temporal segmentation approaches that cut a video into disjoint shots; num_segments is the total number of segments the video is cut into. It is used in test time. |
n_frame_per_seg | 1D-array with shape (num_segments), indicates number of frames in each segment. It is used in test time. |
n_frames | number of frames in original video. It is used in test time. |
fps | frames per second of the original video |
picks | Positions of the subsampled frames in the original video, i.e. an array storing the position information of the subsampled video frames. We do not process every video frame, since adjacent frames are very similar; subsampling at 1 or 2 frames per second yields far fewer but still informative frames. It is useful when interpolating scores for the subsampled frames back onto the original video: picks indicates which frames were scored, and the scores of the surrounding frames can be filled in from them. It is used in test time. |
n_steps | number of subsampled frames. |
gtsummary | 1D-array with shape (n_steps); the ground truth summary provided by the user, a binary vector indicating the indices of keyframes. It is provided by the original datasets and can be used for training (e.g. maximum likelihood loss). |
video_name | original video name |
Note: Not all files from previous works have the same structure
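A quick way to inspect one of these files (a minimal sketch using h5py; it assumes each video is stored as a top-level group, as in the eccv16 files):

```python
import h5py

# Open the processed SumMe dataset and print the structure of the first video
with h5py.File("dataset_summe_processed.h5", "r") as hdf:
    video_key = sorted(hdf.keys())[0]
    for name, value in hdf[video_key].items():
        print(f"{name}: shape={value.shape}")
```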
Training the network
I used Weights & Biases to track the different experiments I did (different features and splits).
- Change or choose one of the `config.json` files located in the configs/ folder.
- Run `wandb login 17d2772d85cbda79162bd975e45fdfbf3bb18911` to use wandb.
- Run the following command to train without wandb tracking:
python3 train_cross_val.py --params <CONFIGFILE_PATH>
- `CONFIGFILE_PATH`: path of the config.json file (see Hyperparameters Configuration)
You can see the Weights & Biases report here: https://wandb.ai/stevramos/sports_video_summarization
Configuration flags for the training script
Other flags for the train_cross_val.py script:
- `--no_wandb`: do not use Weights & Biases (optional, used by default)
- `--pretrained_model`: path of a pretrained model (optional)
- Using wandb without sweep:
  - `--wandb`: use Weights & Biases (required for using wandb)
  - `--run_name`: name of the run to save in wandb (optional)
  - `--run_notes`: notes of the run to save in wandb (optional)
- Using sweep:
  - `--use_sweep`: use a sweep (required for using sweep; the wandb flag is true by default)
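For example, a wandb-tracked run could be launched like this (the run name and notes are just placeholders):
python3 train_cross_val.py --params <CONFIGFILE_PATH> --wandb --run_name googlenet_resnext_flow_r3d --run_notes "feature combination experiment"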
Hyperparameters Configuration
I used the same hyperparameters as the original code, but with different combinations of feature vectors and vector sizes. You can modify the following variables in the config.json files located in the configs folder:
- To enable or disable a given feature, set the corresponding key to true or false: `googlenet`, `resnext`, `inceptionv3`, `i3d_rgb`, `i3d_flow`, `resnet3d`
- `feature_len`: 1024 (to normalize the vectors; allowed for all the pretrained models) or 2048 (to use the original vectors; cannot be used with googlenet, i3d_rgb and i3d_flow)
- `type_dataset`: the dataset on which to test the model (tvsum or summe)
- `type_setting`: canonical, aug (augmented), transfer, non_overlap_*_[aug]
The splits for these settings are given by different previous works (see the splits folder).
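The sketch below illustrates the feature-selection keys described above as a Python dictionary written to JSON; it is only an assumption of the overall shape, and the real files in configs/ also contain the remaining training hyperparameters:

```python
import json

# Hypothetical config fragment: enable GoogleNet, ResNext, I3D FLOW and ResNet3D
config = {
    "googlenet": True,
    "resnext": True,
    "inceptionv3": False,
    "i3d_rgb": False,
    "i3d_flow": True,
    "resnet3d": True,
    "feature_len": 1024,       # 1024 = normalized vectors, 2048 = original vectors
    "type_dataset": "summe",   # dataset used for testing: "summe" or "tvsum"
    "type_setting": "canonical",
}

with open("configs/my_config.json", "w") as f:
    json.dump(config, f, indent=2)
```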
Cross validation
Following prior works, I used cross-validation to evaluate performance.
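For reference, the F1 score between a binary machine summary and the annotators' summaries (the user_summary key above) can be computed roughly as follows; this is a minimal sketch, and the aggregation over annotators should be checked against the evaluation code:

```python
import numpy as np

def f1_score(machine_summary: np.ndarray, user_summary: np.ndarray) -> float:
    """Per-frame F1 between one binary machine summary and one binary user summary."""
    overlap = int((machine_summary & user_summary).sum())
    if overlap == 0:
        return 0.0
    precision = overlap / machine_summary.sum()
    recall = overlap / user_summary.sum()
    return 2 * precision * recall / (precision + recall)

# Toy example: compare against every annotator, then aggregate
machine = np.random.randint(0, 2, size=1000)
users = np.random.randint(0, 2, size=(15, 1000))  # shape (num_users, n_frames)
scores = [f1_score(machine, u) for u in users]
print("mean F1:", np.mean(scores), "max F1:", np.max(scores))
```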
Baseline model
I trained the MSVA model with the original features:
- F1 Score SumMe: 0.476
- F1 Score TVSum: 0.594


Grid Search
I trained the model 64 times with the different features extracted above to see which one contributes most. Interestingly, on both sets of videos one of the best-performing combinations uses the GoogleNet, ResNext, I3D FLOW and ResNet3D descriptors.


GoogleNet, ResNext, I3D FLOW and ResNet3D vs. GoogleNet, I3D RGB and I3D FLOW
The model with the new descriptors outperforms the baseline model on both SumMe and TVSum.
Note: the comparison in the images shown below was done on a single split.


Comparison of F1-Score with the baseline model under Canonical, Augmented and Transfer settings.
Model | SumMe Canonical | SumMe Augmented | SumMe Transfer | TVSum Canonical | TVSum Augmented | TVSum Transfer |
---|---|---|---|---|---|---|
Baseline model | 0.476 | - | - | 0.594 | - | - |
Model with new descriptors | 0.499 | 0.48 | 0.43 | 0.613 | 0.62 | 0.57 |
API
I developed a REST API (app.py) using the FastAPI framework, which provides several services:
- `summarize-video`: receives a video, saves it to Google Drive, and then summarizes it (scores between 0 and 1)
- `get-spotlight`: receives the proportion of the original length that the user wants the summary video to be
- `download-spotlight`: downloads the summary video
To run the app:
uvicorn app:app --reload --host 0.0.0.0
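Once the server is running, a request might look roughly like the sketch below; the HTTP method, exact route, and form field name are assumptions and should be checked against app.py (FastAPI's automatic docs at http://localhost:8000/docs list the real signatures):

```python
import requests

BASE_URL = "http://localhost:8000"  # default uvicorn port; adjust if needed

# Hypothetical call: upload a video for summarization (the field name "file" is an assumption)
with open("match.mp4", "rb") as f:
    response = requests.post(f"{BASE_URL}/summarize-video", files={"file": f})

print(response.status_code)
print(response.json())
```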
This API is used by the following front-end app: https://github.com/StevRamos/video_summarization_app
Acknowledgments
I would like to thank the following repositories for releasing the evaluation code / data / splits / metrics that made my research possible.