video_summarization
A deep learning-based solution for efficiently generating keyshot-style spotlights from videos.
Video Summarization of Sports Videos based on MSVA
Based on https://github.com/TIBHannover/MSVA
This repository contains a PyTorch implementation of the MSVA model with different feature vectors. I compared GoogleNet, ResNext, InceptionV3, I3D RGB, I3D FLOW, and ResNet3D to see which one contributes most to the video summarization task. Additionally, I propose a method to process differently annotated sets of videos. Finally, because there is no official split, I demonstrated that the metric highly depends on it.
To get the datasets and weights used in this repository, log into your Google account and run:
- `pip3 install gshell==5.5.2`
- `gshell init` to log into your account

More about gshell: https://pypi.org/project/gshell/
System requirements
- I strongly recommend Linux for performance and compatibility.
- Python 3.8.11 is recommended due to library versions.
- I suggest creating a new virtual environment: `conda create -y -n vsm python=3.8.11`
- To install the libraries, I used pip: `pip3 install -r requirements.txt` (use `pip` instead of `pip3` if your `pip` already points to Python 3)
- A GPU is not mandatory, but it speeds up training.
- The `N_CUDA` environment variable selects which GPU to use when more than one is available.

Note: if the PyTorch installation does not work, run `pip3 install torch==1.8.0+cu111 torchvision==0.9.0+cu111 torchaudio==0.8.0 -f https://download.pytorch.org/whl/torch_stable.html`
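For reference, here is a minimal sketch of how a script could honor `N_CUDA` when picking a device (an illustration, not the repository's exact code):

```python
import os
import torch

# Illustrative only: select the GPU index given by N_CUDA (defaults to 0),
# falling back to CPU when CUDA is unavailable.
gpu_id = int(os.environ.get("N_CUDA", 0))
device = torch.device(f"cuda:{gpu_id}" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
```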
Pretrained models and transformations
To experiment with pretrained models, you first need to download them. The download contains:
- Pretrained MSVA models taken from here: tvsum_random_non_overlap_0.6271.tar.pth and summe_random_non_overlap_0.5359.tar.pth
- Pretrained FLOW and RGB imagenet models taken from here: flow_imagenet.pt and rgb_imagenet.pt
- Pretrained resnet 3D taken from here: r3d101_KM_200ep.pth
- Transformations to fuse all the feature vectors: transformations.pk
./scripts/downloadPTModels.sh #remember to log into your google account
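To sanity-check a download, you can load one of the checkpoints with PyTorch (a quick sketch; whether the file stores a raw state_dict or a wrapper dictionary depends on how it was saved):

```python
import torch

# Load one of the pretrained MSVA checkpoints listed above and inspect its contents
checkpoint = torch.load("tvsum_random_non_overlap_0.6271.tar.pth", map_location="cpu")
if isinstance(checkpoint, dict):
    print(list(checkpoint.keys())[:10])
else:
    print(type(checkpoint))
```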
Dataset
To download the raw datasets (SumMe, TVSum, VSumm, CoSum, and Visiocity), I used the gshell library, which can be installed from pip. Since the download is about 16 GB, I recommend using the script below. Run the following command to get the dataset:
./scripts/downloadDataset.sh
To download it manually, use this link
To generate the h5 files that contain the processed datasets, run the following command for each dataset:
python3 generate_dataset.py --videospath <VIDEO_PATH> --groundtruthpath <GROUND_TRUTH_PATH> --dataset <DATASET> --pathweightsflow <PATH_WEIGHTS_FLOW> --pathweightsrgb <PATH_WEIGHTS_RGB> --pahtweightsr3d101KM <PATH_WEIGHTS_R3D>
- `VIDEO_PATH`: path where the videos are located
- `GROUND_TRUTH_PATH`: path where the ground truth annotations are located
- `DATASET`: dataset name -> summe, tvsum, youtube, ovp or cosum
- `PATH_WEIGHTS_FLOW`: path where the I3D FLOW weights are located
- `PATH_WEIGHTS_RGB`: path where the I3D RGB weights are located
- `PATH_WEIGHTS_R3D`: path where the r3d101KM (ResNet3D) weights are located
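For example, assuming the pretrained weights were downloaded into a local models/ folder and the SumMe videos and annotations into datasets/summe/ (all paths here are placeholders):
python3 generate_dataset.py --videospath datasets/summe/videos --groundtruthpath datasets/summe/GT --dataset summe --pathweightsflow models/flow_imagenet.pt --pathweightsrgb models/rgb_imagenet.pt --pahtweightsr3d101KM models/r3d101_KM_200ep.pth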
The processed-dataset folder contains 9 h5 files for video summarization: 5 generated by the generate_dataset.py code and the rest taken from previous works:
- dataset_cosum_processed.h5
- dataset_ovp_processed.h5
- dataset_summe_processed.h5
- dataset_tvsum_processed.h5
- dataset_youtube_processed.h5
- eccv16_dataset_ovp_google_pool5.h5
- eccv16_dataset_summe_google_pool5.h5
- eccv16_dataset_tvsum_google_pool5.h5
- eccv16_dataset_youtube_google_pool5.h5
To download the processed datasets:
./scripts/downloadPDataset.sh
Each h5 file follows the same data structure:
key | Description |
---|---|
features | 2D-array with shape (n_steps, 1024) containing feature vectors that represent video frames. Each video frame is represented by a feature vector (carrying some semantic meaning) extracted by a pretrained convolutional neural network (e.g. GoogLeNet) trained for the image classification task. It is used in training, test and inference time. |
features_rn | 2D-array with shape (n_steps, 2048) contains feature vectors representing video frames just like features key. In this case, extracted by ResNext 101 32x8d pretrained convolutional neural network. Trained for the image classification task. |
features_iv3 | 2D-array with shape (n_steps, 2048) contains feature vectors representing video frames just like features key. In this case, extracted by Inception V3 pretrained convolutional neural network. Trained for the image classification task. |
features_rgb | 2D-array with shape (n_steps * rate, 1024) contains feature vectors representing video frames just like features key. In this case, extracted by Two-Stream Inflated 3D ConvNets (I3D) pretrained convolutional neural network (RGB features). Trained for the action recognition task. |
features_flow | 2D-array with shape (n_steps * rate, 1024) contains feature vectors representing video frames just like features key. In this case, extracted by Two-Stream Inflated 3D ConvNets (I3D) pretrained convolutional neural network (FLOW features). Trained for the action recognition task. |
features_3D | 2D-array with shape (n_steps * rate, 2048) contains feature vectors representing video frames just like features key. In this case, extracted by ResNet3D pretrained convolutional neural network. Trained for the action recognition task. |
gtscore | 1D-array with shape (n_steps); stores the ground truth importance scores, computed as the average of the multiple annotators' importance scores (used for training, e.g. regression loss). It is used in training and test time. |
user_summary | 2D-array with shape (num_users, n_frames); each row is a binary vector encoding the key-clips selected by one human annotator. The machine summary is compared against each of the user summaries. It is used in test time. |
change_points | 2D-array with shape (num_segments, 2); each row stores the start and end indices of a segment. Segments correspond to shot transitions obtained by temporal segmentation approaches that cut a video into disjoint shots; num_segments is the total number of segments the video is cut into. It is used in test time. |
n_frame_per_seg | 1D-array with shape (num_segments), indicates number of frames in each segment. It is used in test time. |
n_frames | number of frames in original video. It is used in test time. |
fps | frames per second of the original video |
picks | Positions of the subsampled frames in the original video, i.e. an array storing the position information of the subsampled video frames. We do not process every video frame, since adjacent frames are very similar; subsampling at 1 or 2 frames per second yields far fewer but still informative frames. It is useful when interpolating scores for the subsampled frames back onto the original video: picks indicates which frames were scored, and the scores of the surrounding frames can be filled in from them. It is used in test time. |
n_steps | number of subsampled frames. |
gtsummary | 1D-array with shape (n_steps); the ground truth summary provided by the user, a binary vector indicating the indices of keyframes. It is provided by the original datasets and can be used for training (e.g. maximum likelihood loss). |
video_name | original video name |
Note: Not all files from previous works have the same structure
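A quick way to inspect one of these files (a minimal sketch using h5py; it assumes each video is stored as a top-level group, as in the eccv16 files):

```python
import h5py

# Open the processed SumMe dataset and print the structure of the first video
with h5py.File("dataset_summe_processed.h5", "r") as hdf:
    video_key = sorted(hdf.keys())[0]
    for name, value in hdf[video_key].items():
        print(f"{name}: shape={value.shape}")
```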
Training the network
I used Weights & Biases to track the different experiments I did (different features and splits).
- Change or choose one of the `config.json` files located in the configs/ folder.
- Run `wandb login 17d2772d85cbda79162bd975e45fdfbf3bb18911` to use wandb.
- Run the following command to train without wandb tracking:
python3 train_cross_val.py --params <CONFIGFILE_PATH>
- `CONFIGFILE_PATH`: path of the config.json file (see Hyperparameters Configuration)
You can see the Weights & Biases report here: https://wandb.ai/stevramos/sports_video_summarization
Configuration flags for the training script
Other flags for the train_cross_val.py script:
- `--no_wandb`: do not use Weights & Biases (optional, used by default)
- `--pretrained_model`: path of a pretrained model (optional)
- Using wandb without sweep:
  - `--wandb`: use Weights & Biases (required for using wandb)
  - `--run_name`: name of the run to save in wandb (optional)
  - `--run_notes`: notes of the run to save in wandb (optional)
- Using sweep:
  - `--use_sweep`: use a sweep (required for using sweep; the wandb flag is true by default)
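For example, a wandb-tracked run could be launched like this (the run name and notes are just placeholders):
python3 train_cross_val.py --params <CONFIGFILE_PATH> --wandb --run_name googlenet_resnext_flow_r3d --run_notes "feature combination experiment"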
Hyperparameters Configuration
I used the same hyperparameters as the original code, but with different combinations of feature vectors and vector sizes. You can modify the following variables in the config.json files located in the configs folder:
- To enable or disable a given feature, set the corresponding key to true or false: `googlenet`, `resnext`, `inceptionv3`, `i3d_rgb`, `i3d_flow`, `resnet3d`
- `feature_len`: 1024 (to normalize the vectors; allowed for all the pretrained models) or 2048 (to use the original vectors; cannot be used with googlenet, i3d_rgb and i3d_flow)
- `type_dataset`: the dataset on which to test the model (tvsum or summe)
- `type_setting`: canonical, aug (augmented), transfer, non_overlap_*_[aug]
The splits for these settings are given by different previous works (see the splits folder).
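The sketch below illustrates the feature-selection keys described above as a Python dictionary written to JSON; it is only an assumption of the overall shape, and the real files in configs/ also contain the remaining training hyperparameters:

```python
import json

# Hypothetical config fragment: enable GoogleNet, ResNext, I3D FLOW and ResNet3D
config = {
    "googlenet": True,
    "resnext": True,
    "inceptionv3": False,
    "i3d_rgb": False,
    "i3d_flow": True,
    "resnet3d": True,
    "feature_len": 1024,       # 1024 = normalized vectors, 2048 = original vectors
    "type_dataset": "summe",   # dataset used for testing: "summe" or "tvsum"
    "type_setting": "canonical",
}

with open("configs/my_config.json", "w") as f:
    json.dump(config, f, indent=2)
```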
Cross validation
Following prior works, I used cross-validation to evaluate performance.
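For reference, the F1 score between a binary machine summary and the annotators' summaries (the user_summary key above) can be computed roughly as follows; this is a minimal sketch, and the aggregation over annotators should be checked against the evaluation code:

```python
import numpy as np

def f1_score(machine_summary: np.ndarray, user_summary: np.ndarray) -> float:
    """Per-frame F1 between one binary machine summary and one binary user summary."""
    overlap = int((machine_summary & user_summary).sum())
    if overlap == 0:
        return 0.0
    precision = overlap / machine_summary.sum()
    recall = overlap / user_summary.sum()
    return 2 * precision * recall / (precision + recall)

# Toy example: compare against every annotator, then aggregate
machine = np.random.randint(0, 2, size=1000)
users = np.random.randint(0, 2, size=(15, 1000))  # shape (num_users, n_frames)
scores = [f1_score(machine, u) for u in users]
print("mean F1:", np.mean(scores), "max F1:", np.max(scores))
```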
Baseline model
I trained the MSVA model with the original features:
- F1 Score SumMe: 0.476
- F1 Score TVSum: 0.594


Grid Search
I trained the model 64 times with the different features extracted above to see which one contributes most. Interestingly, on both sets of videos one of the best-performing combinations uses the GoogleNet, ResNext, I3D FLOW and ResNet3D descriptors.


GoogleNet, ResNext, I3D FLOW and ResNet3D vs. GoogleNet, I3D RGB and I3D FLOW
The model with the new descriptors outperforms the baseline model on both SumMe and TVSum.
Note: the comparison in the images shown below was done on a single split.


Comparison of F1-Score with the baseline model under Canonical, Augmented and Transfer settings.
Model | SumMe Canonical | SumMe Augmented | SumMe Transfer | TVSum Canonical | TVSum Augmented | TVSum Transfer |
---|---|---|---|---|---|---|
Baseline model | 0.476 | - | - | 0.594 | - | - |
Model with new descriptors | 0.499 | 0.48 | 0.43 | 0.613 | 0.62 | 0.57 |
API
I developed a REST API (app.py) using the FastAPI framework, which provides several services:
- `summarize-video`: receives a video, saves it to Google Drive, and then summarizes it (scores between 0 and 1)
- `get-spotlight`: receives the proportion of the original length that the user wants the summary video to be
- `download-spotlight`: downloads the summary video
To run the app:
uvicorn app:app --reload --host 0.0.0.0
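Once the server is running, a request might look roughly like the sketch below; the HTTP method, exact route, and form field name are assumptions and should be checked against app.py (FastAPI's automatic docs at http://localhost:8000/docs list the real signatures):

```python
import requests

BASE_URL = "http://localhost:8000"  # default uvicorn port; adjust if needed

# Hypothetical call: upload a video for summarization (the field name "file" is an assumption)
with open("match.mp4", "rb") as f:
    response = requests.post(f"{BASE_URL}/summarize-video", files={"file": f})

print(response.status_code)
print(response.json())
```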
This API is used by the following front-end app: https://github.com/StevRamos/video_summarization_app
Acknowledgments
I would like to thank the following repositories for releasing the evaluation code / data / splits / metrics that made my research possible.