Scalable Neural Video Representations with Learnable Positional Features (NVP)
Official PyTorch implementation of "Scalable Neural Video Representations with Learnable Positional Features" (NeurIPS 2022) by Subin Kim*1, Sihyun Yu*1, Jaeho Lee2, and Jinwoo Shin1.
1KAIST, 2POSTECH
TL;DR: We propose a novel neural representation for videos that achieves the best of both worlds: high-quality encoding and compute-/parameter-efficiency at the same time.
Project Page | Paper
1. Requirements
Environments
Required packages are listed in `environment.yaml`.
Also, you should install the following packages:
conda install pytorch torchvision cudatoolkit=11.3 -c pytorch
pip install git+https://github.com/subin-kim-cv/tiny-cuda-nn/#subdirectory=bindings/torch
- This fork of tiny-cuda-nn is slightly different from the original implementation of tiny-cuda-nn.
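After installing the packages, a quick sanity check can save debugging time later. The snippet below is a minimal sketch (not part of the repository) that verifies PyTorch sees a CUDA device and that the tiny-cuda-nn bindings import correctly:

```python
# Minimal sanity check (illustrative, not part of the repository).
import torch
import tinycudann as tcnn  # installed from the fork above

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("tinycudann loaded from", tcnn.__file__)
```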
Dataset
Download the UVG-HD dataset from the following link:
Then, extract RGB sequences from the original YUV videos of UVG-HD using ffmpeg. Here, `INPUT` is the input file name, and `OUTPUT` is the directory to save the decompressed RGB frames.
ffmpeg -f rawvideo -vcodec rawvideo -s 1920x1080 -r 120 -pix_fmt yuv420p -i INPUT.yuv OUTPUT/f%05d.png
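To prepare every UVG-HD video at once, the command above can be wrapped in a small loop. The sketch below is illustrative; the directory layout (`~/data/yuv` for the downloaded `*.yuv` files, `~/data/<VideoName>` for the extracted frames) is an assumption and should be adjusted to your setup:

```python
# Batch-extract RGB frames from all downloaded UVG-HD YUV files (illustrative sketch).
import pathlib
import subprocess

yuv_dir = pathlib.Path("~/data/yuv").expanduser()   # assumed location of the *.yuv files
out_root = pathlib.Path("~/data").expanduser()      # assumed output root

for yuv in sorted(yuv_dir.glob("*.yuv")):
    out_dir = out_root / yuv.stem                    # e.g., ~/data/Jockey
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run([
        "ffmpeg", "-f", "rawvideo", "-vcodec", "rawvideo",
        "-s", "1920x1080", "-r", "120", "-pix_fmt", "yuv420p",
        "-i", str(yuv), str(out_dir / "f%05d.png"),
    ], check=True)
```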
2. Training
Run the following script with a single GPU.
CUDA_VISIBLE_DEVICES=0 python experiment_scripts/train_video.py --logging_root ./logs_nvp --experiment_name <EXPERIMENT_NAME> --dataset <DATASET> --num_frames <NUM_FRAMES> --config ./config/config_nvp_s.json
- Option `--logging_root` denotes the path to save the experiment logs.
- Option `--experiment_name` denotes the subdirectory under `--logging_root` where the log files (results, checkpoints, configuration, etc.) are saved.
- Option `--dataset` denotes the path of the RGB sequences (e.g., `~/data/Jockey`).
- Option `--num_frames` denotes the number of frames to reconstruct (300 for the ShakeNDry video and 600 for the other videos in UVG-HD).
- To reconstruct videos with 300 frames, change the values of `t_resolution` in the configuration file to 300 (a minimal sketch follows below).
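For the 300-frame case, the `t_resolution` entries can be edited by hand or with a short script. The sketch below is a hedged helper; where exactly `t_resolution` sits inside `config_nvp_s.json` is not assumed, so it walks the whole JSON tree:

```python
# Set every "t_resolution" entry in the config to 300 (illustrative sketch).
import json

path = "./config/config_nvp_s.json"
with open(path) as f:
    cfg = json.load(f)

def set_t_resolution(node, value=300):
    """Recursively overwrite every 't_resolution' key, wherever it appears."""
    if isinstance(node, dict):
        for key, child in node.items():
            if key == "t_resolution":
                node[key] = value
            else:
                set_t_resolution(child, value)
    elif isinstance(node, list):
        for child in node:
            set_t_resolution(child, value)

set_t_resolution(cfg)
with open(path, "w") as f:
    json.dump(cfg, f, indent=2)
```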
3. Evaluation
Evaluation without compression of parameters (i.e., quantization only).
CUDA_VISIBLE_DEVICES=0 python experiment_scripts/eval.py --logging_root ./logs_nvp --experiment_name <EXPERIMENT_NAME> --dataset <DATASET> --num_frames <NUM_FRAMES> --config ./logs_nvp/<EXPERIMENT_NAME>/config_nvp_s.json
- Option `--save` denotes whether to save the reconstructed frames.
- One can specify the option `--s_interp` for video super-resolution results; it denotes the super-resolution scale (e.g., 8).
- One can specify the option `--t_interp` for video frame interpolation results; it denotes the temporal interpolation scale (e.g., 8).
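If frames are written out with `--save`, a quick PSNR spot check against the ground-truth frames can be done outside the evaluation script. The sketch below is illustrative only; the reconstruction directory name is a hypothetical placeholder:

```python
# Spot-check PSNR between ground-truth and reconstructed frames (illustrative sketch).
import pathlib

import numpy as np
from PIL import Image

def psnr(a: np.ndarray, b: np.ndarray) -> float:
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)

gt_dir = pathlib.Path("~/data/Jockey").expanduser()          # ground-truth frames
recon_dir = pathlib.Path("./logs_nvp/my_experiment/recon")   # hypothetical output directory

scores = []
for gt_path in sorted(gt_dir.glob("f*.png")):
    recon_path = recon_dir / gt_path.name
    if recon_path.exists():
        gt = np.asarray(Image.open(gt_path))
        recon = np.asarray(Image.open(recon_path))
        scores.append(psnr(gt, recon))

print(f"mean PSNR over {len(scores)} frames: {np.mean(scores):.2f} dB")
```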
Evaluation with compression of parameters using well-known image and video codecs.
- Save the quantized parameters.

CUDA_VISIBLE_DEVICES=0 python experiment_scripts/compression.py --logging_root ./logs_nvp --experiment_name <EXPERIMENT_NAME> --config ./logs_nvp/<EXPERIMENT_NAME>/config_nvp_s.json

- Compress the saved sparse positional image-/video-like features using codecs.
  - Execute compression.ipynb.
  - Please change the logging_root and experiment_name in compression.ipynb appropriately.
  - One can change `qscale`, `crf`, and `framerate`, which control the compression ratio of the sparse positional features (an illustrative sketch of this trade-off follows after these steps).
    - `qscale` ranges from 1 to 31, where larger values mean worse quality (2~5 recommended).
    - `crf` ranges from 0 to 51, where larger values mean worse quality (20~25 recommended).
    - `framerate` (25 or 40 recommended).
- Evaluation with the compressed parameters.
CUDA_VISIBLE_DEVICES=0 python experiment_scripts/eval_compression.py --logging_root ./logs_nvp --experiment_name <EXPERIMENT_NAME> --dataset <DATASET> --num_frames <NUM_FRAMES> --config ./logs_nvp/<EXPERIMENT_NAME>/config_nvp_s.json --qscale 2 3 3 --crf 21 --framerate 25
- Option `--save` denotes whether to save the reconstructed frames.
- Please specify the options `--qscale`, `--crf`, and `--framerate` with the same values used in compression.ipynb.
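As referenced above, here is an illustrative sketch of how `qscale`, `crf`, and `framerate` typically map onto ffmpeg's image and video codecs. The actual commands applied to the sparse positional features live in compression.ipynb; the file names below are placeholders:

```python
# Illustrative only: quality/size trade-off of qscale (image codec) and crf/framerate (video codec).
import subprocess

# JPEG-style compression of an image-like feature map (qscale 1-31, lower = better quality).
subprocess.run(["ffmpeg", "-y", "-i", "feature_image.png",
                "-qscale:v", "3", "feature_image.jpg"], check=True)

# HEVC compression of a video-like feature (crf 0-51, lower = better quality).
subprocess.run(["ffmpeg", "-y", "-framerate", "25", "-i", "feature_video_f%05d.png",
                "-c:v", "libx265", "-crf", "21", "feature_video.mp4"], check=True)
```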
4. Results
Reconstructed video results of NVP on UVG-HD, as well as on other 4K/long/temporally-dynamic videos, are available on the project page.
Our model achieves the following performance on UVG-HD with a single NVIDIA V100 32GB GPU:
| Encoding Time | BPP | PSNR (↑) | FLIP (↓) | LPIPS (↓) |
|---|---|---|---|---|
| ~5 minutes | 0.901 | 34.57 $\pm$ 2.62 | 0.075 $\pm$ 0.021 | 0.190 $\pm$ 0.100 |
| ~10 minutes | 0.901 | 35.79 $\pm$ 2.31 | 0.065 $\pm$ 0.016 | 0.160 $\pm$ 0.098 |
| ~1 hour | 0.901 | 37.61 $\pm$ 2.20 | 0.052 $\pm$ 0.011 | 0.145 $\pm$ 0.106 |
| ~8 hours | 0.210 | 36.46 $\pm$ 2.18 | 0.067 $\pm$ 0.017 | 0.135 $\pm$ 0.083 |
- The reported values are averaged over the Beauty, Bosphorus, Honeybee, Jockey, ReadySetGo, ShakeNDry, and Yachtride videos in UVG-HD and are measured using the LPIPS and FLIP repositories.
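For reference, BPP here is read as the usual bits-per-pixel ratio: total compressed bits divided by the number of video pixels (this definition and the directory name below are assumptions). A minimal sketch of the arithmetic for a 1920x1080, 600-frame video:

```python
# Compute bits per pixel (BPP) from the on-disk size of the compressed representation (illustrative sketch).
import pathlib

compressed_dir = pathlib.Path("./logs_nvp/my_experiment/compressed")  # hypothetical path
total_bits = 8 * sum(p.stat().st_size for p in compressed_dir.rglob("*") if p.is_file())

width, height, num_frames = 1920, 1080, 600
print(f"BPP: {total_bits / (width * height * num_frames):.3f}")
```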
One can download the pretrained checkpoints from the following link.
Citation
@inproceedings{
kim2022scalable,
title={Scalable Neural Video Representations with Learnable Positional Features},
author={Kim, Subin and Yu, Sihyun and Lee, Jaeho and Shin, Jinwoo},
booktitle={Advances in Neural Information Processing Systems},
year={2022},
}
References
We used code from the following repositories: SIREN, Modulation, and tiny-cuda-nn.