2022: state of video IO in torchvision
There have been many developments over the last couple of months, with a big push in 2022H1 to wrap things up (mainly by @prabhat00155 and @datumbox). Here I'll try to summarize the current state of things.
Features (current, in-dev)
At the moment, torchvision has two APIs one can use for video reading:

- `read_video` video API (stable) -- this is a legacy video-reading solution that we're looking to move away from. However, due to external use, we continue to support and patch it. It supports the `pyav` and `video_reader` backends.
- `VideoReader` fine-grained API (prototype, #2660) -- we're moving towards this as the goal for 2022. The API itself is finished; however, due to issues with the various backends it still remains unused (see the installation issue below). It supports the `video_reader` and `GPU` backends. A usage sketch of both APIs follows below.
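For illustration, a minimal sketch of how the two APIs are used (hedged: `video.mp4` is a placeholder, and the fine-grained reader requires a backend that is actually available in your build):

```python
import torchvision

# Stable API: decodes the whole clip into memory in one call.
frames, audio, info = torchvision.io.read_video("video.mp4", pts_unit="sec")

# Fine-grained API: stream frames one at a time and seek by timestamp.
reader = torchvision.io.VideoReader("video.mp4", "video")
reader.seek(2.0)  # jump to the 2-second mark
frame = next(reader)  # dict with "data" (frame tensor) and "pts" (seconds)
```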
Furthermore, we also have three backends for video reading (selected globally at runtime; see the sketch after this list):

- `pyav` -- a naive extension of PyAV's capabilities.
- `video_reader` -- our own C++ implementation that allows video IO to be torchscriptable. If the JIT requirement is dropped, it might be deprecated despite minor speed improvements over `pyav`.
- `GPU` -- highly experimental and not yet properly tested. Maintenance and further development will depend on demand from customers and the community.
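A minimal sketch of backend selection for the stable API (hedged: `video_reader` is only selectable in builds compiled from source with ffmpeg):

```python
import torchvision

# Choose the decoding backend for read_video; "pyav" is the default.
torchvision.set_video_backend("pyav")  # or "video_reader" if compiled in
print(torchvision.get_video_backend())
```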
The overall goal in 2022 is to migrate all APIs (and prototype datasets) to the `VideoReader` API, and hopefully deprecate `read_video` as much as possible.
Related tasks include (will be updated):
- [ ] Datasets to use new API #5250
- [ ] Reference scripts to use new API
Currently known issues and enhancements needed
Probably the biggest issue plaguing video is installation (see #4260 for some reference). If a user wants the ffmpeg or GPU backends and support for the `VideoReader` API, they need to install torchvision from source, and in the case of GPU also download proprietary drivers from NVIDIA. This process should be properly documented until a better/alternative solution is found.
- [ ] Add proper build documentation #3460
Due to the lack of users, real-world bug reports have been scarce. Here is a (non-exhaustive) list of known issues and their progress, sorted by topic, with additional comments in italics where applicable.
General
- [ ] Change CPU decoder output frames to use ITU709 colour space #5245 -- done, but not merged
- [x] Assertion error during dataset creation #4839 #4112 #4357 #2184 #1884
- [ ] Mismatch in audio frames returned by pyav and video reader #3986 -- needs revisiting based on latest improvements and bugfixes
`video_reader` backend and `VideoReader` API
- [ ] new video reading API crash #5419 (can't reproduce -- help welcome)
- [ ] read_video_from_file() causes seg fault with Python 3.9 #4430 -- flaky, can't reproduce on all machines
- [ ] video_reader test crashes on Windows #4429
- [ ] Black band at certain videos #3534 -- suspected issue in FFMPEG, needs revisiting
GPU decoding issues and enhancements (note: these are low-pri due to a lack of developers and road-map changes, so we'll be relatively slow in fixing these):
- [ ] GPU VideoReader not working #5702
- [ ] video classification experiments using GPU decoder #5252
- [ ] video classification reference script with GPU decoder support #5251
- [ ] GPU decoder refactoring #5148
- [ ] Run GPU decoding tests in CI #5147
- [ ] Support reading video from memory #5142
- [ ] Return pts per frame after video decoding on GPU #5140
Archived feature requests
- [ ] FFmpeg-based rescaling and frame rate #3016 -- enhancement we've put on pause due to low adoption
- [Feat] Camera Stream API proposal #2920
- Contribution: select classes in UCF101 dataset #1791
cc @datumbox for visibility
Hello. Apologies if this is the wrong place to post feedback on video functions.

One thing that is important for professional use cases (read: working with master video files) is PTS timing that maintains the rational integer representation of sample-based timing. That is, libav provides access to the stream's time base, as well as the presentation time stamp as an integer numerator.

Another important aspect is the ability to introspect rich container metadata, as well as timecode data. This is important for correspondence with sidecar (or embedded) text tracks like closed captioning / subtitles.
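For concreteness, a rough sketch of how PyAV exposes both of these today (`some_file.mp4` is a placeholder path):

```python
import av

container = av.open("some_file.mp4")  # placeholder path
stream = container.streams.video[0]

# The time base is an exact rational (a fractions.Fraction), not a float.
print(stream.time_base)  # e.g. Fraction(1, 12800)

# Container- and stream-level metadata are exposed as dicts.
print(container.metadata)
print(stream.metadata)

for frame in container.decode(stream):
    # pts is an integer numerator; exact seconds = pts * time_base.
    print(frame.pts, frame.pts * stream.time_base)
    break
```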
WRT PyAV -- it appears that with some specific code invocations, a properly compiled FFmpeg, and a PyAV install that doesn't overwrite the FFmpeg library (`pip install av --no-binary av`), GPU decode is possible.

See https://github.com/PyAV-Org/PyAV/issues/451 for nv_dec decoding, and https://github.com/PyAV-Org/PyAV/issues/596 for nv_enc encoding with a 10x speed-up.
I think the only missing piece is direct GPU decode to a tensor without CPU read back.
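As an illustration of that CPU read back (a sketch, assuming a CUDA device and a placeholder file):

```python
import av
import torch

container = av.open("some_file.mp4")  # placeholder path
for frame in container.decode(video=0):
    # Today's route: decode -> numpy array in host memory -> copy to GPU.
    # A direct decode-into-CUDA-tensor path would skip this round trip.
    tensor = torch.from_numpy(frame.to_ndarray(format="rgb24")).cuda()
    print(tensor.shape, tensor.device)
    break
```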
It seems like, to some degree, the GPU expertise demonstrated by the PyTorch developers might be better spent helping support PyAV directly, so the wider community can reap the benefits of a HW-accelerated PyAV with direct-to-GPU decode, and PyTorch gains the benefit of using PyAV, which can supply the above 'pro video' access to text, metadata, audio, and video streams, as well as fall back to software decode if needed.

Apologies if this is long-winded or misplaced. I'm excited for a functional solution to native high-performance video infrastructure in DL tooling.

Thanks.
FWIW, with a properly set up FFmpeg install (I used the jrottenberg/ffmpeg 4.4.2-nvidia2004 base container, installed Python, and did `pip install av --no-binary av`), I was able to see a 6x performance increase when choosing `h264_cuvid` over `h264`:

h264: Took 0:00:18.961413
h264_cuvid: Took 0:00:02.938973
in this code:

```python
import av
from datetime import datetime

video_path = "some_file.mp4"

video = av.open(video_path)
target_stream = video.streams.video[0]
print(target_stream)

# Build a standalone decoder context; swap in 'h264_cuvid' for GPU decode.
ctx = av.Codec('h264', 'r').create()
# ctx = av.Codec('h264_cuvid', 'r').create()
ctx.extradata = target_stream.codec_context.extradata

start_time = datetime.now()
for packet in video.demux(target_stream):
    for frame in ctx.decode(packet):
        print(frame)
end_time = datetime.now()

total_time = end_time - start_time
print("Took", total_time)
```
Media Info on the file:

```
Format : MPEG-4
Format profile : Base Media
Codec ID : isom (isom/iso2/avc1/mp41)
File size : 238 MiB
Duration : 37 s 560 ms
Overall bit rate : 53.1 Mb/s
Writing application : Lavf58.12.100

Video
ID : 1
Format : AVC
Format/Info : Advanced Video Codec
Format profile : High@L4
Format settings : CABAC / 4 Ref Frames
Format settings, CABAC : Yes
Format settings, Reference frames : 4 frames
Codec ID : avc1
Codec ID/Info : Advanced Video Coding
Duration : 37 s 560 ms
Bit rate : 53.1 Mb/s
Width : 1 920 pixels
Height : 1 080 pixels
Display aspect ratio : 2.40:1
Original display aspect ratio : 2.40:1
Frame rate mode : Constant
Frame rate : 25.000 FPS
Color space : YUV
Chroma subsampling : 4:2:0
Bit depth : 8 bits
Scan type : Progressive
Bits/(Pixel*Frame) : 1.024
Stream size : 238 MiB (100%)
Writing library : x264 core 155 r2901 7d0ff22
Encoding settings : cabac=1 / ref=3 / deblock=1:0:0 / analyse=0x3:0x113 / me=hex / subme=7 / psy=1 / psy_rd=1.00:0.00 / mixed_ref=1 / me_range=16 / chroma_me=1 / trellis=1 / 8x8dct=1 / cqm=0 / deadzone=21,11 / fast_pskip=1 / chroma_qp_offset=-2 / threads=6 / lookahead_threads=1 / sliced_threads=0 / nr=0 / decimate=1 / interlaced=0 / bluray_compat=0 / constrained_intra=0 / bframes=3 / b_pyramid=2 / b_adapt=1 / b_bias=0 / direct=1 / weightb=1 / open_gop=0 / weightp=2 / keyint=250 / keyint_min=25 / scenecut=40 / intra_refresh=0 / rc_lookahead=40 / rc=crf / mbtree=1 / crf=12.0 / qcomp=0.60 / qpmin=0 / qpmax=69 / qpstep=4 / ip_ratio=1.40 / aq=1:1.00
Codec configuration box : avcC

Other
ID : 2
Type : Time code
Format : QuickTime TC
Duration : 37 s 560 ms
Frame rate : 25.000 FPS
Time code of first frame : 14:52:42:03
Time code, striped : Yes
Language : English
Default : No
```
This was on a 3090, with NVIDIA-SMI 510.60.02, Driver Version 510.60.02, CUDA Version 11.6.
It also seems that there's overlap between VideoReader and torchaudio's StreamReader, which was added in the latest release: https://pytorch.org/audio/0.12.0/io.html#streamreader. This StreamReader even boasts GPU-based video decoding. IMO it's quite important that there be no duplication and no different APIs for the same thing (especially when the goals are so similar). Maybe factor these out into some common repo/library. If very much wanted, the torchvision and torchaudio wheels could register their own handlers / plugins into the common IO layer. Or they could even ship their own compiled libraries, but at least the source code / API should be unified. Maybe all image/audio/video IO could be moved to some torchio module.
StreamReader probably also comes with its own quirks and problems around ffmpeg / from-source compilation.

Factoring the ffmpeg-related stuff into its own package would also simplify testing / building of the simpler parts of torchvision/torchaudio.
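For reference, a rough sketch of the torchaudio 0.12 StreamReader usage in question (hedged: `some_file.mp4` is a placeholder, and GPU decode assumes an FFmpeg build with NVDEC support plus a CUDA device):

```python
from torchaudio.io import StreamReader

reader = StreamReader("some_file.mp4")  # placeholder path
# Request decoded video chunks; decoder="h264_cuvid" selects NVDEC,
# and hw_accel keeps the decoded frames in GPU memory.
reader.add_basic_video_stream(
    frames_per_chunk=16,
    decoder="h264_cuvid",
    hw_accel="cuda:0",
)
for (chunk,) in reader.stream():
    print(chunk.shape, chunk.device)  # (time, channel, height, width) on cuda:0
    break
```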
@soumith