2022: state of video IO in torchvision
There have been many developments over the last couple of months, with a big push in 2022H1 to wrap things up (mainly by @prabhat00155 and @datumbox). Here I'll try to summarize the current state of things.
Features (current, in-dev)
At the moment, torchvision has two APIs one can use for video reading:

- `read_video` video API (stable) -- this is a legacy video-reading solution that we're looking to move away from. However, due to external use, we continue to support and patch it. It supports the `pyav` and `video_reader` backends.
- `VideoReader` fine-grained API (prototype, #2660) -- we're moving towards this as the goal for 2022. The API itself is finished; however, due to issues with the various backends it still remains unused (see the installation issue below). It supports the `video_reader` and `GPU` backends. A usage sketch of both APIs follows below.
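For illustration, a minimal sketch of how the two APIs are used (hedged: `video.mp4` is a placeholder, and the fine-grained reader requires a backend that is actually available in your build):

```python
import torchvision

# Stable API: decodes the whole clip into memory in one call.
frames, audio, info = torchvision.io.read_video("video.mp4", pts_unit="sec")

# Fine-grained API: stream frames one at a time and seek by timestamp.
reader = torchvision.io.VideoReader("video.mp4", "video")
reader.seek(2.0)  # jump to the 2-second mark
frame = next(reader)  # dict with "data" (frame tensor) and "pts" (seconds)
```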
Furthermore, we also have three backends for video reading (selected globally at runtime; see the sketch after this list):

- `pyav` -- a naive extension of PyAV's capabilities.
- `video_reader` -- our own C++ implementation that allows video IO to be torchscriptable. If the JIT requirement is dropped, it might be deprecated despite minor speed improvements over `pyav`.
- `GPU` -- highly experimental and not yet properly tested. Maintenance and further development will depend on demand from customers and the community.
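A minimal sketch of backend selection for the stable API (hedged: `video_reader` is only selectable in builds compiled from source with ffmpeg):

```python
import torchvision

# Choose the decoding backend for read_video; "pyav" is the default.
torchvision.set_video_backend("pyav")  # or "video_reader" if compiled in
print(torchvision.get_video_backend())
```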
The overall goal in 2022 is to migrate all APIs (and prototype datasets) to the `VideoReader` API, and hopefully deprecate `read_video` as much as possible.
Related tasks include (will be updated):
- [ ] Datasets to use new API #5250
- [ ] Reference scripts to use new API
Currently known issues and enhancements needed
Probably the biggest issue plaguing video is installation (see #4260 for some reference). If a user wants the ffmpeg or GPU backends and support for the `VideoReader` API, they need to install torchvision from source, and in the case of GPU also download proprietary drivers from NVIDIA. This process should be properly documented until a better/alternative solution is found.
- [ ] Add proper build documentation #3460
Due to the lack of users, real-world bug reports have been scarce. Here is a (non-exhaustive) list of known issues and their progress, sorted by topic, with additional comments in italics where applicable.
General
- [ ] Change CPU decoder output frames to use ITU709 colour space #5245 -- done, but not merged
- [x] Assertion error during dataset creation #4839 #4112 #4357 #2184 #1884
- [ ] Mismatch in audio frames returned by pyav and video reader #3986 -- needs revisiting based on latest improvements and bugfixes
`video_reader` backend and `VideoReader` API
- [ ] new video reading API crash #5419 (can't reproduce -- help welcome)
- [ ] read_video_from_file() causes seg fault with Python 3.9 #4430 -- flaky, can't reproduce on all machines
- [ ] video_reader test crashes on Windows #4429
- [ ] Black band at certain videos #3534 -- suspected issue in FFMPEG, needs revisiting
GPU decoding issues and enhancements (note: these are low-pri due to a lack of developers and road-map changes, so we'll be relatively slow in fixing these):
- [ ] GPU VideoReader not working #5702
- [ ] video classification experiments using GPU decoder #5252
- [ ] video classification reference script with GPU decoder support #5251
- [ ] GPU decoder refactoring #5148
- [ ] Run GPU decoding tests in CI #5147
- [ ] Support reading video from memory #5142
- [ ] Return pts per frame after video decoding on GPU #5140
Archived feature requests
- [ ] FFmpeg-based rescaling and frame rate #3016 -- enhancement we've put on pause due to low adoption
- [Feat] Camera Stream API proposal #2920
- Contribution: select classes in UCF101 dataset #1791
cc @datumbox for visibility
Hello. Apologies if this is the wrong place to post feedback on video functions.

One thing that is important for professional use cases (read: working with master video files) is PTS timing that maintains the rational integer representation of sample-based timing. That is, libav provides access to the stream's time base, as well as the presentation time stamp as an integer numerator.

Another important aspect is the ability to introspect rich container metadata, as well as timecode data. This is important for correspondence with sidecar (or embedded) text tracks like closed captioning / subtitles.
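For concreteness, a rough sketch of how PyAV exposes both of these today (`some_file.mp4` is a placeholder path):

```python
import av

container = av.open("some_file.mp4")  # placeholder path
stream = container.streams.video[0]

# The time base is an exact rational (a fractions.Fraction), not a float.
print(stream.time_base)  # e.g. Fraction(1, 12800)

# Container- and stream-level metadata are exposed as dicts.
print(container.metadata)
print(stream.metadata)

for frame in container.decode(stream):
    # pts is an integer numerator; exact seconds = pts * time_base.
    print(frame.pts, frame.pts * stream.time_base)
    break
```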
WRT PyAV -- it appears that with some specific code invocations, a properly compiled FFmpeg, and a PyAV install that doesn't overwrite the FFmpeg library (`pip install av --no-binary av`), GPU decode is possible.

See https://github.com/PyAV-Org/PyAV/issues/451 for nv_dec decoding, and https://github.com/PyAV-Org/PyAV/issues/596 for nv_enc encoding with a 10x speed-up.
I think the only missing piece is direct GPU decode to a tensor without CPU read back.
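As an illustration of that CPU read back (a sketch, assuming a CUDA device and a placeholder file):

```python
import av
import torch

container = av.open("some_file.mp4")  # placeholder path
for frame in container.decode(video=0):
    # Today's route: decode -> numpy array in host memory -> copy to GPU.
    # A direct decode-into-CUDA-tensor path would skip this round trip.
    tensor = torch.from_numpy(frame.to_ndarray(format="rgb24")).cuda()
    print(tensor.shape, tensor.device)
    break
```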
It seems like, to some degree, the GPU expertise demonstrated by the PyTorch developers might be better spent helping support PyAV directly, so the wider community can reap the benefits of a HW-accelerated PyAV with direct-to-GPU decode, and PyTorch gains the benefit of using PyAV, which can supply the above 'pro video' access to text, metadata, audio, and video streams, as well as fall back to software decode if needed.

Apologies if this is long-winded or misplaced. I'm excited for a functional solution to native high-performance video infrastructure in DL tooling.

Thanks.
FWIW, with a properly set up FFmpeg install (I used the jrottenberg/ffmpeg 4.4.2-nvidia2004 base container, installed Python, and did `pip install av --no-binary av`), I was able to see a 6x performance increase when choosing `h264_cuvid` over `h264`:

h264: Took 0:00:18.961413
h264_cuvid: Took 0:00:02.938973
in this code:

```python
import av
from datetime import datetime

video_path = "some_file.mp4"

video = av.open(video_path)
target_stream = video.streams.video[0]
print(target_stream)

# Build a standalone decoder context; swap in 'h264_cuvid' for GPU decode.
ctx = av.Codec('h264', 'r').create()
# ctx = av.Codec('h264_cuvid', 'r').create()
ctx.extradata = target_stream.codec_context.extradata

start_time = datetime.now()
for packet in video.demux(target_stream):
    for frame in ctx.decode(packet):
        print(frame)
end_time = datetime.now()

total_time = end_time - start_time
print("Took", total_time)
```
Media Info on the file:

```
Format : MPEG-4
Format profile : Base Media
Codec ID : isom (isom/iso2/avc1/mp41)
File size : 238 MiB
Duration : 37 s 560 ms
Overall bit rate : 53.1 Mb/s
Writing application : Lavf58.12.100

Video
ID : 1
Format : AVC
Format/Info : Advanced Video Codec
Format profile : High@L4
Format settings : CABAC / 4 Ref Frames
Format settings, CABAC : Yes
Format settings, Reference frames : 4 frames
Codec ID : avc1
Codec ID/Info : Advanced Video Coding
Duration : 37 s 560 ms
Bit rate : 53.1 Mb/s
Width : 1 920 pixels
Height : 1 080 pixels
Display aspect ratio : 2.40:1
Original display aspect ratio : 2.40:1
Frame rate mode : Constant
Frame rate : 25.000 FPS
Color space : YUV
Chroma subsampling : 4:2:0
Bit depth : 8 bits
Scan type : Progressive
Bits/(Pixel*Frame) : 1.024
Stream size : 238 MiB (100%)
Writing library : x264 core 155 r2901 7d0ff22
Encoding settings : cabac=1 / ref=3 / deblock=1:0:0 / analyse=0x3:0x113 / me=hex / subme=7 / psy=1 / psy_rd=1.00:0.00 / mixed_ref=1 / me_range=16 / chroma_me=1 / trellis=1 / 8x8dct=1 / cqm=0 / deadzone=21,11 / fast_pskip=1 / chroma_qp_offset=-2 / threads=6 / lookahead_threads=1 / sliced_threads=0 / nr=0 / decimate=1 / interlaced=0 / bluray_compat=0 / constrained_intra=0 / bframes=3 / b_pyramid=2 / b_adapt=1 / b_bias=0 / direct=1 / weightb=1 / open_gop=0 / weightp=2 / keyint=250 / keyint_min=25 / scenecut=40 / intra_refresh=0 / rc_lookahead=40 / rc=crf / mbtree=1 / crf=12.0 / qcomp=0.60 / qpmin=0 / qpmax=69 / qpstep=4 / ip_ratio=1.40 / aq=1:1.00
Codec configuration box : avcC

Other
ID : 2
Type : Time code
Format : QuickTime TC
Duration : 37 s 560 ms
Frame rate : 25.000 FPS
Time code of first frame : 14:52:42:03
Time code, striped : Yes
Language : English
Default : No
```
This was on a 3090, with NVIDIA-SMI 510.60.02, Driver Version 510.60.02, CUDA Version 11.6.
It also seems that there's overlap between VideoReader and torchaudio's StreamReader, which was added in the latest release: https://pytorch.org/audio/0.12.0/io.html#streamreader. This StreamReader even boasts GPU-based video decoding. IMO it's quite important that there be no duplication and no different APIs for the same thing (especially when the goals are so similar). Maybe factor these out into some common repo/library. If very much wanted, the torchvision and torchaudio wheels could register their own handlers / plugins into the common IO layer. Or they could even ship their own compiled libraries, but at least the source code / API should be unified. Maybe all image/audio/video IO could be moved to some torchio module.
StreamReader probably also comes with its own quirks and problems around ffmpeg / from-source compilation.

Factoring the ffmpeg-related stuff into its own package would also simplify testing / building of the simpler parts of torchvision/torchaudio.
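For reference, a rough sketch of the torchaudio 0.12 StreamReader usage in question (hedged: `some_file.mp4` is a placeholder, and GPU decode assumes an FFmpeg build with NVDEC support plus a CUDA device):

```python
from torchaudio.io import StreamReader

reader = StreamReader("some_file.mp4")  # placeholder path
# Request decoded video chunks; decoder="h264_cuvid" selects NVDEC,
# and hw_accel keeps the decoded frames in GPU memory.
reader.add_basic_video_stream(
    frames_per_chunk=16,
    decoder="h264_cuvid",
    hw_accel="cuda:0",
)
for (chunk,) in reader.stream():
    print(chunk.shape, chunk.device)  # (time, channel, height, width) on cuda:0
    break
```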
@soumith