What video decoding speed should I expect from DALI?
Hi, I did a few simple experiments and I had the feeling that I am getting a lower frame rate than expected. Can you help me understand if I am doing things properly? Here is some info:
- Docker image built on top of nvcr.io/nvidia/pytorch:22.11-py3 (hence nvidia-dali-cuda110==1.18.0)
- GPU: Tesla V100-SXM2
- Video specs: 1920x1080@30fps, yuv420p, libx264, ~1.5Mbps (created with ffmpeg -i .... output.mp4 with default settings)
This is the code I am running:
import time
from pathlib import Path
from typing import Union

import nvidia.dali.fn as fn
from nvidia.dali import pipeline_def
from nvidia.dali.plugin.pytorch import DALIGenericIterator
from nvidia.dali.tensors import TensorListGPU
from tqdm import tqdm


def profile_dali_video_reader(profile_video_path: Union[Path, str]) -> None:
    @pipeline_def
    def video_pipeline(
        video_path: str,
        sequence_length: int = 1,
        step: int = -1,
    ) -> TensorListGPU:
        frames = fn.readers.video(
            name="reader",
            filenames=f"{video_path}",
            sequence_length=sequence_length,
            step=step,
            initial_fill=16,
            device="gpu",
        )
        return frames

    iterator = DALIGenericIterator(
        [
            video_pipeline(
                video_path=profile_video_path, batch_size=4, device_id=0, num_threads=1
            )
        ],
        ["frames"],
        reader_name="reader",
    )

    t_start = time.monotonic()
    for _ in tqdm(iterator):
        pass
    t_end = time.monotonic()
    # The test video has 1000 frames, hence the constant in the FPS computation.
    print(f"Total time: {t_end - t_start:.2f}sec ({1000 / (t_end - t_start):.2f}FPS)")


if __name__ == "__main__":
    profile_dali_video_reader("tests/files/1000frames1080p.mp4")
Output is:
Total time: 203.96sec (4.90FPS)
Is it reasonable? Am I doing something wrong?
One thing that I noticed is that the only "boosting parameter" is the sequence_length. In fact, pushing that value to 20, for example, I get:
Total time: 11.23sec (89.08FPS) # (batch=1, seq=20)
On the contrary, swapping that value with the batch size (still in a random_shuffle=False context) doesn't help:
Total time: 202.69sec (4.93FPS) # (batch=20, seq=1)
Other parameters (like num_threads or initial_fill) have negligible impact.
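A minimal sketch of how such a sweep could be scripted (it assumes video_pipeline from the snippet above is lifted to module scope so it can be reused; the parameter values are just examples):

import time
from nvidia.dali.plugin.pytorch import DALIGenericIterator

# Sweep sequence_length while keeping batch_size=1.
for seq_len in (1, 5, 10, 20):
    iterator = DALIGenericIterator(
        [video_pipeline(
            video_path="tests/files/1000frames1080p.mp4",
            sequence_length=seq_len,
            batch_size=1, device_id=0, num_threads=1,
        )],
        ["frames"],
        reader_name="reader",
    )
    t_start = time.monotonic()
    for _ in iterator:
        pass
    elapsed = time.monotonic() - t_start
    print(f"seq={seq_len}: {elapsed:.2f}sec ({1000 / elapsed:.2f}FPS)")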
Hi @elmuz,
The DALI video reader, in order to return the desired batch of sequences, needs to:
- seek to the nearest keyframe preceding the first frame of the requested sequence
- decode from it until the last frame of the sequence
The bigger the distance between keyframes, the more frames need to be decoded and discarded before the first desired frame is reached. Also, the bigger the stride, the more frames need to be decoded (and again discarded) in total to obtain the desired sequence. When you increase sequence_length, DALI needs to seek less often (which is good). In some cases (with some containers) seeking into the middle of the video stream is not supported and the decoder simply has to start from the beginning.
Changing the batch size should not affect the decoding speed. Also, the decoding is implemented fully in hardware and it doesn't support multithreading. So your results are consistent with the way the decoder interacts with the video files.
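To make the overhead concrete, here is an illustrative back-of-envelope sketch (not a DALI API; it assumes libx264's default keyframe interval of 250 frames):

def frames_decoded_per_sequence(keyframe_interval: int, sequence_length: int, stride: int = 1) -> int:
    """Rough upper bound on how many frames the decoder must touch to return one sequence.

    Up to keyframe_interval - 1 frames may be decoded and discarded before the first
    requested frame, plus every frame spanned by the sequence itself.
    """
    spanned = (sequence_length - 1) * stride + 1
    return (keyframe_interval - 1) + spanned

# With libx264's default GOP of 250 frames, a 1-frame sequence can cost up to ~250 decodes,
# while a 20-frame sequence amortizes the same seek cost over 20 returned frames.
print(frames_decoded_per_sequence(250, sequence_length=1))   # 250
print(frames_decoded_per_sequence(250, sequence_length=20))  # 269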
Hi @JanuszL, thanks for the message. Honestly, this performance wouldn't justify abandoning a frame-extraction preprocessing step in favor of HW decoding on the fly: sure, there's a waste of disk space and a waste of I/O operations, but the throughput is higher.
I guess the real "redundancy" is the frame seek, as you mentioned. In fact, there's no difference between shuffling or not: even in the case of linear decoding, the seek operation at every batch creation takes a lot of time.
- Do you have any recommendation on how I could speed up the seeking operation? You mentioned "some containers...": what should I check?
- Is there any plan to improve this? E.g. some frames could be cached when step=1 but sequence_length > 1...
- I see there's an experimental video decoder (which relies on FFmpeg). Do you think it is faster? What's the roadmap on this? Unifying the two readers at some point?
Hi @elmuz,
Let me answer your questions one by one:
- You can try the AVI container; it has a built-in index.
- You can check the -keyint parameter to reduce the distance between keyframes when you encode the video (see the sketch right after this list). Also, caching may be expensive when working with multiple video files in parallel: storing decoded FHD, or even QHD, frames is memory expensive.
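For example, a minimal re-encoding sketch (the paths are placeholders; ffmpeg's -g option controls the keyframe interval / GOP size for libx264, here one keyframe per second at 30 fps):

import subprocess

# Re-encode so that a keyframe appears roughly every 30 frames, which makes
# seeking into the stream much cheaper for the video reader.
subprocess.run(
    ["ffmpeg", "-i", "input.mp4", "-c:v", "libx264", "-g", "30", "output_keyint30.mp4"],
    check=True,
)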
As you noticed, we just added an experimental video reader/decoder that can alleviate some of the mentioned problems. For example, experimental.decoders.video can decode the whole video in one go. It should be fast, but longer videos won't fit into memory. In the meantime, we are working on a streaming version of it where you can decode videos in pieces (but shuffling won't work). We will keep you posted.
I see, thanks. It seems the experimental decoder is more similar to the torchvision.io.read_video function (with the same memory limitation). Please keep me posted. I will do the same in case I find something too.
As I understand it, your use case is training, where you want to randomly sample the video. Or is it inference, where you want to read it sequentially?
Ideally both. I am working on synthetic lip sync and (depending on the models involved) during training I randomly sample one or a few consecutive frames, randomly chosen from different "talking" videos. This part is intrinsically expensive, agreed.
During inference, or simply for metric evaluation, I need to traverse a video and feed it to an estimator. Again, depending on the use case it can be a frame-by-frame operation (e.g. comparing synthetic frames against original ones) or I might need to traverse it using a sliding-window approach if the model is built that way.
In the meantime I wrote some sort of wrapper for my PyTorch Lightning project. It basically creates the sliding-window approach by unfolding a linear tensor. So, if for example we want to obtain:
[ # batch of 4 elements, each being a window of 5 frames
[0, 1, 2, 3, 4], # these are frame numbers
[1, 2, 3, 4, 5],
[2, 3, 4, 5, 6],
[3, 4, 5, 6, 7]
] # ...asking DALI for a batch of video frames of shape [4, 5, 1920, 1080, 3] is quite slow!
we can go with [0, 1, 2, 3, 4, 5, 6, 7] (so shape [1, N + T - 1]) and use torch.unfold() to reshape. By tweaking the values of batch_size, sequence_length and step I could accomplish this. Not elegant, but it works. And it is fast.
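As a toy illustration of what the unfold trick does on frame indices alone (a minimal sketch, separate from the wrapper below):

import torch

indices = torch.arange(8)          # frame numbers 0..7, i.e. N = B + T - 1 = 8
windows = indices.unfold(0, 5, 1)  # sliding windows of length T=5, stride 1
print(windows)
# tensor([[0, 1, 2, 3, 4],
#         [1, 2, 3, 4, 5],
#         [2, 3, 4, 5, 6],
#         [3, 4, 5, 6, 7]])

Here's the wrapper snippet: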
from typing import Tuple

import torch
from nvidia.dali.plugin.pytorch import DALIGenericIterator


class SmartVideoIterator(DALIGenericIterator):
    def __init__(
        self,
        sequence_length: int,
        batch_size: int,
        *args,
        **kwargs,
    ):
        """An iterator which manipulates frames so that they can be used as a sliding
        window traversal.

        For this trick to work, the video pipeline must have the following parameters:
        * ``batch_size=1``
        * ``sequence_length=(batch_size + sequence_length - 1)``
        * ``step=batch_size``
        * ``shuffle=False``
        * ``pad_sequences=True``

        Args:
            sequence_length: The desired sequence length.
            batch_size: The desired batch size.
        """
        super().__init__(*args, **kwargs)
        self.true_sequence_length = sequence_length
        self.true_batch_size = batch_size

    def __next__(self):
        return super().__next__()

    def reshape(self, frames: torch.Tensor) -> Tuple[torch.Tensor, int]:
        """Reshape the decoded tensor into [B, C, T, H, W].

        In order to speed up the decoding process, the batch dimension is shifted to
        the [T]ime dimension. This function reshapes according to the original shape
        definition.

        Args:
            frames: Tensor of shape [1, N, H, W, 3], where N = B + T - 1.

        Returns:
            A tuple ``(frames, this_batch_size)``, where the tensor is reshaped as
            intended.

        Raises:
            StopIteration: if the non-zero elements are not enough to fill even a
                single sequence.
        """
        if (
            len(frames.shape) != 5
            or frames.shape[0] != 1
            or frames.shape[1] != self.true_batch_size + self.true_sequence_length - 1
            or frames.shape[-1] != 3
        ):
            raise ValueError(
                f"Unexpected tensor shape {frames.shape}. Expected [1, N, H, W, 3]."
            )
        # Padded (all-zero) frames at the end of the video reduce the effective batch.
        this_batch_size = int(
            (torch.amax(frames, dim=(2, 3, 4)) > 0).sum()
            - (self.true_sequence_length - 1)
        )
        if this_batch_size < 1:
            raise StopIteration
        output = frames[0].unfold(0, self.true_sequence_length, 1).permute(0, 3, 4, 1, 2)
        output = output[:this_batch_size]
        return output, this_batch_size
Then, the iterator (which extends the SmartVideoIterator above) would have:

def __next__(self):
    out = super().__next__()
    # DDP is used, so there is only one pipeline per process
    frames, this_batch_size = self.reshape(out[0]["frames"])
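For reference, a hedged sketch of how the pieces could be wired together (the video path and parameter values are illustrative; it assumes video_pipeline from the first snippet, lifted to module scope and extended with pad_sequences=True, as the docstring requires):

B, T = 4, 5  # desired batch size and sequence length

iterator = SmartVideoIterator(
    sequence_length=T,
    batch_size=B,
    pipelines=[video_pipeline(
        video_path="tests/files/1000frames1080p.mp4",
        sequence_length=B + T - 1,  # decode B + T - 1 consecutive frames per sample
        step=B,                     # advance by B frames between samples
        batch_size=1, device_id=0, num_threads=1,
    )],
    output_map=["frames"],
    reader_name="reader",
)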
The DALI video reader is mostly designed for training and random sampling. The solution you propose is probably the best with the current DALI design, as it can just create different views of the same underlying memory, while in DALI this is just a special case that is not optimized yet (we assume that each sample in the batch can be independent).
Yes, I agree. In fact, this is only for the predict_dataloader() (from a Lightning perspective).
I have a question: how does DALI's fn.readers.video decode a video fully? As we know, with the GPU Codec SDK you can get the frame count of a video and decode it until it returns null data.
Hi @fromse95,
The DALI video reader can be used for training, where the usual use case is to return a batch of randomly sampled sequences of a given step, stride, and length, and for inference, where you either have short clips that you want to decode fully (one video is one sample) or a long video from which you return n frames as a single sample. If you can tell us how you want to decode and consume the data, I can provide more detailed guidance. Also, I recommend checking the Video Processing Framework, which provides a Python API to NVIDIA GPU-accelerated video decoding.
Our training tasks require decoding every video fully into a list and then converting it to torch tensors on the GPU. If we use DALI to decode videos of various durations (frame counts), which ops can do it? For long videos we just want the first 300 frames (one frame per second, i.e. a 300 s video), and if it is not a long one, we take one frame per second for the whole clip. We use DALI because it can deliver GPU-decoded data directly to torch on the GPU device; otherwise we must copy to CPU memory and then copy to torch on the GPU. Thanks for your answer! Another question: the GPU decoder SDK released by NVIDIA cannot decode all video types successfully. During training, will the decoding pipeline fail, and how does DALI deal with decoder errors?
Hi @fromse95,
decode videos of various durations (frame counts), which ops can do it? For long videos we just want the first 300 frames (one frame per second, i.e. a 300 s video), and if it is not a long one, we take one frame per second for the whole clip.
I think experimental.decoders.video is the best match for your requirements. You need to provide the data to it using the external_source operator. For example, something like this:
import numpy as np
import nvidia.dali.fn as fn
import nvidia.dali.types as types
from nvidia.dali import pipeline_def

filenames = [
    'DALI_extra/db/video/cfr/test_2_hevc.mp4',
    'DALI_extra/db/video/cfr/test_2.mp4',
    'DALI_extra/db/video/cfr/test_2.avi',
    'DALI_extra/db/video/cfr/test_1_hevc.mp4'
]
batch_size = 5


def video_loader(batch_size, epochs=1):
    # Yield batches of raw, encoded video files read from disk.
    idx = 0
    while idx < epochs * len(filenames):
        batch = []
        for _ in range(batch_size):
            batch.append(np.fromfile(filenames[idx % len(filenames)], dtype=np.uint8))
            idx = idx + 1
        yield batch


@pipeline_def(device_id=0)
def video_decoder_pipeline(source):
    data = fn.external_source(source=source, dtype=types.UINT8, ndim=1)
    return fn.experimental.decoders.video(data, device="mixed")
should do the trick.
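And a hedged sketch of building and running it (the argument values are illustrative):

pipe = video_decoder_pipeline(
    source=video_loader(batch_size),
    batch_size=batch_size,
    num_threads=2,
)
pipe.build()
(videos,) = pipe.run()         # one fully decoded video per sample, on the GPU
first = videos.as_cpu().at(0)  # copy the first sample to the host as a numpy array
print(first.shape)             # e.g. (num_frames, height, width, 3)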
Another question: the GPU decoder SDK released by NVIDIA cannot decode all video types successfully. During training, will the decoding pipeline fail, and how does DALI deal with decoder errors?
DALI doesn't support a fallback to the CPU decoder. There is no elegant solution for that yet besides catching the exception and decoding that particular video on the CPU using a different solution.
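A minimal sketch of such a fallback, where pipe is the pipeline above and decode_on_cpu / problematic_video_path are hypothetical placeholders (e.g. a PyAV- or OpenCV-based decoder):

try:
    (videos,) = pipe.run()
except RuntimeError as err:
    # The GPU decoder rejected this particular file; fall back to a CPU path.
    print(f"GPU decode failed, falling back to CPU: {err}")
    videos = decode_on_cpu(problematic_video_path)  # hypothetical helper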
When I run this code, this error happens:

Error when executing Mixed operator experimental__decoders__Video encountered:
Stacktrace (16 entries):
[frame 0]: /usr/local/python3/lib/python3.8/site-packages/nvidia/dali/libdali_operators.so(+0x63341e) [0x7f712ba2741e]
File "/usr/local/python3/lib/python3.8/site-packages/nvidia/dali/pipeline.py", line 1037, in _outputs
RuntimeError: Critical error in pipeline: Error when executing Mixed operator experimental__decoders__Video encountered:

The code:
filenames = [  # '/data/home/wfq/pytest/dldata/268.mp4',
               # '/data/home/wfq/pytest/dldata/440.mp4',
    '/data/models/dldata/test.mp4']
batch_size = 3
epochs = 1


def video_loader(batch_size, epochs=1):
    idx = 0
    while idx < epochs * len(filenames):
        batch = []
        for _ in range(batch_size):
            batch.append(np.fromfile(filenames[idx % len(filenames)], dtype=np.uint8))
            idx = idx + 1
        # print('batch num:', len(batch))
        # print('data:', batch[2])
        yield batch


# @pipeline_def(device_id=0, num_threads=1, batch_size=1)
@pipeline_def
def video_decoder_pipeline():
    data = fn.external_source(device='cpu', source=video_loader(1), dtype=types.UINT8, ndim=1)
    seq = fn.experimental.decoders.video(data, device="mixed")
    return seq
    # print('size:', len(data))
    # print('for data', data[0])


pipe = video_decoder_pipeline(batch_size=1, num_threads=3, device_id=0, prefetch_queue_depth=1)
pipe.build()
for f in filenames:
    img_out = pipe.run()
    if device == 'mixed':
        frame = img_out[0].as_cpu().as_array()[0]
Hi @fromse95,
Does it happen only for this video or also for the ones from https://github.com/NVIDIA/DALI_extra/tree/main/db/video?
Does it happen for the CPU decoder as well? Can you share the problematic video?
This video https://github.com/NVIDIA/DALI_extra/blob/main/db/video/cfr_test.mp4 gives the same error. Does my code above have bugs? test.mp4 can be decoded by the GPU Codec SDK. I am also looking for a way to decode a video in C++ with the NVIDIA Codec SDK: the data is on the GPU and currently has to be copied to host memory and then into a PyTorch tensor. How can I pass the GPU data directly to a tensor constructed in PyTorch? Decoding a video and copying to CPU memory are done in C++ and the data is passed to torch via pybind11, but I don't know how to pass GPU data to PyTorch.
@fromse95,
It seems there is an issue with certain videos in this particular decoder. Let us take a look at it.
I think I identified the problem and https://github.com/NVIDIA/DALI/pull/4723 is an attempt to fix it. Please check the nightly build that follows the merge of the fix.
Thank you for your answers, they are very helpful!