
readers.video - Sampling from a subset of frames

Open elmuz opened this issue 3 years ago • 5 comments

Hi. I want to achieve the following result, if possible.

SCENARIO I am working on a project regarding faces. I have long (labeled) video footage of which only a few segments are interesting. Maybe I have a face track from frame S1 to E1, then another face track from S2 to E2, and so on. I have a big JSON file with all these tracks and their bounding boxes, e.g.:

      [======= F1 =======]
      :            [====== F2 ======]
[-----|------------|-----|----------|-----] t
0    100          300   400        600   700

Then I have this info in the JSON:

100 -> BBox
101 -> BBox
...
350 -> [BBox, BBox]
...
450 -> BBox
...
600 -> BBox

EXPECTED BEHAVIOR I want the video reader to return a batch where each sample is a (frame, bounding box) pair:

Tuple[NDArray, Tuple[int, int, int, int]]
  • valid frames only (e.g., in the example above, excluding the ranges 0-99 and 601-700)
  • one BBox, or the full list, from which I could randomly choose one. At the moment I am not interested in temporal ranges (that's the next step); a sequence length of 1 is fine.

Is there a way to achieve this?

elmuz avatar Oct 27 '22 11:10 elmuz

Hi @elmuz, the video reader's file_list argument allows you to point DALI to a text file, which should contain entries of the following format: file label [start_frame [end_frame]]. This way you can tell DALI which frame ranges you are interested in; you can use any label as a placeholder, since you don't need it. Then you can use enable_frame_num so the video reader returns both the video sequence (of the specified length, which can be 1) and the index of the frame that was used.

You would need to look up the bounding box outside of DALI based on the frame index that was returned.

More details in the doc for video reader: https://docs.nvidia.com/deeplearning/dali/user-guide/docs/operations/nvidia.dali.fn.readers.video.html (you can find several tutorials at the bottom of the page).
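
Putting that together, a minimal sketch (the file names and paths are placeholders; file_list_frame_num=True makes the start/end columns be interpreted as frame numbers rather than timestamps):

# faces.txt -- placeholder file list matching the diagram above; label 0 is unused:
#   /data/footage.mp4 0 100 400
#   /data/footage.mp4 0 300 600
from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn

@pipeline_def
def face_frames_pipeline():
    video, label, frame_num = fn.readers.video(
        device="gpu",
        file_list="faces.txt",     # placeholder path
        file_list_frame_num=True,  # start/end are frame numbers, not seconds
        sequence_length=1,         # single frames for now
        enable_frame_num=True,     # also return the index of the decoded frame
        random_shuffle=True,
        name="Reader",
    )
    return video, frame_num

pipe = face_frames_pipeline(batch_size=8, num_threads=2, device_id=0)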

klecki avatar Oct 27 '22 12:10 klecki

I understand. It all makes sense. Thank you.

elmuz avatar Oct 27 '22 12:10 elmuz

@klecki sorry to bother again. I am using PyTorch Lightning in my system and I am writing the DALI block as a DALIGenericIterator to be used as a DataLoader. However, I see a few limitations here where maybe you can give me some hints.

  1. It seems to me that the frame number is returned as a GPU tensor, which, if I understand correctly, cannot be used as an index into other pre-computed 2D arrays (e.g. I have BBoxes of shape [n_frames, 4], or landmarks, or Mel spectrograms).
  2. So I guess this frame index can only be used outside DALI, which is ok, but I am not sure how. Can a DALI pipeline be "mixed" into the standard PyTorch Dataset/DataLoader approach?
  3. Another idea might be to let DALIGenericIterator return [frame, frame_num, all_bboxes, all_landmarks, ...] and slice those boxes in the forward function. In this case I am only concerned about unnecessary memory operations. For example:
@pipeline_def
def talking_faces_pipeline(...):
    video, _, frame_num = fn.readers.video(...)  # this is ok, batch-wise

    # additionally load the full arrays of the other features
    bboxes = load_json_and_convert_to_array(...)  # TensorGPU [n_frames, 4]
    kpts = load_json_and_convert_to_array(...)    # TensorGPU [n_frames, 15]
    mels = load_audio_and_compute_melspect(...)   # TensorGPU [n_frames, 80]

    return video, frame_num, bboxes, kpts, mels

Question: are the loaded/computed bboxes, mels, etc. cached in GPU memory, or are they copied and recomputed at every batch creation?

elmuz avatar Oct 28 '22 18:10 elmuz

Hi @elmuz,

It seems to me that the frame number is returned as a GPU tensor, which, if I understand correctly, cannot be used as an index into other pre-computed 2D arrays (e.g. I have BBoxes of shape [n_frames, 4], or landmarks, or Mel spectrograms).

Yes, DALI operators can return data either on the GPU or on the CPU, not on both devices at the same time, so the video decoder returns all its data on the GPU. We plan to allow moving data back to the CPU inside the DALI pipeline and to provide what you are asking for, but we cannot commit to any timeline yet.
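
That said, on the PyTorch side the limitation is softer than it may seem: DALIGenericIterator hands the outputs over as torch tensors, and torch can index a GPU tensor with a GPU index tensor. A sketch, assuming output_map names "frame" and "frame_num" and a precomputed torch GPU tensor bboxes_gpu:

out = next(dali_iter)[0]                         # one dict per GPU, keyed by output_map
frame_num = out["frame_num"].squeeze(-1).long()  # GPU LongTensor of frame indices
bbox = bboxes_gpu[frame_num]                     # direct GPU indexing works in torch
# alternatively, move the index to the CPU first:
# bbox = bboxes_cpu[frame_num.cpu()]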

So I guess this frame index can only be used outside DALI, which is ok, but I am not sure how. Can a DALI pipeline be "mixed" into the standard PyTorch Dataset/DataLoader approach?

You can check this example to see how to do post-processing outside of the DALI pipeline in PyTorch Lightning.

Another idea might be to let DALIGenericIterator return [frame, frame_num, all_bboxes, all_landmarks, ...] and slice those boxes in the forward function. In this case I am only concerned about unnecessary memory operations. For example:

You can load the bboxes and kpts only once and then slice them; it should cost only a bit of CPU memory. However, the request to load audio at the same time is new. Can you tell us more about how the audio should relate to the video files? Right now it is hard to synchronize audio and video processing inside one DALI pipeline.

Question: loading and computing bboxes, mels etc is something cached in the GPU memory or they're copied and computed at every batch creation?

The DALI pipeline recomputes everything each iteration, so the data can be augmented differently each time.

JanuszL avatar Oct 28 '22 18:10 JanuszL

Can you tell us more about how audio should be related to the video files?

Sure. I am working on generative models for lip-sync. This means that, given an audio input (preprocessed in the frequency domain, e.g. as a Mel spectrogram), I want to generate a face that looks like the input one: here the face crop is my "target/label". Therefore, I need to slice keypoints/boxes/... and audio features together. Audio preprocessing is super fast on the CPU and doesn't need specific augmentations: I can pre-compute it once for the full video footage and slice it on the fly based on the video frame.
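
For instance, computing one Mel column per video frame makes the slicing trivial later on. A sketch of what I mean (torchaudio-based; the path, fps, and hop-length convention are assumptions):

import torchaudio

def precompute_mels(wav_path, fps=25, n_mels=80):
    # hypothetical helper: Mel column i is aligned with video frame i
    waveform, sr = torchaudio.load(wav_path)
    hop = int(sr / fps)  # one Mel column per video frame
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sr, n_mels=n_mels, hop_length=hop)(waveform)
    return mel.squeeze(0).T  # [n_frames, n_mels]

# at batch time: mels[frame_num] selects the features for each sampled frame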

You can check this example to see how to do post-processing outside of the DALI pipeline in PyTorch Lightning.

This is nice! So I guess this is something I can go for:

    ...
    def setup(...):
        # precomputed once, kept as torch GPU tensors so they can be
        # indexed directly with the (GPU) frame numbers returned by DALI
        self.bboxes = load_json_and_convert_to_torch_gpu_array(...)
        self.melspectrogram = load_audio_compute_mels_and_convert_to_torch_gpu_array(...)

        pipeline = dali_pipeline_for_video(...)

        class LightningWrapper(DALIGenericIterator):  # can I use the DALIGenericIterator?
            def __init__(self, bboxes, mels, *args, **kwargs):
                super().__init__(*args, **kwargs)
                self.bboxes = bboxes
                self.mels = mels

            def __next__(self):
                out = super().__next__()[0]  # dict keyed by output_map
                frame, frame_num = out["frame"], out["frame_num"]
                ...
                # slice self.bboxes according to frame_num, then squeeze
                bbox = self.bboxes[frame_num.squeeze(-1).long()]
                return frame, bbox, ...

        self.train_dataloader = LightningWrapper(self.bboxes, self.melspectrogram, pipeline, ...)
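
For completeness, constructing the wrapper would then use the usual DALIGenericIterator arguments; a sketch (the output_map names and reader name are assumptions and must match the pipeline definition):

from nvidia.dali.plugin.pytorch import DALIGenericIterator, LastBatchPolicy

train_iter = LightningWrapper(
    bboxes, mels, pipeline,
    output_map=["frame", "frame_num"],  # must match the pipeline outputs
    reader_name="Reader",               # assumes fn.readers.video(..., name="Reader")
    last_batch_policy=LastBatchPolicy.PARTIAL,
)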

elmuz avatar Oct 28 '22 19:10 elmuz