Add an operator for receiving video metadata

Open tomresan opened this issue 1 year ago • 5 comments

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

Should have (e.g. Adoption is possible, but the performance shortcomings make the solution inferior).

Please provide a clear description of problem this feature solves

The sample rate (fps) of videos may vary, and hence the time span that a fixed number of frames represents also varies. Having access to the fps, the duration, or even the concrete timestep of each frame is often crucial in tasks where the actual duration in seconds matters more than the number of frames. For example, I am decoding raw video bytes from a web dataset using the experimental video decoder, and I am forced to fall back on other libraries that can give me this kind of information from the raw video bytes (specifically, PyTorch's VideoReader API).

Feature Description

As a user I want to be able to extract information about the sample rate of a video alongside its decoded frames.

Describe your ideal solution

A new DALI operator that extracts the desired metadata from raw video bytes. An example video decoding pipeline reading from a webdataset (raw video bytes could also come from an external source):

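from nvidia.dali import fn, pipeline_def

@pipeline_def
def pipeline(tar_paths):
    # Read the encoded (raw) video bytes from the webdataset tar files
    raw_video = fn.readers.webdataset(tar_paths, ...)
    # Proposed operator (does not exist yet): peek metadata from the raw bytes
    duration, fps = fn.get_video_metadata(...)
    video = fn.experimental.decoders.video(raw_video)
    return video, duration, fps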

Describe any alternatives you have considered

No response

Additional context

No response

Check for duplicates

  • [x] I have searched the open bugs/issues and have found no duplicates for this bug report

tomresan avatar Sep 10 '24 15:09 tomresan

Hi @treasan,

Thank you for reaching out. Yes, that sounds like a good feature to add. Let us add this to our ToDo list. Could you also tell me how you want to use this data further? To drive transformations or to feed the model?

JanuszL avatar Sep 10 '24 16:09 JanuszL

Hey @JanuszL

I am training a model that expects video snippets of a certain duration (in seconds). Furthermore, it expects a timestep for each frame, which is used for a temporal positional encoding.
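
For illustration, a minimal sketch of how per-frame timesteps could be derived once the fps is known. It assumes a constant frame rate (a variable-frame-rate video would need the real presentation timestamps from the container), and the function name `frame_timesteps` is hypothetical, not part of DALI:

```python
def frame_timesteps(num_frames: int, fps: float, start_frame: int = 0) -> list[float]:
    """Return the timestamp in seconds of each frame, assuming a constant fps."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    # Frame i (counted from the start of the video) is shown at i / fps seconds.
    return [(start_frame + i) / fps for i in range(num_frames)]


# The same 4 frames span different amounts of time at different frame rates.
print(frame_timesteps(4, fps=4.0))
print(frame_timesteps(4, fps=8.0))
```

The resulting list could then be passed to the model next to the decoded frames as input to the temporal positional encoding.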

tomresan avatar Sep 10 '24 18:09 tomresan

Thank you for the clarification. In this case, I think it would be best to return this data directly from the video decoder (at least the timesteps for each frame), and/or to extend the decoder so that it decodes a given duration rather than a given number of frames.

JanuszL avatar Sep 10 '24 18:09 JanuszL

Hello @treasan

Thanks for creating the issue. To better understand the requirement, I wanted to ask: does your use case expect the samples to have the same number of frames, or does the number of frames vary per sample? If it varies, is that due to variable frame rates in the video, variable duration of frames in seconds, or both? And if it varies, what are the expected type and shape of the output in your desired framework?

awolant avatar Sep 10 '24 19:09 awolant

Please have a look at another issue/question I have submitted, #5626. I explain my pipeline there in more detail.

tl;dr:

  1. DALI pipeline: Load raw video bytes from the webdataset.
  2. Python function: Peek the duration and fps metadata from the raw video bytes and filter out unwanted videos beforehand (e.g. ones that are too short).
  3. DALI pipeline: Get raw video bytes, duration, fps from external source --> decode video --> return decoded video, duration, fps.
  4. Python function: Cut multiple consecutive snippets of a certain duration (e.g. 3 secs) out of each video based on the fps/duration metadata. These snippets constitute one training sample. They get batched and fed to the model alongside their timesteps, which are also calculated from the fps/duration metadata.
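
Step 4 could be sketched roughly as follows. This is only an illustration under the assumption of a constant frame rate; the function name `snippet_frame_ranges` and its parameters are hypothetical and not taken from the actual pipeline:

```python
from typing import Optional


def snippet_frame_ranges(
    num_frames: int,
    fps: float,
    snippet_seconds: float,
    max_snippets: Optional[int] = None,
) -> list[tuple[int, int]]:
    """Split a decoded video into consecutive, non-overlapping snippets.

    Returns half-open (start, end) frame index ranges, each covering
    roughly `snippet_seconds` of video at the given constant `fps`.
    """
    frames_per_snippet = round(snippet_seconds * fps)
    if frames_per_snippet <= 0:
        raise ValueError("snippet is shorter than one frame")
    # Only keep snippets that fit completely into the video.
    count = num_frames // frames_per_snippet
    if max_snippets is not None:
        count = min(count, max_snippets)
    return [
        (i * frames_per_snippet, (i + 1) * frames_per_snippet)
        for i in range(count)
    ]


# A 10 second video at 25 fps yields three full 3 second snippets.
print(snippet_frame_ranges(num_frames=250, fps=25.0, snippet_seconds=3.0))
```

Each range can be used to slice the decoded frame tensor along its time axis, and the same indices divided by the fps give the timesteps of the frames in that snippet.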

So, the optimal solution for my use case would be a DALI operator that peeks this metadata from the raw video bytes, since I could then filter out unwanted videos before the decoding step (more efficient). This might be similar to the peek_image_shape operator, which gives certain information about an encoded image.
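
The filter-before-decode idea can be illustrated in plain Python. In this sketch the `peek` callable stands in for whatever reads the metadata from the encoded bytes (today an external library, in the future possibly the requested operator); `VideoMeta`, `filter_by_duration` and the fake metadata table are all hypothetical names used only for this example:

```python
from typing import Callable, Iterable, Iterator, NamedTuple


class VideoMeta(NamedTuple):
    duration: float  # seconds
    fps: float       # frames per second


def filter_by_duration(
    samples: Iterable[bytes],
    peek: Callable[[bytes], VideoMeta],
    min_seconds: float,
) -> Iterator[tuple[bytes, VideoMeta]]:
    """Yield only the videos that are long enough, without decoding any frames."""
    for raw in samples:
        meta = peek(raw)  # cheap: inspects metadata only, no frame decoding
        if meta.duration >= min_seconds:
            yield raw, meta


# Stand-in for a real metadata reader: look the values up in a fixed table.
fake_metadata = {
    b"short-clip": VideoMeta(duration=1.5, fps=30.0),
    b"long-clip": VideoMeta(duration=12.0, fps=25.0),
}
kept = list(filter_by_duration(fake_metadata, fake_metadata.__getitem__, min_seconds=3.0))
print(kept)
```

Only the samples that survive the filter, together with their metadata, would then be passed on to the decoding pipeline.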

tomresan avatar Sep 10 '24 19:09 tomresan