[feature request] ffmpeg backend for simplifying decoding of audio/video inputs

Open vadimkantorov opened this issue 1 year ago • 1 comments

https://github.com/triton-inference-server/dali_backend/ is awesome for reading and preprocessing images

It would be nice to have a more developed built-in solution for decoding audio and video.

Of course, one can already do audio/video decoding in the Python backend by invoking ffmpeg libraries under the hood. But for very long audio or video, it would be nice to have a proper streaming capability, so that models can start executing without waiting for the whole input file to be decoded.
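
For illustration, here is a minimal sketch of what chunked decoding could look like from Python today, assuming an `ffmpeg` binary is on `PATH`. The helper names `ffmpeg_pcm_command` and `stream_chunks` are hypothetical and not part of Triton or any existing backend. The idea is to read raw PCM from ffmpeg's stdout in fixed-size chunks, so downstream work can start before the whole file is decoded.

```python
import subprocess
from typing import Iterator, List


def ffmpeg_pcm_command(path: str, sample_rate: int = 16000, channels: int = 1) -> List[str]:
    """Build an ffmpeg command that writes raw 16-bit little-endian PCM to stdout."""
    return [
        "ffmpeg", "-nostdin", "-loglevel", "error",
        "-i", path,
        "-f", "s16le",            # raw signed 16-bit little-endian samples
        "-ar", str(sample_rate),  # resample to a fixed rate
        "-ac", str(channels),     # convert to a fixed channel count
        "pipe:1",                 # write the decoded audio to stdout
    ]


def stream_chunks(command: List[str], chunk_bytes: int) -> Iterator[bytes]:
    """Yield the child's stdout in chunks of at most chunk_bytes, as it is produced."""
    proc = subprocess.Popen(command, stdout=subprocess.PIPE)
    finished = False
    try:
        while True:
            chunk = proc.stdout.read(chunk_bytes)
            if not chunk:  # EOF: the decoder closed its stdout
                break
            yield chunk
        finished = True
    finally:
        proc.stdout.close()
        if not finished:
            # The consumer stopped early or an error occurred: stop the decoder.
            proc.kill()
        returncode = proc.wait()
    if returncode != 0:
        raise subprocess.CalledProcessError(returncode, command)
```

With 16 kHz mono `s16le` output, `chunk_bytes=32000` corresponds to one second of audio, so something like `for chunk in stream_chunks(ffmpeg_pcm_command("input.mp3"), 32000): ...` would hand over one-second blocks as they are decoded. This only covers the decoding side; wiring the chunks into a model pipeline is out of scope for the sketch.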

Another question is failure / SEGFAULT handling. Audio/video decoders like ffmpeg can have nasty bugs and crashes, so it would be nice to have some answers about reliability and automatic crash recovery (and also about process/memory/cgroups isolation in case there are any RCE bugs in decoders).

Another useful feature would be limiting the max resources (compute time, memory, etc.) used by the decoder, to protect oneself from "zipbombs" or, again, from RCE/code execution bugs in parsers/decoders.
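
As a rough sketch of such limits, assuming a POSIX system and a decoder that runs as a separate process: the hypothetical helper `run_limited` below caps CPU time and address space with `resource.setrlimit` and adds a wall-clock timeout. It uses plain rlimits rather than cgroups, so it is only a partial answer to the isolation question.

```python
import resource
import subprocess
from typing import List


def run_limited(command: List[str], cpu_seconds: int, memory_bytes: int,
                wall_seconds: float) -> subprocess.CompletedProcess:
    """Run command with CPU-time, address-space and wall-clock limits (POSIX only)."""

    def apply_limits() -> None:
        # Runs in the child between fork and exec.
        # Note: preexec_fn is not safe to use in multi-threaded parent processes.
        for which, value in ((resource.RLIMIT_CPU, cpu_seconds),
                             (resource.RLIMIT_AS, memory_bytes)):
            _, hard = resource.getrlimit(which)
            if hard != resource.RLIM_INFINITY:
                value = min(value, hard)  # never try to raise an existing hard limit
            resource.setrlimit(which, (value, value))

    # subprocess.run kills the child and raises TimeoutExpired after wall_seconds.
    return subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        preexec_fn=apply_limits,
        timeout=wall_seconds,
    )
```

A crash or an exceeded limit then shows up as a non-zero `returncode` (negative when the child was killed by a signal) instead of taking down the serving process, and the caller can decide whether to retry or reject the input.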

So having a single, well-tested solution as a core backend might be beneficial to many users.

vadimkantorov avatar Sep 19 '24 09:09 vadimkantorov

I believe DALI should also be helpful for audio and video data. @szalpal , could you please recommend something?

oandreeva-nv avatar Sep 27 '24 22:09 oandreeva-nv

DALI supports some audio formats (flac, wav, opus), but not all. And it has other important limitations (e.g. https://github.com/NVIDIA/DALI/issues/5597)

A simple, hackable, pure-ffmpeg-powered backend (compiled against libavcodec/libavformat etc.) would still be very useful for many scenarios where we have to deal with less constrained inputs, or would like to leverage legacy ffmpeg filter graphs.

vadimkantorov avatar Feb 03 '25 14:02 vadimkantorov

If DALI / ffmpeg / libsndfile are supported, it would be nice if they allowed specifying offset/length/channels to cap the size of the returned output, and also always returned file metadata (e.g. full duration), so that the user can loop over channels and duration chunks.
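
To make the request concrete, here is a hedged sketch using the `ffmpeg` and `ffprobe` command-line tools. All function names are hypothetical, and the JSON layout handled by `parse_probe` is an assumption about ffprobe's `-of json` output that should be checked against a real file. The sketch reads metadata first, then builds one bounded decode command per (offset, length) chunk, optionally selecting a single channel with the `pan` filter.

```python
import json
from typing import Dict, List, Optional, Tuple


def ffprobe_command(path: str) -> List[str]:
    """Build an ffprobe command that reports duration, channels and sample rate as JSON."""
    return ["ffprobe", "-v", "error", "-select_streams", "a:0",
            "-show_entries", "format=duration:stream=channels,sample_rate",
            "-of", "json", path]


def parse_probe(output: str) -> Dict[str, float]:
    """Extract metadata from ffprobe JSON (assumed shape: 'streams' list and 'format' dict)."""
    info = json.loads(output)
    stream = info["streams"][0]
    return {
        "duration": float(info["format"]["duration"]),
        "channels": int(stream["channels"]),
        "sample_rate": int(stream["sample_rate"]),
    }


def slice_command(path: str, offset: float, length: float,
                  channel: Optional[int] = None, sample_rate: int = 16000) -> List[str]:
    """Build an ffmpeg command decoding only [offset, offset + length) to raw PCM on stdout."""
    cmd = ["ffmpeg", "-nostdin", "-loglevel", "error",
           "-ss", str(offset), "-t", str(length), "-i", path]
    if channel is not None:
        # Keep a single input channel as mono output.
        cmd += ["-af", f"pan=mono|c0=c{channel}"]
    cmd += ["-f", "s16le", "-ar", str(sample_rate), "pipe:1"]
    return cmd


def chunk_offsets(duration: float, chunk_seconds: float) -> List[Tuple[float, float]]:
    """Return (offset, length) pairs that cover the full duration."""
    offsets = []
    start = 0.0
    while start < duration:
        offsets.append((start, min(chunk_seconds, duration - start)))
        start += chunk_seconds
    return offsets
```

The user-side loop would then be roughly: probe once, iterate over `range(channels)` and `chunk_offsets(duration, chunk_seconds)`, and run one `slice_command` per pair, so each returned buffer has a known upper bound on its size.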

And I guess, the same functionality is important for video

vadimkantorov avatar Sep 19 '25 01:09 vadimkantorov