[feature request] ffmpeg backend for simplifying decoding of audio/video inputs
https://github.com/triton-inference-server/dali_backend/ is awesome for reading and preprocessing images
It would be nice to have a more developed built-in solution for decoding audio/video.
Currently, one can of course do audio/video decoding in the Python backend by invoking ffmpeg libraries under the hood. But for very long audio or video, it would be nice to have proper streaming capability, so that models can start executing without waiting for the whole input file to be decoded.
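To make the streaming idea concrete, here is a rough sketch of what chunked decoding could look like today from Python, assuming an `ffmpeg` binary is on `PATH`. The names `iter_pcm_chunks` and `stream_decode` are made up for illustration and are not part of Triton or any existing backend; the output format (16-bit little-endian PCM) is also just an assumption.

```python
import subprocess
from typing import BinaryIO, Iterator

BYTES_PER_SAMPLE = 2  # assumption: ffmpeg is asked for signed 16-bit little-endian PCM


def iter_pcm_chunks(stream: BinaryIO, frames_per_chunk: int, channels: int = 1) -> Iterator[bytes]:
    """Yield fixed-size PCM chunks from a binary stream; the last chunk may be shorter."""
    chunk_bytes = frames_per_chunk * channels * BYTES_PER_SAMPLE
    while True:
        buf = bytearray()
        # A pipe may return fewer bytes than requested, so keep reading until
        # the chunk is full or the stream ends.
        while len(buf) < chunk_bytes:
            piece = stream.read(chunk_bytes - len(buf))
            if not piece:
                break
            buf += piece
        if not buf:
            return
        yield bytes(buf)
        if len(buf) < chunk_bytes:
            return  # short chunk means end of stream


def stream_decode(path: str, sample_rate: int = 16000, channels: int = 1,
                  frames_per_chunk: int = 16000) -> Iterator[bytes]:
    """Hypothetical helper: decode `path` with the ffmpeg CLI and yield PCM chunks as they arrive."""
    cmd = [
        "ffmpeg", "-nostdin", "-loglevel", "error",
        "-i", path, "-vn",
        "-f", "s16le", "-acodec", "pcm_s16le",
        "-ac", str(channels), "-ar", str(sample_rate),
        "pipe:1",
    ]
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    try:
        yield from iter_pcm_chunks(proc.stdout, frames_per_chunk, channels)
    finally:
        proc.stdout.close()
        if proc.poll() is None:
            proc.kill()  # consumer stopped early; do not leave the decoder running
        proc.wait()
```

With something like this, each chunk could be sent to a model as soon as it is decoded (e.g. `for chunk in stream_decode("long_recording.mp3"): ...`), rather than after the whole file is done.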
Another question is failure / segfault handling. Audio/video decoders like ffmpeg can have nasty bugs and crashes, so it would be nice to have some answers about reliability and automatic crash recovery (and also about process/memory/cgroups isolation in case there are RCE bugs in decoders).
Another useful feature would be limiting the max resources (compute time, memory, etc.) used by the decoder, to protect oneself from "zipbombs" or, again, RCE bugs in parsers/decoders.
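As an illustration of the kind of isolation and resource caps meant here, below is a minimal sketch that runs a decoder command in a separate process with CPU-time, address-space and wall-clock limits, and reports whether it died from a signal such as SIGSEGV. It is Unix-only (it relies on Python's `resource` module), the helper names `run_limited` and `crash_signal` are hypothetical, and the default limits are arbitrary. It does not cover cgroups or sandboxing, which a real backend would probably want on top.

```python
import resource
import signal
import subprocess
from typing import Optional, Sequence


def _lower_limit(kind: int, value: int) -> None:
    """Set soft and hard limits to `value`, never exceeding the current hard limit."""
    _soft, hard = resource.getrlimit(kind)
    if hard != resource.RLIM_INFINITY:
        value = min(value, hard)
    resource.setrlimit(kind, (value, value))


def run_limited(cmd: Sequence[str], cpu_seconds: int = 30,
                memory_bytes: int = 2 * 1024 ** 3,
                wall_timeout: float = 60.0) -> subprocess.CompletedProcess:
    """Hypothetical helper: run `cmd` in a child process with resource caps.

    Raises subprocess.TimeoutExpired if the wall-clock timeout is exceeded
    (the child is killed in that case).
    """
    def apply_limits() -> None:
        # Runs in the child between fork and exec, so only the decoder is limited.
        _lower_limit(resource.RLIMIT_CPU, cpu_seconds)
        _lower_limit(resource.RLIMIT_AS, memory_bytes)

    # Note: preexec_fn is not safe in multi-threaded parents; fine for a sketch.
    return subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                          timeout=wall_timeout, preexec_fn=apply_limits)


def crash_signal(result: subprocess.CompletedProcess) -> Optional[signal.Signals]:
    """Return the signal that killed the child (e.g. SIGSEGV), or None if it exited normally."""
    if result.returncode < 0:
        return signal.Signals(-result.returncode)
    return None
```

The parent process survives a decoder crash and can decide whether to retry or return an error for that request. For example, `crash_signal(run_limited(["ffmpeg", "-i", "input.mp4", "-f", "null", "-"]))` would tell the caller whether ffmpeg was killed by a signal.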
So having a single, well-tested solution as a core backend might be beneficial to many.
I believe DALI should also be helpful for audio and video data. @szalpal, could you please recommend something?
DALI supports some audio formats (flac, wav, opus), but not all. And it has other important limitations (e.g. https://github.com/NVIDIA/DALI/issues/5597)
A simple, hackable, pure-ffmpeg-powered backend (compiled against libavcodec/libavformat etc.) would still be very useful for many scenarios where we have to deal with less constrained inputs or would like to leverage legacy ffmpeg filter graphs.
If DALI / ffmpeg / libsndfile are supported, it would be nice if they allowed specifying offset/length/channels to cap the size of the returned output, and also always returned file metadata (e.g. full duration), so the user can loop over channels and duration chunks.
And I guess the same functionality is important for video.
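For reference, the offset/length/channels idea maps fairly directly onto existing ffmpeg CLI options (`-ss`, `-t`, `-ac`), and the metadata part onto `ffprobe`. A rough sketch of the user-side loop, assuming the `ffmpeg` and `ffprobe` binaries are available; all function names here are hypothetical, and the JSON layout assumed in `parse_probe_output` is what `ffprobe -of json` prints for the `-show_entries` selection shown:

```python
import json
from typing import Dict, List, Tuple


def probe_cmd(path: str) -> List[str]:
    """ffprobe command that prints container duration and per-stream channel counts as JSON."""
    return ["ffprobe", "-v", "error",
            "-show_entries", "format=duration:stream=channels",
            "-of", "json", path]


def parse_probe_output(text: str) -> Dict[str, object]:
    """Extract duration (seconds) and channel counts from ffprobe's JSON output."""
    info = json.loads(text)
    channels = [s["channels"] for s in info.get("streams", []) if "channels" in s]
    return {"duration": float(info["format"]["duration"]), "channels": channels}


def chunk_plan(duration: float, chunk_seconds: float) -> List[Tuple[float, float]]:
    """Split [0, duration) into (offset, length) pairs of at most chunk_seconds each."""
    plan = []
    offset = 0.0
    while offset < duration:
        plan.append((offset, min(chunk_seconds, duration - offset)))
        offset += chunk_seconds
    return plan


def decode_cmd(path: str, offset: float, length: float, channels: int) -> List[str]:
    """ffmpeg command that decodes only [offset, offset + length) to raw s16le PCM on stdout."""
    return ["ffmpeg", "-nostdin", "-loglevel", "error",
            "-ss", str(offset), "-t", str(length),  # input options: seek, then cap the duration
            "-i", path, "-vn",
            "-ac", str(channels), "-f", "s16le", "pipe:1"]
```

The caller would run `probe_cmd` once, build a plan with `chunk_plan`, and then run `decode_cmd` per chunk, so the size of each returned buffer is bounded by the chunk length and channel count.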