testffmpeg poor performance
While evaluating the performance of several hardware video decode playback applications on Linux using a custom FFmpeg-based hardware decoder (Tegra NVDEC), I have recorded a stark difference in performance between them.
For reference, NVIDIA provides an example application, 00_video_decode (https://docs.nvidia.com/jetson/l4t-multimedia/l4t_mm_00_video_decode.html), that plays back a video by interacting directly with X11. 00_video_decode is able to DECODE a 4K video at 97fps with vsync disabled (a locked 60fps with vsync enabled). This is the baseline. Enabling the DISPLAY function drops this to 91fps.
We have an FFmpeg decoder based on this that can achieve 85fps when not displaying video (lower than the baseline because of the conversion needed for better media playback compatibility).
Using ffplay (which uses an SDL2-based renderer) with this decoder, the achieved playback is only 28fps.
Using testffmpeg with this decoder, the achieved playback is only 43fps.
testffmpeg's playback performance is half of what the decoder is capable of.
Here is the example command (note that performance is not negatively affected by mangohud; it is simply used as an easy way to monitor and record the framerate; the video is a 4K 60fps VP9 video from YouTube):
__GL_SYNC_TO_VBLANK=0 mangohud --dlsym SDL_RENDER_DRIVER=opengl ./test/testffmpeg '/tmp/Costa Rica in 8K ULTRA HD HDR - The Rich Coast (60 FPS) [rZ4uXL9CXOs].webm'
I am looking for thoughts or assistance in troubleshooting this discrepancy in playback performance.
I am using the same hardware as mentioned in this bug report: https://github.com/libsdl-org/SDL/issues/10470#issue-2447267993
The accelerated rendering path will only work with OpenGL ES, so you're getting a slow copy to CPU and then back to GPU for rendering.
I don't know anything about your setup, but you should debug testffmpeg and see if AV_PIX_FMT_DRM_PRIME is being used.
Looking at the NVIDIA sample, they're using OpenGL ES internally, in the same way testffmpeg would: https://docs.nvidia.com/jetson/l4t-multimedia/classNvEglRenderer.html
A quick test might be:
__GL_SYNC_TO_VBLANK=0 mangohud --dlsym SDL_RENDER_DRIVER=opengles2 ./test/testffmpeg '/tmp/Costa Rica in 8K ULTRA HD HDR - The Rich Coast (60 FPS) [rZ4uXL9CXOs].webm'
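If it is easier to test from code than via the environment, the same thing can be requested when creating the renderer; a minimal sketch, assuming SDL3 and an already-created window:
/* Minimal sketch: explicitly request the opengles2 render driver instead of
 * setting SDL_RENDER_DRIVER in the environment, then log what was created.
 * Assumes `window` has already been created. */
SDL_Renderer *renderer = SDL_CreateRenderer(window, "opengles2");
if (renderer) {
    SDL_Log("Created renderer %s", SDL_GetRendererName(renderer));
}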
The accelerated rendering path will only work with OpenGL ES, so you're getting a slow copy to CPU and then back to GPU for rendering.
I guess this bug needs to be fixed then, since that path is currently broken in testffmpeg: https://github.com/libsdl-org/SDL/issues/10470
A quick test might be:
__GL_SYNC_TO_VBLANK=0 mangohud --dlsym SDL_RENDER_DRIVER=opengles2 ./test/testffmpeg '/tmp/Costa Rica in 8K ULTRA HD HDR - The Rich Coast (60 FPS) [rZ4uXL9CXOs].webm'
see above
A quick test might be:
__GL_SYNC_TO_VBLANK=0 mangohud --dlsym SDL_RENDER_DRIVER=opengles2 ./test/testffmpeg '/tmp/Costa Rica in 8K ULTRA HD HDR - The Rich Coast (60 FPS) [rZ4uXL9CXOs].webm'
Since SDL_VIDEO_FORCE_EGL=1 (the default) is currently broken, I was able to use:
__GL_SYNC_TO_VBLANK=0 mangohud --dlsym SDL_VIDEO_FORCE_EGL=0 ./test/testffmpeg '/tmp/Costa Rica in 8K ULTRA HD HDR - The Rich Coast (60 FPS) [rZ4uXL9CXOs].webm'
The opengles2 renderer was used; however, performance is still 43fps:
INFO: Created renderer opengles2
INFO: Video stream: vp9 3840x2160
Opening in BLOCKING MODE
INFO: ffmpeg verbose: Old NvBuffer Utils version
NvMMLiteOpen : Block : BlockType = 280
NVMEDIA: Reading vendor.tegra.display-size : status: 6
NvMMLiteBlockCreate : Block : BlockType = 280
INFO: ffmpeg verbose: Starting capture thread
INFO: ffmpeg verbose: Resolution changed to: 3840x2160
INFO: ffmpeg verbose: Colorspace ITU-R BT.601 with standard range luma (16-235)
INFO: ffmpeg verbose: Query and set capture successful
INFO: ffmpeg verbose: Resource unavailable!
INFO: ffmpeg verbose: Resource unavailable!
INFO: ffmpeg verbose: Exiting decoder capture loop thread
INFO: ffmpeg verbose: Decoder Run was successful
INFO: ffmpeg verbose: Statistics: 8278321 bytes read, 0 seeks
In case it helps, this is NVIDIA's EGL renderer library.
It comes from https://repo.download.nvidia.com/jetson/t210/pool/main/n/nvidia-l4t-jetson-multimedia-api/nvidia-l4t-jetson-multimedia-api_32.3.1-20191209225816_arm64.deb
You'll need to debug and see what's happening. I can't tell from here what might be going on.
The main thing I can tell right off the bat is that the testffmpeg thread is at 100% CPU utilization (on one thread), while 00_video_decode is nowhere near that (20%). This probably indicates one or multiple CPU copies of the video frame occurring.
I don't know anything about your setup, but you should debug testffmpeg and see if AV_PIX_FMT_DRM_PRIME is being used.
AV_PIX_FMT_DRM_PRIME is not used because the FFmpeg decoder in use does not have a special hardware context frame format (frame->format) available (the frame format and the pixel format are the same). I assume this is causing extra overhead that NVIDIA's video_decode example does not have, because it does use glEGLImageTargetTexture2DOES. testffmpeg only uses glEGLImageTargetTexture2DOES when one of the FFmpeg hardware contexts is available (AV_PIX_FMT_DRM_PRIME, AV_PIX_FMT_VAAPI, etc.).
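For illustration, here is a minimal sketch (not the actual testffmpeg code) of the check that ends up selecting the slow path; with a plain software frame->format the pixels live in system memory, so every frame gets copied and re-uploaded, which lines up with the 100% single-thread CPU usage noted above:
/* Minimal sketch (not taken from testffmpeg): whether a decoded AVFrame can
 * take the zero-copy EGLImage path or must fall back to a CPU upload. */
static SDL_bool FrameSupportsZeroCopy(const AVFrame *frame)
{
    switch (frame->format) {
    case AV_PIX_FMT_DRM_PRIME:   /* DMABUF exported by the decoder */
    case AV_PIX_FMT_VAAPI:       /* hardware surface, exportable to DRM PRIME */
        return SDL_TRUE;
    default:
        /* A plain software pixel format (e.g. NV12 in system memory) means
         * every frame is copied to the CPU and re-uploaded to the GPU via
         * SDL_UpdateTexture()/SDL_UpdateNVTexture(), i.e. the extra copies
         * suspected above. */
        return SDL_FALSE;
    }
}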
I'm not sure where it's getting the DMABUF file descriptor, but this is the function turning that into an EGL image: NvEGLImageFromFd()
Obviously testffmpeg isn't set up to use that function, but you could create a function to import the file descriptor from wherever it came from, e.g.
static SDL_bool GetOESTextureForFD(int fd, SDL_Texture **texture)
{
    /* renderer, width and height are assumed to be available from the
     * surrounding code (they are globals in testffmpeg.c). */
    EGLDisplay display = eglGetCurrentDisplay();
    SDL_PropertiesID props;
    GLuint textureID;

    if (!*texture) {
        /* One-time setup: create an external-OES texture for the renderer. */
        *texture = SDL_CreateTexture(renderer, SDL_PIXELFORMAT_EXTERNAL_OES, SDL_TEXTUREACCESS_STATIC, width, height);
        if (!*texture) {
            return SDL_FALSE;
        }
        SDL_SetTextureBlendMode(*texture, SDL_BLENDMODE_NONE);
        SDL_SetTextureScaleMode(*texture, SDL_SCALEMODE_LINEAR);
    }

    /* Look up the GL texture ID that SDL created for this texture. */
    props = SDL_GetTextureProperties(*texture);
    textureID = (GLuint)SDL_GetNumberProperty(props, SDL_PROP_TEXTURE_OPENGLES2_TEXTURE_NUMBER, 0);
    if (!textureID) {
        SDL_SetError("Couldn't get OpenGL texture");
        return SDL_FALSE;
    }

    /* Import the DMABUF as an EGLImage and bind it to the external texture. */
    EGLImage image = NvEGLImageFromFd(display, fd);
    if (image == EGL_NO_IMAGE) {
        SDL_Log("Couldn't create image: %d\n", eglGetError());
        return SDL_FALSE;
    }
    glActiveTextureARBFunc(GL_TEXTURE0_ARB);
    glBindTexture(GL_TEXTURE_EXTERNAL_OES, textureID);
    glEGLImageTargetTexture2DOESFunc(GL_TEXTURE_EXTERNAL_OES, image);
    /* Note: the EGLImage is not released here; presumably NvDestroyEGLImage()
     * should be called once the frame is no longer being displayed. */
    return SDL_TRUE;
}
(completely untested, of course)
Creating the texture and getting the texture ID is one time setup that you could move out, depending on how you structure your code.
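For example, per-frame usage might look something like this (GetFrameDMABUF() is a placeholder for however your decoder exposes the frame's DMABUF file descriptor, not a real API):
/* Hypothetical per-frame usage of GetOESTextureForFD(). GetFrameDMABUF() is
 * a stand-in for the decoder-specific way of getting the frame's DMABUF fd. */
static SDL_Texture *video_texture = NULL;

static void DisplayFrame(AVFrame *frame)
{
    int fd = GetFrameDMABUF(frame);  /* placeholder, decoder-specific */
    if (fd >= 0 && GetOESTextureForFD(fd, &video_texture)) {
        SDL_RenderClear(renderer);
        SDL_RenderTexture(renderer, video_texture, NULL, NULL);
        SDL_RenderPresent(renderer);
    }
}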
We are scoping work for the SDL 3.2.0 release, so please let us know if this is a showstopper for you.