SDL icon indicating copy to clipboard operation
SDL copied to clipboard

SDL3: Bugs with SDL_SetRenderVSync()/SDL_App* callbacks on Windows

Open nightmareci opened this issue 1 year ago • 15 comments

High CPU usage can be observed on Windows with vsync at 1, after making this change to the snake game example. The high CPU usage might be related to the code driving the SDL_App* callbacks, though:

// ...
    if (!SDL_CreateWindowAndRenderer("examples/game/snake", SDL_WINDOW_WIDTH, SDL_WINDOW_HEIGHT, 0, &as->window, &as->renderer)) {
        return SDL_APP_FAILURE;
    }
    if (!SDL_SetRenderVSync(as->renderer, 1)) {
        return SDL_APP_FAILURE;
    }
// ...

Setting vsync to SDL_WINDOW_SURFACE_VSYNC_DISABLED works fine, no unusual high CPU usage then. I've also seen some other issues with SDL_SetRenderVSync() on Windows, like all vsync values besides SDL_WINDOW_SURFACE_VSYNC_DISABLED and 1 not working (That operation is not supported returned by SDL_GetError()).

nightmareci avatar Oct 06 '24 19:10 nightmareci

I've just starting trying out the SDL_App* callbacks. In my own basic app everything seems okay, only CPU & GPU high usage on "0" vsync, as expected. I am getting a "op not supported" error on adaptive vsync (-1) though, due to this (is this how it should work?):

static bool D3D11_SetVSync(SDL_Renderer *renderer, const int vsync)
{
    D3D11_RenderData *data = (D3D11_RenderData *)renderer->internal;

    if (vsync < 0) {
        return SDL_Unsupported();
    }

Just tried snake. It's thrashing the CPU & GPU by default. But it works fine when inserting SDL_SetRenderVSync(as->renderer, 1), so I'm not seeing the same there.

[Stupid question alert] The snake game's AppState data - isn't that at risk of being accessed by two threads at the same time if SDL_AppIterate and SDL_AppEvent are running together on different threads? "There is (currently) no guarantee about what thread this will be called from"

AntTheAlchemist avatar Oct 06 '24 22:10 AntTheAlchemist

[Stupid question alert] The snake game's AppState data - isn't that at risk of being accessed by two threads at the same time if SDL_AppIterate and SDL_AppEvent are running together on different threads? "There is (currently) no guarantee about what thread this will be called from"

Yeah, I've wondered about that aspect of the snake game's code. The event function is writing, iterate function reading, so that's a race condition if the functions are on different threads, as far as I can tell. A potential "easy fix" is to force a shared mutex lock for both functions, so they're kept synchronized, either on the user's side, or SDL's side.

nightmareci avatar Oct 07 '24 21:10 nightmareci

@icculus, what's our guarantee about event dispatch with respect to app iteration? If it's guaranteed to be multi-threaded on some platforms, that's going to be really awkward for application code.

slouken avatar Oct 07 '24 22:10 slouken

I think AppEvent and AppIterate already serialize...? If it doesn't, it's a bug.

icculus avatar Oct 07 '24 23:10 icculus

Also, if vsync==1 and the CPU usage is high, then a) the OS is busywaiting for the sync event or b) it didn't work at all and you're running frames as fast as possible.

If setting vsync==0 causes CPU to reduce, too, something somewhere went really wrong.

icculus avatar Oct 07 '24 23:10 icculus

I've just re-read the SDL_AppEvent docs with a better understanding (apologies - I'm new at threads and still learning). Serialisation is mentioned, so there is no data race issue - my question was indeed stupid.

AntTheAlchemist avatar Oct 07 '24 23:10 AntTheAlchemist

I haven't tried this on Windows yet, but here's a test app:

#define SDL_MAIN_USE_CALLBACKS 1
#include <SDL3/SDL.h>
#include <SDL3/SDL_main.h>

// This is probably a deeply stupid way to do this.
typedef struct FpsMetrics
{
    Uint64 rolling_ticks;  // sum of the history.
    Uint64 rolling_iterations; // sum of the history.
    Uint64 start_ticks;
    Uint64 iterations;
    Uint64 tick_history[8];
    Uint64 iteration_history[8];
    int history_idx;
} FpsMetrics;

static Uint64 fps_iteration(FpsMetrics *metrics)
{
    const Uint64 now = SDL_GetTicks();
    Uint64 elapsed = now - metrics->start_ticks;

    metrics->iterations++;

    if (elapsed >= 125) {
        const Uint64 iterations = metrics->iterations;
        const size_t histlen = SDL_arraysize(metrics->tick_history);
        const int idx = metrics->history_idx;
        metrics->rolling_ticks -= metrics->tick_history[idx];
        metrics->rolling_ticks += elapsed;
        metrics->rolling_iterations -= metrics->iteration_history[idx];
        metrics->rolling_iterations += iterations;
        metrics->tick_history[idx] = elapsed;
        metrics->iteration_history[idx] = iterations;
        metrics->start_ticks = now;
        metrics->iterations = 0;
        metrics->history_idx++;
        //printf("FPS %p rolling_ticks=%u, rolling_iterations=%u\n", metrics, (unsigned int) metrics->rolling_ticks, (unsigned int) metrics->rolling_iterations);
        if (metrics->history_idx >= histlen) {
            metrics->history_idx = 0;
        }
        elapsed = 0;
    }

    const Uint64 ticks = metrics->rolling_ticks + elapsed;
    const Uint64 iterations = metrics->rolling_iterations + metrics->iterations;
    return ticks ? ((Uint64) ((((double) iterations) / (((double) ticks) / 1000.0)) + 0.5)) : 0;
}

static SDL_Window *window = NULL;
static SDL_Renderer *renderer = NULL;
static FpsMetrics fps_metrics;
static int vsync = 1;

SDL_AppResult SDL_AppInit(void **appstate, int argc, char *argv[])
{
    if (!SDL_Init(SDL_INIT_VIDEO)) {
        SDL_Log("Couldn't initialize SDL: %s", SDL_GetError());
        return SDL_APP_FAILURE;
    } else if (!SDL_CreateWindowAndRenderer("testvsync", 640, 480, 0, &window, &renderer)) {
        SDL_Log("Couldn't create window/renderer: %s", SDL_GetError());
        return SDL_APP_FAILURE;
    }

    SDL_SetRenderVSync(renderer, vsync);

    return SDL_APP_CONTINUE;
}

SDL_AppResult SDL_AppEvent(void *appstate, SDL_Event *event)
{
    if (event->type == SDL_EVENT_QUIT) {
        return SDL_APP_SUCCESS;
    } else if (event->type == SDL_EVENT_KEY_DOWN) {
        if (event->key.key == SDLK_ESCAPE) {
            return SDL_APP_SUCCESS;
        } else if (event->key.key == SDLK_V) {
            vsync = vsync ? 0 : 1;
            SDL_SetRenderVSync(renderer, vsync);
        }
    }
    return SDL_APP_CONTINUE;
}

SDL_AppResult SDL_AppIterate(void *appstate)
{
    char vsyncstr[64];
    char fpsstr[64];
    int w, h;
    float x, y;

    SDL_snprintf(vsyncstr, sizeof (vsyncstr), "vsync: %s", vsync ? "ON" : "OFF");
    SDL_snprintf(fpsstr, sizeof (fpsstr), "FPS: %" SDL_PRIu64, fps_iteration(&fps_metrics));

    SDL_GetRenderOutputSize(renderer, &w, &h);

    SDL_SetRenderDrawColor(renderer, 0, 0, 0, 255);
    SDL_RenderClear(renderer);
    SDL_SetRenderDrawColor(renderer, 255, 255, 255, 255);

    x = (float) ((w - ((int) (9 * SDL_DEBUG_TEXT_FONT_CHARACTER_SIZE))) / 2);
    y = (float) ((h - SDL_DEBUG_TEXT_FONT_CHARACTER_SIZE) / 4);
    SDL_RenderDebugText(renderer, x, y, vsyncstr);

    x = (float) ((w - ((int) (8 * SDL_DEBUG_TEXT_FONT_CHARACTER_SIZE))) / 2);
    y = (float) ((h - SDL_DEBUG_TEXT_FONT_CHARACTER_SIZE) / 2);
    SDL_RenderDebugText(renderer, x, y, fpsstr);

    x = (float) ((w - ((int) (24 * SDL_DEBUG_TEXT_FONT_CHARACTER_SIZE))) / 2);
    y += (float) ((h - SDL_DEBUG_TEXT_FONT_CHARACTER_SIZE) / 4);
    SDL_RenderDebugText(renderer, x, y, "Press V to toggle vsync.");

    SDL_RenderPresent(renderer);

    return SDL_APP_CONTINUE;
}

void SDL_AppQuit(void *appstate, SDL_AppResult result)
{
}

On x11 (well, XWayland), this gets like 3500+ fps on my laptop and eats like 65% of the CPU without vsync, and with vsync, it stays a solid 60fps and uses 3-5% of the CPU.

If it does the right thing on Windows, we can probably close this.

icculus avatar Oct 22 '24 19:10 icculus

I tested that on Windows, it produces high CPU usage with vsync both on or off, using the same amount of high CPU percentage in either case, but displays 165 FPS with either setting (my monitor was running at 165 Hz while testing). One odd thing though, holding the mouse button on the title bar and moving the window around, the FPS goes down to ~60 FPS and CPU usage goes to 0%.

I also tested it on the same machine running Linux, both with X11/XWayland and native Wayland (SDL_VIDEO_DRIVER=wayland), I got unbounded FPS with vsync off like you reported, but with vsync on it runs at 165 FPS (my monitor's set refresh rate during testing, again).

Isn't the X11/Wayland behavior incorrect? I thought that the iterate callback was paced out by SDL using SDL_DelayPrecise(), controlled by SDL_HINT_MAIN_CALLBACK_RATE, with that at 60 FPS by default? And since your code doesn't change the callback rate, it should be running at 60 FPS, even on my setup, with 165 Hz refresh rate.

nightmareci avatar Oct 22 '24 23:10 nightmareci

I tested it on macos (mbp M2 pro) and got some weird behaviour. I have two screens connected, one 120 hz (the laptop screen) and one 60 hz.

With vsync enabled, it syncs to 60 on the 60hz screen and 120 on the 120hz screen, and it adapts as I move the window between screens. That's nice! It uses around 8-10% cpu for 60hz, but WindowServer uses 30-40% on top of that.

With vsync disabled, without moving between screens with vsync off, it fluctuates between 300 and 1000 fps on either screen, usually hovering around 5-600. It seems a little low for something as simple as this. It uses ~50% cpu.

When I move the window between screens with vsync disabled, it syncs to the screens's refresh rate after being moved to the new screen. Turning vsync on or off has no effect after this. Actually, after having done this it always syncs to 60 on the 60hz screen with vsync disabled even without moving the window after quitting and restarting the test program. On the 120hz screen it still runs unlimited with vsync off.

maia-s avatar Oct 23 '24 14:10 maia-s

I think I've narrowed down what's causing this issue: Iterations aren't being paced out at all, in the generic main callbacks implementation; that line's if-expression I linked to is always true, when any windows exist. So, the only "pacing" we're observing, where the generic implementation is in use, is whatever rate is produced by blocking vsync, so the callback rate hint ends up entirely unused.

I propose that the callback rate be respected and maintained with SDL_DelayPrecise() either when no windows exist or when all windows are using some combination of SDL_WINDOW_SURFACE_VSYNC_DISABLED and SDL_WINDOW_SURFACE_VSYNC_ADAPTIVE. Should there be no pacing at all when the callback rate hint is 0, so without vsync, the user's iteration callback is expected to do the pacing logic? But, with nonzero callback rate, there's some discussion to be had around what pacing behavior should be produced when vsync is in effect, or there are multiple windows on different displays with different refresh rates while vsync is in effect on some/all windows.

It might even make sense to let the user know when to present/not present. E.g., adding a new API returning a bool, maybe even taking a window, if true then "you can present this iteration to get the callback rate requested, because vsync is on right now", to get the correct behavior with a callback rate of 120 with 60 Hz vsync (rendering every iteration, the iteration callback's own code having no timing logic, the callback rate would never be observed); that function would also allow the user to optimize rendering per window, not rendering to and presenting to a window at all when it returns false for a window. Or the functions that normally introduce blocking vsync don't present/block, the iteration callback driver causing blocking presents per window when due, for the sake of getting the callback rate. I'm not familiar with how vsync blocking works with multiple windows each with vsync on in a single threaded program though, do you get a block for each window present, introducing a lot of delay in the iteration callback's thread (60 Hz on two displays with a window on each display, you get 2 / 60 seconds of delay when presenting to both windows in one iteration)?

nightmareci avatar Oct 24 '24 16:10 nightmareci

I don't think we're going to do this; we don't promise any sort of consistency in callback frequency (and even outside of vsync, we assume some platforms will refuse to be consistent for various reasons).

icculus avatar Dec 04 '24 19:12 icculus

I tested it on macos (mbp M2 pro) and got some weird behaviour. I have two screens connected, one 120 hz (the laptop screen) and one 60 hz.

When I move the window between screens with vsync disabled, it syncs to the screens's refresh rate after being moved to the new screen. Turning vsync on or off has no effect after this. Actually, after having done this it always syncs to 60 on the 60hz screen with vsync disabled even without moving the window after quitting and restarting the test program. On the 120hz screen it still runs unlimited with vsync off.

Can you report this as a separate bug? Thanks!

slouken avatar Dec 04 '24 19:12 slouken

Actually, let me reopen this and clarify.

icculus avatar Dec 04 '24 19:12 icculus

So we have the hint (SDL_HINT_MAIN_CALLBACK_RATE) that controls this, but it is ignored if there's a window created...the original use was for things like loopwave.c that are just burning CPU unnecessarily without it.

But this hint should probably be respected if it is explicitly set, window or not.

With a disclaimer: this hint is going to be ignored on some platforms (like Emscripten) where we don't drive the mainloop, and if a platform wants to put it own limits on things (like if macOS decides to slow down a window in the background), we're at its mercy.

So there are two fixes to make: don't ignore the hint if its set, regardless of window, and update the documentation for that hint to warn about its limitations (it can mostly work but it's not a guarantee).

icculus avatar Dec 04 '24 19:12 icculus

That seems fine to me. I've been working with the app callbacks and can get the timing behavior I want on the big three desktop platforms, Windows/macOS/Linux, using a combination of SDL_DelayPrecise() and the fixed timestep algorithm.

I understand that the timing behavior of iterations, considered across all platforms, can't be guaranteed, so you can't have strict one-to-one correspondence of one game logic tick with one iteration, you must check the time and update logic appropriately. So, to meet my own desire for pedantically perfect mapping of inputs to logic ticks, I could match input events to logic ticks based on event arrival time, rather than just applying all events received in an iteration to all logic ticks processed in an iteration, which would only be pedantically correct if one iteration always processes one logic tick.

I think it'd be worthwhile to implement fixed timestep logic in iteration timing. Though maybe don't have delay skipping; if the accumulator exceeds the delay duration, just reset the accumulator. The accumulator can end up getting large when suspending/resuming a device, so it's not acceptable to skip a bunch of iteration delays in such a situation. And besides, the iteration timing isn't supposed to be relied upon for game logic timing. With the timing made more stable with the fixed timestepping applied, adaptive sync setups get slightly more consistent presentation pacing, when the game isn't overloading the system.

nightmareci avatar Dec 04 '24 19:12 nightmareci