Performance issue on setup of SDL3 compared to SDL2
I started a new project to test SDL3 a bit, and the first frame took a long time to load, so I decided to measure the run time of SDL_Init, SDL_CreateWindow, and SDL_CreateRenderer to see which one was too slow:
PS C:\path\SDL3> .\build.bat
PS C:\path\SDL3> .\prog.exe
SDL version : 3, 0, 0
init time : 2
window time : 26
renderer time : 256
Average render time : 26.608 | average FPS : 37
PS C:\path\SDL3> .\build.bat
PS C:\path\SDL3> .\prog.exe
SDL version : 3, 0, 0
init time : 2
window time : 25
renderer time : 251
Average render time : 25.662 | average FPS : 38
The first run has no compiler optimizations and the second run uses -O3.
The code is the following:
#include <chrono>
#include <iostream>
#include <SDL3/SDL.h>
int main(int argc, char *argv[])
{
SDL_version version {};
SDL_GetVersion(&version);
std::cout << "SDL version : " << (int)version.major << ", " << (int)version.minor << ", " << (int)version.patch << std::endl;
auto startTime = std::chrono::system_clock::now();
SDL_Init(SDL_INIT_VIDEO);
std::cout << "init time : " << std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now() - startTime).count() << std::endl;
startTime = std::chrono::system_clock::now();
SDL_Window *window {SDL_CreateWindow("NickelLib", 16 * 70, 9 * 70, 0)};
std::cout << "window time : " << std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now() - startTime).count() << std::endl;
startTime = std::chrono::system_clock::now();
SDL_Renderer *renderer {SDL_CreateRenderer(window, nullptr, SDL_RENDERER_ACCELERATED)};
std::cout << "renderer time : " << std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now() - startTime).count() << std::endl;
// ...
}
And this is how I compile (I use MinGW on Windows 11):
g++ -std=c++23 -I include src/*.cpp -L lib -lmingw32 -lSDL3 -o prog.exe (-O3)
I wrote the same code but with SDL2, and this is the result:
PS C:\path\SDL2> .\build.bat
PS C:\path\SDL2> .\prog.exe
SDL version : 2, 0, 20
init time : 2
window time : 24
renderer time : 69
Average render time : 5.178 | average FPS : 193
PS C:\path\SDL2> .\build.bat
PS C:\path\SDL2> .\prog.exe
SDL version : 2, 0, 20
init time : 2
window time : 25
renderer time : 72
Average render time : 4.411 | average FPS : 226
The code:
#include <chrono>
#include <iostream>
#include <SDL2/SDL.h>
int main(int argc, char *argv[])
{
SDL_version version {};
SDL_GetVersion(&version);
std::cout << "SDL version : " << (int)version.major << ", " << (int)version.minor << ", " << (int)version.patch << std::endl;
auto startTime = std::chrono::system_clock::now();
SDL_Init(SDL_INIT_VIDEO);
std::cout << "init time : " << std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now() - startTime).count() << std::endl;
startTime = std::chrono::system_clock::now();
SDL_Window *window {SDL_CreateWindow("NickelLib", SDL_WINDOWPOS_CENTERED, SDL_WINDOWPOS_CENTERED, 16 * 70, 9 * 70, 0)};
std::cout << "window time : " << std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now() - startTime).count() << std::endl;
startTime = std::chrono::system_clock::now();
SDL_Renderer *renderer {SDL_CreateRenderer(window, -1, SDL_RENDERER_ACCELERATED)};
std::cout << "renderer time : " << std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now() - startTime).count() << std::endl;
// ...
}
The compilation:
g++ -std=c++23 -I include src/*.cpp -L lib -lmingw32 -lSDL2main -lSDL2 -o prog.exe (-O3)
Can you tell us which renderer it landed on for SDL3 and SDL2?
SDL_RendererInfo info;
SDL_GetRendererInfo(renderer,&info);
std::cout << "renderer chosen: " << info.name << std::endl;
(this code snippet should work for SDL2 and SDL3.)
For SDL3, direct3d12, and for SDL2, direct3d.
Sorry to make you keep going back to test this again, but can you force the direct3d renderer in SDL3 and see if the bottleneck vanishes?
// insert anywhere before calling SDL_CreateRenderer
SDL_SetHint(SDL_HINT_RENDER_DRIVER, "direct3d");
(and if you're inclined, try forcing this to "direct3d12" in SDL2 and see if the bottleneck appears.)
Forcing direct3d on SDL3 reduces the SDL_CreateRenderer call to 58ms (a bit faster than SDL2). If I try to force direct3d12 on SDL2, the renderer chosen is still "direct3d" (and my graphics card is an RX 6700XT, so direct3d12 is supported).
I also noticed similar things. I have a sort of crappy Windows PC with only an Intel iGPU, though. The window opens, but it takes a noticeable time until the first frame gets drawn. So it would be either SDL_CreateRenderer or the first SDL_Render* calls (SetRenderDrawColor, RenderClear, RenderPresent) that are slow.
Any chance you can debug this and narrow down which call is slow in the D3D 12 path?
I can't reproduce this here. Has anyone been able to narrow down what is slow on your system?
OK, I think I've found the bottleneck, but because of my lack of knowledge about D3D12, I'm unable to fix it myself. After going through the code and measuring random parts of the D3D12 path, I found that the pipeline state creation loop in D3D12_CreateDeviceResources takes about 100ms to run (measured with SDL_GetTicks) for 1100 different pipeline state infos.
I also found that the vertex buffer creation takes about 200ms to generate 256 vertex buffers.
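For reference, the measurements were taken roughly like this (a minimal sketch; CreateAllPipelineStates is a placeholder, not an actual SDL function):

#include <SDL3/SDL.h>

// Placeholder for the pipeline state creation loop inside
// D3D12_CreateDeviceResources; the real work lives in SDL's D3D12 renderer.
static void CreateAllPipelineStates() { /* ... */ }

int main(int, char **)
{
    const Uint64 start = SDL_GetTicks();
    CreateAllPipelineStates();
    SDL_Log("pipeline creation: %llu ms",
            (unsigned long long)(SDL_GetTicks() - start));
    return 0;
}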
I get that pipelines are fixed in "new" APIs like D3D12 and Vulkan, but 1100 seems a bit excessive for something like SDL3. And wouldn't it be better to just generate one big vertex buffer instead of 256 allocations? Please correct me if I'm wrong.
Yeah, that seems excessive. I haven't looked at this code, so there might be a good reason, but we should probably build these on demand.
OK, so your change fixed most of the bottleneck, but the vertex buffer creation still takes some time (about 150ms), even though the startup time has become barely noticeable.
I don't think we can defer the vertex buffer creation, so I think we're as good as we can get for now. Out of curiosity, what hardware and driver version do you have?
I have an AMD Radeon RX 6700XT (driver version: 22.20.27.09-230330a-390451C) with a Ryzen 7 5800X. I think I had the same bottleneck on my laptop with an RTX 4060, but I have to check.
I don't think we can defer the vertex buffer creation, so I think we're as good as we can get for now
Isn't it possible to just allocate the memory for all of them at once and create the buffers from that, as with Vulkan? Or maybe that is already what's being done, and it's just the vertex buffer creation itself that takes a lot of time?
What's the reason for the D3D12 backend creating 256 individual buffers at startup? That's pretty unusual compared to some of the other render backends I'm familiar with (like Metal's).
The initial creation code also makes them all 64 KB which is a little strange, and it uses CreateCommittedResource which I think can be pretty slow because it creates separate heaps for each individual allocation.
After testing, the pipeline states make no big difference on my laptop, but I found another bottleneck (maybe I should create another issue for it?). The D3D12CreateDeviceFunc call takes an absurd amount of time to execute (more than 600ms).
This device has an RTX 4060 (driver version: 536.23) with a Ryzen 7 7840HS. It also has an integrated GPU (an AMD Radeon 780M on driver version 22.40.03.60-230627a-393573C-Lenovo).
What's the reason for the D3D12 backend creating 256 individual buffers at startup? That's pretty unusual compared to some of the other render backends I'm familiar with (like Metal's).
The initial creation code also makes them all 64 KB which is a little strange, and it uses CreateCommittedResource which I think can be pretty slow because it creates separate heaps for each individual allocation.
That's a good question. I'm leaving this open, and maybe @chalonverse has some insight here?
It's been a while since I wrote this code, and I admittedly am not a D3D12 expert, but here are my thoughts...
Pipeline States
I 100% agree with the idea to not create all the 1000+ pipeline states at the start. I did that initially because I wanted to avoid any potential pipeline creation hitches while rendering, but it totally makes sense that most games are really only going to use a very small number of pipeline states. So the hitches would be minimal and only on initial frames.
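As a rough sketch of what building them on demand could look like (a std::map keyed on the state combination; none of these names or types come from the actual SDL source):

#include <windows.h>
#include <d3d12.h>
#include <map>
#include <tuple>

// Hypothetical key describing one pipeline configuration (shader index, blend
// mode, render-target format, primitive topology). Not an SDL structure.
using PipelineKey = std::tuple<int, int, DXGI_FORMAT, D3D12_PRIMITIVE_TOPOLOGY_TYPE>;

static std::map<PipelineKey, ID3D12PipelineState *> pipelineCache;

// Look up (or lazily create) the pipeline state for the requested combination,
// instead of building all ~1100 combinations up front at device creation.
static ID3D12PipelineState *GetPipelineState(ID3D12Device *device,
                                             const PipelineKey &key,
                                             const D3D12_GRAPHICS_PIPELINE_STATE_DESC &descTemplate)
{
    auto it = pipelineCache.find(key);
    if (it != pipelineCache.end()) {
        return it->second;  // already created on an earlier frame
    }

    D3D12_GRAPHICS_PIPELINE_STATE_DESC desc = descTemplate;
    desc.RTVFormats[0] = std::get<2>(key);
    desc.PrimitiveTopologyType = std::get<3>(key);
    // ... fill in the shader bytecode and blend state for this key ...

    ID3D12PipelineState *pso = nullptr;
    if (FAILED(device->CreateGraphicsPipelineState(&desc, IID_PPV_ARGS(&pso)))) {
        return nullptr;
    }
    pipelineCache[key] = pso;
    return pso;  // the first use of a combination pays the creation hitch
}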
Vertex Buffers
For the vertex buffers, I'm pretty sure part of the reason is that in D3D12, it is not safe to reuse the same vertex buffer on the same frame. I actually think the way it's written currently, it may not be guaranteed to work correctly if you need more than 256. I think if you use batched rendering, though, you will never get anywhere close to the 256 number -- if unbatched isn't officially supported anymore, then maybe you could cut down that number significantly.
I checked the Metal renderer, and yeah, it just creates each of its vertex buffers on demand, but you'll notice there's actually an old FIXME comment questioning whether it's better to make a ring of vertex buffers instead: https://github.com/libsdl-org/SDL/blob/a6374123c7798198855a2aebc8906eb46bbd8a3d/src/render/metal/SDL_render_metal.m#L1325
That's what I tried to do in D3D12, but yeah, it's obviously a trade-off: if you are making a ring buffer at the start, you're going to have a startup cost. Maybe the compromise is to reduce the 256 number, but I'm not sure off the top of my head what a good number is. We could also just test what happens if we create the vertex buffers on demand like in Metal and look at the performance. Or we could make it so it still builds out a ring buffer, but does so on demand instead of all at the start.
Regarding the heap size, I'm pretty sure 64KB is actually the minimum size of a heap: https://learn.microsoft.com/en-us/windows/win32/api/d3d12/ns-d3d12-d3d12_heap_desc. I made them Committed Resources with their own individual heaps instead of Placed Resources because it's simpler to have them be separate heaps, and I wasn't concerned about the cost of creating them since I was doing it mostly at startup. But if we were to instead implement on-demand creation, we would probably want to create placed resources in a shared vertex buffer heap (that we reuse every frame) for performance.
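For illustration, here's roughly the difference between the two approaches (a hedged sketch, not the actual SDL code; sizes, flags, and names are just example values):

#include <windows.h>
#include <d3d12.h>

// Describe a plain upload-heap buffer of the given size.
static D3D12_RESOURCE_DESC BufferDesc(UINT64 size)
{
    D3D12_RESOURCE_DESC desc = {};
    desc.Dimension = D3D12_RESOURCE_DIMENSION_BUFFER;
    desc.Width = size;
    desc.Height = 1;
    desc.DepthOrArraySize = 1;
    desc.MipLevels = 1;
    desc.Format = DXGI_FORMAT_UNKNOWN;
    desc.SampleDesc.Count = 1;
    desc.Layout = D3D12_TEXTURE_LAYOUT_ROW_MAJOR;
    return desc;
}

// Committed resource: the driver creates an implicit heap for each buffer, so
// 256 of these mean 256 separate heap allocations (the current startup path).
static HRESULT CreateCommittedVertexBuffer(ID3D12Device *device, UINT64 size,
                                           ID3D12Resource **outBuffer)
{
    D3D12_HEAP_PROPERTIES heapProps = {};
    heapProps.Type = D3D12_HEAP_TYPE_UPLOAD;

    D3D12_RESOURCE_DESC desc = BufferDesc(size);
    return device->CreateCommittedResource(&heapProps, D3D12_HEAP_FLAG_NONE, &desc,
                                           D3D12_RESOURCE_STATE_GENERIC_READ, nullptr,
                                           IID_PPV_ARGS(outBuffer));
}

// One shared upload heap, created once, that all the placed vertex buffers live in.
static HRESULT CreateSharedVertexHeap(ID3D12Device *device, UINT64 size, ID3D12Heap **outHeap)
{
    D3D12_HEAP_DESC heapDesc = {};
    heapDesc.SizeInBytes = size;
    heapDesc.Properties.Type = D3D12_HEAP_TYPE_UPLOAD;
    heapDesc.Flags = D3D12_HEAP_FLAG_ALLOW_ONLY_BUFFERS;
    return device->CreateHeap(&heapDesc, IID_PPV_ARGS(outHeap));
}

// Placed resource: carved out of the shared heap at a given offset (which must
// be 64 KB aligned), avoiding a heap allocation per buffer.
static HRESULT CreatePlacedVertexBuffer(ID3D12Device *device, ID3D12Heap *sharedHeap,
                                        UINT64 offset, UINT64 size,
                                        ID3D12Resource **outBuffer)
{
    D3D12_RESOURCE_DESC desc = BufferDesc(size);
    return device->CreatePlacedResource(sharedHeap, offset, &desc,
                                        D3D12_RESOURCE_STATE_GENERIC_READ, nullptr,
                                        IID_PPV_ARGS(outBuffer));
}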
The way I would probably do things is something like: have a single vertex buffer for the entire frame. Treat its contents like a big ring buffer, where it'd be recreated with a larger size inside RunCommandQueue if the current offset plus vertex array size doesn't fit (with the old one being destroyed in a later frame instead of immediately, so the GPU doesn't reference deleted data while executing this frame's commands). It'd cycle between 3ish vertex buffers total, with that cycle happening in RenderPresent (to avoid that problem of modifying data that the GPU is currently using).
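Roughly, as a sketch of that scheme (nothing here is actual SDL code; GPUBuffer and the other names are placeholders):

#include <cstddef>
#include <cstring>
#include <vector>

// Stand-in for an API-specific vertex buffer allocation.
struct GPUBuffer {
    std::vector<unsigned char> storage;  // pretend this is GPU-visible memory
};

struct FrameVertexRing {
    static constexpr int kFramesInFlight = 3;   // ~3 buffers, cycled in RenderPresent
    GPUBuffer *buffers[kFramesInFlight] = {};
    std::vector<GPUBuffer *> retired;           // old buffers, freed a few frames later
    int current = 0;                            // buffer this frame writes into
    std::size_t offset = 0;                     // write cursor within the current buffer

    // Copy vertex data into the current frame's buffer, growing it if the data
    // doesn't fit. Returns the offset a draw call would bind. Earlier draws this
    // frame already recorded offsets into the old buffer, so nothing is copied over.
    std::size_t Append(const void *verts, std::size_t size) {
        GPUBuffer *&buf = buffers[current];
        if (!buf) {
            buf = new GPUBuffer{std::vector<unsigned char>(64 * 1024)};
        }
        if (offset + size > buf->storage.size()) {
            retired.push_back(buf);  // destroyed later, once the GPU is done with it
            buf = new GPUBuffer{std::vector<unsigned char>((offset + size) * 2)};
            offset = 0;
        }
        std::memcpy(buf->storage.data() + offset, verts, size);
        const std::size_t bindOffset = offset;
        offset += size;
        return bindOffset;
    }

    // Called from RenderPresent: advance to the next buffer so we never touch
    // data the GPU may still be reading from a previous frame.
    void EndFrame() {
        current = (current + 1) % kFramesInFlight;
        offset = 0;
    }
};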
Metal could use the same technique - but I think buffer creation is pretty fast in general on Metal so doing that sort of optimization hasn't been as high of a priority.
It might even make sense to have a higher level abstraction of that technique that multiple backends can use... but maybe that won't be super useful if SDL_gpu is going to replace several backends anyway.
All that being said... it sounds like figuring out why D3D12CreateDevice is taking a long time might be more worthwhile for now.
Oh, yeah, that sounds like a good idea. It would require a decent amount of additional bookkeeping, but I think conceptually that would work. And you would have to figure out how to "grow" properly on the same frame because hypothetically, if the vertex buffer is 10k verts and you go over, on the same frame you'd have to realloc the next vertex buffer in the ring to be bigger and then also copy over the 10k verts you've already put in there. I wouldn't have time to make this change right now, but I do think it's worth trying and seeing what the perf change is.
As for D3D12CreateDevice taking too long, I'm not really sure what control we'd have over it -- it is mostly just going to be a driver thing, and we don't really do anything crazy with our call, unless requesting a min feature level of 11 is wrong? I would maybe double-check that the SDL_HINT_RENDER_DIRECT3D11_DEBUG hint isn't set, because I'd guess making a debug device would be more expensive than a non-debug one.
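Something like this before creating the renderer would confirm the debug hint isn't enabled (just a quick sketch using the public hint API):

// A debug D3D device is typically much slower to create than a non-debug one.
if (SDL_GetHintBoolean(SDL_HINT_RENDER_DIRECT3D11_DEBUG, SDL_FALSE)) {
    SDL_Log("D3D debug layer hint is enabled; device creation will be slower");
}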
if the vertex buffer is 10k verts and you go over, on the same frame you'd have to realloc the next vertex buffer in the ring to be bigger and then also copy over the 10k verts you've already put in there.
If draws using that buffer have already been added to the D3D command list then nothing special should be needed (ie it doesn't need to mimic realloc by copying what was there before), since new draws using the new buffer won't need to reference old data. I think that will always be the case since the vertex buffer is only ever used in RunCommandQueue - but I haven't implemented the idea for SDL_Render (just other things) so it's likely there are details I haven't thought of. :)