[rlsw] Potential performance improvements of software renderer
I took a look at the rlsw software renderer, did some quick profiling, and I have some questions about the overall design, specifically how it handles colors and texture buffers.
I see that all colors are passed as float arrays, which means the system is doing many conversions for every pixel operation. This is slow.
Why doesn't swTexImage2D convert incoming data to a format compatible with the framebuffers, so that the color data is already in the correct layout and sampling and blitting don't need any more math? All the pixel formats are lerpable as bytes; it seems there would be a significant speedup from caching the textures in a format that requires little to no conversion to the output buffer format.
The OpenGL API may expose colors as floats, but there is no requirement that the internal formats of buffers and images be floats, simply that the API accept floats as input for those functions. It is very common for OpenGL implementations to store data in an implementation-optimized format. Even GL 2+ implementations whose shaders expose colors as floats commonly use 8-bit or 16-bit formats internally so there is no conversion.
I think that performance is getting really hammered by these conversions and they need to be removed as high in the API as possible.
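As a rough sketch of that idea (hypothetical names and formats, not the actual rlsw API), the conversion could happen once at upload time, here assuming a packed RGBA8 framebuffer:

```c
#include <stdint.h>
#include <stdlib.h>

// Pack one float RGBA color (assumed already in [0, 1]) into the
// framebuffer-native 32-bit layout, done once per texel at upload time.
static uint32_t pack_rgba8(float r, float g, float b, float a)
{
    uint32_t R = (uint32_t)(r*255.0f + 0.5f);
    uint32_t G = (uint32_t)(g*255.0f + 0.5f);
    uint32_t B = (uint32_t)(b*255.0f + 0.5f);
    uint32_t A = (uint32_t)(a*255.0f + 0.5f);
    return (A << 24) | (B << 16) | (G << 8) | R;
}

// Hypothetical upload path: float RGBA in, framebuffer-native texels cached out,
// so per-pixel sampling and blitting later need no float<->byte conversion.
static uint32_t *upload_texture_rgba32f(const float *src, int width, int height)
{
    uint32_t *dst = malloc((size_t)width*(size_t)height*sizeof(uint32_t));
    if (!dst) return NULL;
    for (int i = 0; i < width*height; i++)
        dst[i] = pack_rgba8(src[4*i + 0], src[4*i + 1], src[4*i + 2], src[4*i + 3]);
    return dst;
}
```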
@JeffM2501 Thank you very much for the review and the proposed performance improvements. Adding @Bigfoot71, the creator of rlsw, to the discussion.
Yeah, that would definitely make more sense!
Just for some context, this wasn't a conscious technical choice (as well as some other things). I mainly wanted to get something working as quickly as possible and show that it was doable.
I had been inactive on it for a while towards the end, but since everything was working, I confirmed that it could be merged if needed, while noting that there was still work to do.
So there are probably other silly things here and there. I should also go back and check if I commented on anything in the commits or the PR that I've since forgotten.
Anyway, thanks for pointing that out, I will do this as soon as possible and ping the PR here!
@Bigfoot71 Thank you very much for the fast answer! I recently implemented the custom PLATFORM_DESKTOP_WIN32 backend for it and announced the software renderer; it has had quite a wide impact. It was even one of the top stories on Hacker News yesterday: https://news.ycombinator.com/item?id=45661638
Next planned steps are adding support for Web canvas output and PLATFORM_NULL for rendering to a memory buffer.
Ah, stressful! I've already started thinking about all this stuff with formats and conversions.
I think I'll start by simplifying all the framebuffer related code. Jeff's idea of having textures in the same format as the destination actually seems like the best approach since the framebuffer is unique and has a unique format.
I'll open a draft PR soon to get reviews!
@Bigfoot71 Please, take it easy, no stress! 😅 We already have a working version and this new addition is already huge for raylib!
Some discussion has followed on the PR and many improvements have been merged, including SIMD vectorization with considerable performance gains. Multiple minor issues have also been fixed, the code has been simplified, and some visuals improved.
Just got raylib running on Orange Pi RV2, a RISC-V-powered board, using the Mesa llvmpipe software renderer, getting ~300 bunnies @ 30 fps. It may seem low, but previously it was using the Mesa softpipe software renderer and getting ~10 bunnies @ 10 fps.
Note that RISC-V is still a very young CPU architecture and there are not many devices on the market, only 3-4 that can actually run a Linux distro. My device runs a custom Ubuntu-based distro with Wayland windowing, but raylib seems to be running through XWayland because it is compiled for X11; maybe compiling it for Wayland could squeeze out some more frames...
I also tried compiling raylib and the example for PLATFORM_DRM; although it compiled, it crashes with a segfault after initialization, most probably because it links with Mesa llvmpipe and fails on GLSL100 shader compilation. Unfortunately there is no GPU driver available for hardware acceleration (even though the board has Imagination's entry-level IMG BXE-2-32 GPU).
As GLFW does not support a software renderer backend, it'd be nice to have a custom Wayland platform backend for proper testing of the software renderer, but that's a big project and out of scope for now. Still, the software renderer can be tested with the SDL2/3 backend and with the DRM backend.
As an improvement to rlsw, the board supports the RISC-V RVV vector instructions extension, which allows up to 64 parallel floating-point operations in a single cycle! I'll try to add support for it to rlsw while testing! 😱
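As a point of reference, a minimal sketch of what an RVV-accelerated row fill could look like, assuming a toolchain shipping the v1.0 `<riscv_vector.h>` intrinsics (names here are illustrative, not rlsw code):

```c
#include <stddef.h>
#include <stdint.h>
#include <riscv_vector.h>

// Fill a row of 32-bit pixels using a strip-mined RVV loop: the hardware
// reports how many lanes it can process each pass via vsetvl.
static void fill_row_rvv(uint32_t *dst, uint32_t color, size_t count)
{
    while (count > 0) {
        size_t vl = __riscv_vsetvl_e32m8(count);          // lanes available this pass
        vuint32m8_t v = __riscv_vmv_v_x_u32m8(color, vl); // broadcast the color
        __riscv_vse32_v_u32m8(dst, v, vl);                // store vl pixels
        dst += vl;
        count -= vl;
    }
}
```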
Keep working on it!
Hey! Good and bad news!
- Good news: I got raylib with the software renderer running on `PLATFORM_DRM` on Orange Pi RV2, in headless mode!
- Bad news: Performance is quite bad: `0-10 bunnies @ 10 fps`, `~80 bunnies @ 4 fps`
- Good news: Despite the bad performance, it's almost the same I got with the `Mesa softpipe` implementation on desktop, and I couldn't run Mesa on `PLATFORM_DRM`.
- Better news: The current implementation is not using the RVV RISC-V vector extensions, which on the Orange Pi RV2 implementation can do up to 64 floating-point operations in parallel, so performance can be improved (if I manage to get it working 😄 )
Great!
> Bad news: Performance is quite bad: `0-10 bunnies @ 10 fps`, `~80 bunnies @ 4 fps`
Quick question, was RLSW running at full screen resolution or still 800x450?
Because with DRM, even in the best-case scenario, 800x450 resolution with a fullscreen blit, it still comes at a non-negligible cost...
Also, clearing is very expensive; in fact, in most cases it's the most costly operation.
I'm thinking about a solution that would allow for a lazy clear or something similar, but in the current state, we would end up clearing everything anyway. I've considered other "lazy clear style" solutions, but they're not trivial to implement and require testing, so it will take me some time.
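For illustration only, one possible shape of such a tile-based lazy clear (a sketch, not what rlsw does): the clear call just marks tiles, and a tile is actually filled the first time the rasterizer touches it; untouched tiles still need resolving at present time, which is exactly the catch mentioned above.

```c
#include <stdint.h>
#include <string.h>

#define TILE 64

typedef struct {
    uint32_t *pixels;        // full framebuffer
    int width, height;
    int tilesX, tilesY;
    uint8_t *pendingClear;   // 1 = tile not yet filled with clearColor this frame
    uint32_t clearColor;
} TiledTarget;

// The clear entry point becomes O(tiles) instead of O(pixels).
static void lazy_clear(TiledTarget *t, uint32_t color)
{
    t->clearColor = color;
    memset(t->pendingClear, 1, (size_t)t->tilesX*t->tilesY);
}

// Called before rasterizing into a tile: fills it with the clear color on first touch only.
static void resolve_tile(TiledTarget *t, int tx, int ty)
{
    if (!t->pendingClear[ty*t->tilesX + tx]) return;
    t->pendingClear[ty*t->tilesX + tx] = 0;

    int x0 = tx*TILE, y0 = ty*TILE;
    int x1 = (x0 + TILE < t->width) ? x0 + TILE : t->width;
    int y1 = (y0 + TILE < t->height) ? y0 + TILE : t->height;
    for (int y = y0; y < y1; y++)
        for (int x = x0; x < x1; x++)
            t->pixels[y*t->width + x] = t->clearColor;
}
```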
> Quick question, was RLSW running at full screen resolution or still 800x450?
@Bigfoot71 It is running in kernel mode, so full screen, no windowing system, but as per my understanding the display takes the resolution of the 800x450 framebuffer... not sure if there is any actual scaling in that mode...
About the "lazy clear" solution, I think it can increase code complexity, there could be other routes to explore before that, RVV vector instructions seems really promizing.
- Good news: I noticed SSE was not detected on MSVC and reviewed the defines to get SSE2 working on x64 (probably other defines can also be reviewed). The improvements are considerable: `~2300 bunnies @ 30 fps`; without SIMD I got `~1000 bunnies @ 30 fps` (testing on laptop battery; if connected to power, numbers increase to `~3600 bunnies @ 30 fps`).
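For context on the detection issue (illustrative only, not the actual raylib/rlsw defines): MSVC does not define `__SSE2__`, so x64 and `/arch:SSE2` builds have to be recognized through MSVC's own macros, roughly like this:

```c
// SSE2 availability check covering both GCC/Clang and MSVC.
// On MSVC, x64 always has SSE2 (_M_X64), and 32-bit signals it via _M_IX86_FP >= 2.
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && (_M_IX86_FP >= 2))
    #define SW_HAS_SSE2 1
    #include <emmintrin.h>
#else
    #define SW_HAS_SSE2 0
#endif
```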
Oh, great!
And yes, if I remember correctly, when I tested DRM on my PC it was slow too, but I don't recall if there was any upscaling, I'd have to double-check.
Otherwise, I don't know how much more efficient explicit RVV can be in the context of clearing. I've tried several SIMD approaches for clearing, but it was never more efficient than what the compiler did in O3 with a simple loop...
But since you plan to use RVV, the smartest move would probably be to switch to barycentric rasterization, which would let us process multiple pixels at once easily.
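For reference, a minimal scalar skeleton of the barycentric/edge-function approach (illustrative only): every pixel in the bounding box is evaluated independently through three edge functions, which is what makes stepping several pixels at a time with SIMD straightforward later.

```c
#include <stdint.h>
#include <math.h>

typedef struct { float x, y; } Vec2;

// Signed edge function: same-sign results for all three edges mean "inside".
static float edge(Vec2 a, Vec2 b, Vec2 p)
{
    return (p.x - a.x)*(b.y - a.y) - (p.y - a.y)*(b.x - a.x);
}

static void raster_triangle(Vec2 v0, Vec2 v1, Vec2 v2,
                            uint32_t *fb, int fbWidth, int fbHeight, uint32_t color)
{
    // Clipped integer bounding box of the triangle.
    int minX = (int)fminf(fminf(v0.x, v1.x), v2.x); if (minX < 0) minX = 0;
    int minY = (int)fminf(fminf(v0.y, v1.y), v2.y); if (minY < 0) minY = 0;
    int maxX = (int)fmaxf(fmaxf(v0.x, v1.x), v2.x); if (maxX > fbWidth - 1)  maxX = fbWidth - 1;
    int maxY = (int)fmaxf(fmaxf(v0.y, v1.y), v2.y); if (maxY > fbHeight - 1) maxY = fbHeight - 1;

    for (int y = minY; y <= maxY; y++) {
        for (int x = minX; x <= maxX; x++) {
            Vec2 p = { x + 0.5f, y + 0.5f };
            float w0 = edge(v1, v2, p);
            float w1 = edge(v2, v0, p);
            float w2 = edge(v0, v1, p);
            // Winding-agnostic inside test; normalized w0..w2 are the barycentric
            // weights used to interpolate color/UV/depth per pixel.
            if ((w0 >= 0 && w1 >= 0 && w2 >= 0) || (w0 <= 0 && w1 <= 0 && w2 <= 0))
                fb[y*fbWidth + x] = color;
        }
    }
}
```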
The main issue with that is the "single header" constraint; using macros to generate specialized functions is the only way I've found to keep it reasonably contained.
But if we want SIMD-specific versions per architecture, that macro approach quickly becomes a nightmare.
I considered defining the rasterization functions in a file full of #ifdef and including it multiple times with the right predefinitions, but then it's no longer a single-header setup.
There might be a trick using a self-include, but I imagine it would get unreadable pretty quickly.
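To make that trade-off concrete, here is roughly what the self-include trick looks like (a generic sketch with made-up names, not rlsw code). It keeps everything in one file by re-including it once per pipeline permutation, though `#include __FILE__` behavior should be double-checked across toolchains:

```c
/* First pass: define each permutation and re-include this same file;
   every re-include emits one specialized, fully inlined function. */
#ifndef SW_RASTER_DISPATCH_PASS
#define SW_RASTER_DISPATCH_PASS

#define SW_FUNC_NAME      sw_raster_flat
#define SW_ENABLE_TEXTURE 0
#include __FILE__
#undef  SW_FUNC_NAME
#undef  SW_ENABLE_TEXTURE

#define SW_FUNC_NAME      sw_raster_textured
#define SW_ENABLE_TEXTURE 1
#include __FILE__
#undef  SW_FUNC_NAME
#undef  SW_ENABLE_TEXTURE

#else /* Specialization pass: emitted once per re-include, branches resolved at compile time. */

static void SW_FUNC_NAME(int count)
{
    for (int i = 0; i < count; i++) {
#if SW_ENABLE_TEXTURE
        /* sample the texture here */
#endif
        /* shade and write the pixel here */
    }
}

#endif /* SW_RASTER_DISPATCH_PASS */
```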
In C++ this would be trivial with templates, but I assume adding C++ to raylib is off the table.
We could also use function pointer tables for each pipeline state, but most of our current performance comes from heavy inlining, so we'd likely lose efficiency.
Out of desperation, I even tried passing explicit true/false parameters for each pipeline option, but even with the most aggressive optimization flags, it still produced runtime branches.
If anyone has a better idea than the current approach in a single header, I will bow very low!