
multithread the MD5/IQM model CPU code using OpenMP

illwieckz opened this issue 5 months ago • 35 comments

  • Add Omp facilities in framework
  • Use OpenMP to multithread the MD5/IQM model CPU code

For now it is disabled by default, one should use the -DUSE_OPENMP=ON cmake option to enable it.

In the future I plan to progressively enable it:

  • Expected: Enable it on Linux with GCC. We would have to modify the release validation script to accept that the executable depends on libgomp.so. The libgomp.so library is as standard as glibc, so it's fine.
  • Probable: Enable it on Windows with MinGW. We would have to modify the release validation script to accept that the executable depends on libgomp.dll, and modify the release build script to package libgomp.dll. The libgomp.dll is provided by MSYS2, so it's fine.

I don't plan to enable it on macOS, as I've heard that macOS doesn't ship LLVM's libomp by default.

Such enablement will be done on later PRs.

The purpose of adding OpenMP facilities is to make it optional: operations are sped up with it, but the same operations should work without it.

It then implements parallelization of the MD5 and IQM CPU code.

This was investigated on:

  • https://github.com/DaemonEngine/Daemon/pull/1833
  • https://github.com/DaemonEngine/Daemon/pull/1837

It uses a chunked implementation, as tests demonstrated it was the fastest one.

Using a beefy computer with 16 threads enabled, I got this performance difference with the chunked implementation on the same heavy scene:

  • Before: 91 fps
  • After: 438 fps
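To illustrate what a chunked dispatch of per-vertex work looks like, here is a minimal sketch. The names (Vertex, TransformVertex, TransformChunked) are illustrative, not the engine's actual code; without OpenMP the pragma is ignored and the chunks simply run sequentially:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical vertex type standing in for the engine's skinned vertex data.
struct Vertex { float x, y, z; };

static void TransformVertex( Vertex& v ) {
	// Stand-in for the real per-vertex skinning work.
	v.x += 1.0f; v.y += 1.0f; v.z += 1.0f;
}

// Chunked dispatch: the vertex range is split into numThreads contiguous
// chunks and each chunk is processed by one OpenMP thread.
static void TransformChunked( std::vector<Vertex>& vertexes, int numThreads ) {
	const size_t n = vertexes.size();
	const size_t chunkSize = ( n + numThreads - 1 ) / numThreads;

	#pragma omp parallel for
	for ( int c = 0; c < numThreads; c++ ) {
		const size_t begin = c * chunkSize;
		const size_t end = std::min( n, begin + chunkSize );
		for ( size_t i = begin; i < end; i++ ) {
			TransformVertex( vertexes[ i ] );
		}
	}
}
```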

Of course, the performance difference is expected to be smaller on older CPUs, which usually run alongside older GPUs whose limitations force this CPU codepath, but it is now demonstrated that such parallelization scales well. This can move some devices from the slow category to the playable one, or from the playable category to an even better one.

A good way to test this is to follow these instructions:

/set r_vboModels off
/devmap plat23
/team h; class rifle; delay 1s setviewpos 1920 1920 20 0 0

This will spawn the human player and move it to the alien base entrance, where all the IQM buildable models from the alien base will be rendered because they are in vis, with at least two animated IQM acid tubes in direct sight, plus the MD5 first-person rifle in the foreground. From there, one can also shoot the acid tubes and empty the rifle magazine to play additional animations: the acid tube death and the rifle first-person shoot and reload.

One can test various numbers of threads this way:

/set common.ompThreads 4

The default value of 0 lets the engine pick a number of threads by itself; other values enforce that thread count.

illwieckz avatar Oct 02 '25 00:10 illwieckz

For now there is some code guarded by a NO_MT_IF_NO_TBNTOQ define. This is because that code doesn't use R_TBNtoQtangents(), and I want to test whether the code not using R_TBNtoQtangents() is slow enough to benefit from the parallelism (it looks like it is, but I will test more).

illwieckz avatar Oct 02 '25 00:10 illwieckz

I tested 2-thread, 8-thread and 32-thread machines. On the 2-thread and 8-thread machines, maxing out the threads gave more performance, while on the 32-thread machine the performance went up when adding threads up to 16, then went down beyond 16 threads, so I capped the automatic thread detection at 16. I assume that beyond 16 threads the thread management becomes too costly and destroys the benefit of dispatching the work. The cvar range allows up to 32 threads for those wanting to experiment.
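The selection logic described above could look like the following sketch. The cap of 16 and the cvar semantics (0 means automatic) come from this discussion; the function name PickOmpThreads is a hypothetical helper, not the engine's actual code:

```cpp
#include <algorithm>
#include <thread>

// Sketch of the automatic thread-count heuristic: an explicit cvar value
// (common.ompThreads) enforces the thread count, while 0 means "pick
// automatically", capped at 16 since more threads measured slower on a
// 32-thread machine.
static int PickOmpThreads( int cvarValue ) {
	if ( cvarValue > 0 ) {
		return cvarValue;
	}
	int hw = int( std::thread::hardware_concurrency() );
	if ( hw <= 0 ) {
		hw = 1; // hardware_concurrency() may return 0 when unknown.
	}
	return std::min( hw, 16 );
}
```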

illwieckz avatar Oct 02 '25 01:10 illwieckz

I tested 2-thread, 8-thread and 32-thread machines. On the 2-thread and 8-thread machines, maxing out the threads gave more performance.

Hmm, no, the 8-thread machine performs better with 6 threads. I'll add a more complex heuristic then.

illwieckz avatar Oct 02 '25 01:10 illwieckz

For now there is some code guarded by a NO_MT_IF_NO_TBNTOQ define. This is because that code doesn't use R_TBNtoQtangents(), and I want to test whether the code not using R_TBNtoQtangents() is slow enough to benefit from the parallelism (it looks like it is, but I will test more).

On a machine with 8 cores, only running the game so the framerate is more stable, I get 85 fps with the parallel code for that part and 80 fps with the legacy code. That confirms what I observed on my main machine (430 fps vs 410 fps, where the framerate was much more unstable due to other applications running, so there was room for doubt). The win isn't that big on that part, but it's measurable.

I'll drop the legacy sequential code for that part as well.

illwieckz avatar Oct 02 '25 01:10 illwieckz

On a machine with 8 cores, only running the game so the framerate is more stable, I get 85 fps with the parallel code for that part and 80 fps with the legacy code. That confirms what I observed on my main machine (430 fps vs 410 fps, where the framerate was much more unstable due to other applications running, so there was room for doubt). The win isn't that big on that part, but it's measurable.

I'll drop the legacy sequential code for that part as well.

Well, no, I still had a doubt, so I used a HUD to draw a framerate curve and added a cvar to switch between the two code paths, and it doesn't change anything. The problem is probably that I don't run that code at all.

And I added a logger; it never prints anything. Anyway, that code isn't as heavy as R_TBNtoQtangents() but isn't cheap either, so I'll probably keep the parallelized version.

illwieckz avatar Oct 02 '25 02:10 illwieckz

I noticed something very interesting on that 8-core machine, which is a laptop. By default, without the threading, it does 65 fps, but the CPU isn't maxing out its temperature. The moment I enable the threading, the performance jumps to 140 fps, but then the temperature maxes out and the performance slowly decreases until it settles at 85 fps, where it stays (and the temperature is no longer maxed).

illwieckz avatar Oct 02 '25 02:10 illwieckz

Using that same 8-core laptop with the powersave governor, to make sure the CPU doesn't throttle due to temperature (it's already at the lowest frequency anyway): enabling the parallelism switches from 1 thread to 6 threads and the performance jumps from a stable 16 fps to a stable 40 fps. That's exactly a 2.5× boost, which is good! And the temperature remains the same.

illwieckz avatar Oct 02 '25 02:10 illwieckz

While I was at it, I parallelized some parts of the MD3, MD5 and IQM loading code as well. The parallelization of the MD3 loading code is a bit noisy in the diff because, unlike the MD5 and IQM code that I cleaned up a long time ago, the MD3 code was full of reused “global to functions” variables that would create race conditions once the code is parallelized.

illwieckz avatar Oct 02 '25 05:10 illwieckz

Now that I think about it, it's probably possible to template the chunking as well.

illwieckz avatar Oct 02 '25 16:10 illwieckz

The latest version just uses OMP as a basic thread pool. You can find various simple thread pool implementations that are just a couple hundred lines of code, so I will try hooking the code up to one of those to see if we can drop the dependency. I bet the problem with your library-free chunked implementation was just that it spent too much time creating and destroying threads, which a thread pool solves.

slipher avatar Oct 02 '25 19:10 slipher

On the other hand, if you did take the OMP dependency, I imagine you would get somewhat better results with less code, by just putting the #pragma omp parallel for directly on the loops you want to parallelize, instead of dividing it into chunks. That should be more efficient than the original STL-style foreach because it presumably avoids function pointer overhead.
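What putting the pragma directly on a loop looks like, as a minimal sketch: no lambda, no manual chunking, OpenMP splits the iteration space itself. The function and loop body here are stand-ins for the real per-vertex work, not engine code:

```cpp
#include <vector>

// OpenMP divides the iterations among its threads; without OpenMP the
// pragma is ignored and the same loop runs sequentially, so one code path
// serves both builds.
static void ScaleAll( std::vector<float>& values, float factor ) {
	#pragma omp parallel for
	for ( int i = 0; i < int( values.size() ); i++ ) {
		values[ i ] *= factor;
	}
}
```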

slipher avatar Oct 02 '25 19:10 slipher

I bet the problem with your non-library-using chunked implementation was just that it spends too much time creating and destroying threads, which is solved by a thread pool.

Yes, that was a very bad solution for various reasons (along with the fact that not reusing the threads makes profiling a nightmare).

On the other hand, if you did take the OMP dependency, I imagine you would get somewhat better results with less code, by just putting the #pragma omp parallel for directly on the loops you want to parallelize, instead of dividing it into chunks.

I may indeed not have tested the non-chunked code with the pragma yet. I don't know what magic OMP does behind the scenes when adding such a pragma, but I found it interesting that using the pragma on the chunked version was faster than the GNU parallel foreach call, which did more guessing than the pragma on the for loop.

illwieckz avatar Oct 02 '25 20:10 illwieckz

by just putting the #pragma omp parallel for directly on the loops you want to parallelize, instead of dividing it into chunks.

We can't exclude that, when chunked, the compiler may find other optimizations, like fusing some iterations with SIMD calls. Though regarding the IQM code, it's already full of vectors and we only require SSE2, so it probably cannot fuse iterations.

illwieckz avatar Oct 02 '25 20:10 illwieckz

The latest version just uses OMP as a basic thread pool. You can find various simple thread pool implementations that are just a couple hundred lines of code, so I will try hooking up the code to one of those to see if we can drop the dependency.

That's welcome, but if for some reason other implementations don't perform as well as OMP, libgomp isn't really an annoying dependency on either Linux or MSYS2.

illwieckz avatar Oct 02 '25 20:10 illwieckz

I may have not tested yet the non-chunked code with the pragma indeed.

So I just somewhat tested it by using the number of vertices as the number of chunks, i.e. a chunk size of 1.

I have a hard time seeing a difference on my 16-thread workstation, but that's because I have other things running alongside; both versions (chunked or not) currently run at 400–410 fps.

On my 8-thread laptop I see a small difference: the chunked implementation sometimes tops at 183 fps, while the non-chunked one doesn't go higher than 179 fps. I reproduced this multiple times.

illwieckz avatar Oct 02 '25 20:10 illwieckz

I unchunked the code.

If we want to investigate chunking, we can do it later, and if we do, we should do it in the template instead.

illwieckz avatar Oct 02 '25 21:10 illwieckz

Doing that simplified the code, and I finally topped at 182 fps on the 8-thread laptop with the unchunked code, so I guess we don't have to care about chunking.

illwieckz avatar Oct 02 '25 21:10 illwieckz

I tried the compiler's built-in loop parallelization with #pragma omp parallel for -- see the slipher/omp-for branch. This gives me a measurable performance boost over the lambda-based dispatch.

slipher avatar Oct 02 '25 23:10 slipher

Also, I removed the load-time OMP commits from that branch. I don't think that's worthwhile because ~97% of model loading time is spent on textures; the vertex data is hardly worth optimizing. We should avoid incurring the costs of OMP when we are not actually going to use it: if we have fully GPU-based vertex skinning, we shouldn't start up the threads, and we shouldn't link OMP into the server, which doesn't use it.

slipher avatar Oct 02 '25 23:10 slipher

I decided not to bother trying the thread pool since the pragma-based approach with OMP actually seems the least intrusive: that way, there are no lambdas which would make the single-threaded version less efficient. And the amount of extra code is minimal. Also MSVC supposedly implements OMP, so I will try that later.

slipher avatar Oct 02 '25 23:10 slipher

Ah yes, since I don't chunk anymore, we don't need a lambda anymore either.

Though, you're not setting the thread count before running the loop, and I noticed that when not setting it right before running the loop, the number of threads being used is unpredictable.

illwieckz avatar Oct 02 '25 23:10 illwieckz

Though, you're not setting the thread count before running the loop, and I noticed that when not setting it right before running the loop, the number of threads being used is unpredictable.

Changing the number of threads at runtime wouldn't work yet on my branch, but Omp::Init called on startup does set the number of threads, so it should work fine as long as you don't toggle the cvars.

slipher avatar Oct 02 '25 23:10 slipher

I don't know if that's related, but with r_smp it's unpredictable. Even setting it at the start of each frame isn't enough.

illwieckz avatar Oct 03 '25 00:10 illwieckz

How are you determining that "the amount of threads being used is unpredictable"? It makes sense that turning on r_smp would throw off timing measurements by having another thread unpredictably running at the same time. So don't do that!

slipher avatar Oct 03 '25 00:10 slipher

By printing the output of omp_get_num_threads() and also by looking at the number of busy threads in htop.

In my previous experiments I got very weird things, like omp_get_num_threads() returning 2 when I had set 16, etc.
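One possible explanation for such readings: per the OpenMP specification, omp_get_num_threads() reports the team size of the current parallel region and returns 1 outside any parallel region, so where it is called matters. A minimal sketch, with an illustrative function name and an _OPENMP guard so it also builds when OpenMP is disabled:

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// Reads the team size from inside a parallel region, where the value is
// meaningful; calling omp_get_num_threads() outside a region yields 1.
static int ThreadsObservedInParallelRegion() {
#ifdef _OPENMP
	int observed = 1;
	#pragma omp parallel
	{
		#pragma omp single
		observed = omp_get_num_threads(); // team size, only valid here
	}
	return observed;
#else
	return 1; // No OpenMP: everything is single-threaded.
#endif
}
```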

illwieckz avatar Oct 03 '25 00:10 illwieckz

I removed the if (BUILD_SERVER) etc. from the CMake file. I also removed the if (NOT BUILD_CGAME) etc., because I guess it would prevent using OpenMP when building native games in the same CMake build as the engine, since we don't build native games as subprojects.

illwieckz avatar Oct 03 '25 00:10 illwieckz

I also removed the commits parallelizing loading stuff, that can be discussed later.

illwieckz avatar Oct 03 '25 00:10 illwieckz

On the 8-thread laptop I now top at 185 fps, the frametime curve is much smoother, the throttling starts later, and the framerate declines more slowly when throttling kicks in (it keeps the higher framerates much longer).

illwieckz avatar Oct 03 '25 00:10 illwieckz

I decided not to bother trying the thread pool since the pragma-based approach with OMP actually seems the least intrusive.

Yes, if we can use OMP that would be very good: it's very easy to integrate into our code, and the code builds without problems when OMP is missing.

illwieckz avatar Oct 03 '25 00:10 illwieckz

Just as a test, I commented out the EnlistThreads() calls; the engine then spawns 32 threads and the framerate is 1 fps.

illwieckz avatar Oct 03 '25 00:10 illwieckz