FiberTaskingLib
Huge number of tasks waiting on a single counter.
Continuing to improve my rasterizer, I'm trying to get it to parallelize better and more safely. I create a LOT of "vertex shader" tasks (around 100), which act as producers and store the triangles in a tiled screen-space data structure. This data structure has many "tile" objects, each of them with a concurrent queue for its triangles. When most of the triangle tasks have finished, I launch the pixel shader tasks, one per tile, which take care of the rasterization. I launch one task per tile so it fills the 16 threads on my Ryzen nicely (about 200 tasks).
Before, I would launch said tasks twice: a first launch in the middle of the vertex tasks, and a second launch once all the vertex tasks had finished their work. That architecture had 2 hard sync points, one per tile rendering batch, which don't use the CPU effectively because every thread has to wait for all tasks to finish. This gets me to about 70% CPU usage due to the synchronization, and I wanted to improve that.
To improve that, I decided to launch only 1 task per tile instead of 2, but have the tasks wait on a counter if they empty their triangle queues. The atomic counter is 1 while the vertex tasks are still doing work, and once the vertex tasks end, the atomic is set to 0. The idea is that if a tile render task empties its queue but the vertex shaders still haven't finished (they might add more triangles to the queue), the task just goes into waiting until all the vertex tasks have finished.
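Roughly, each tile task ends up doing something like this. This is only a sketch, assuming the pre-2.0 `ftl::AtomicCounter` / `TaskScheduler::WaitForCounter` API; `Tile`, `Triangle`, `RasterizeTriangle` and the moodycamel queue are placeholders standing in for my own types:

```cpp
#include "concurrentqueue.h"            // moodycamel, standing in for the tile's triangle queue
#include "ftl/atomic_counter.h"
#include "ftl/task_scheduler.h"

struct Triangle { /* ... */ };
struct Tile {
    moodycamel::ConcurrentQueue<Triangle> queue;
    // ... tile bounds, framebuffer region, etc.
};

void RasterizeTriangle(Tile &tile, const Triangle &tri);  // placeholder

// Counter attached to every vertex-shader task via AddTasks; it reaches 0
// only once all vertex tasks have finished.
extern ftl::AtomicCounter *g_vertexCounter;

void TileRasterTask(ftl::TaskScheduler *scheduler, void *arg) {
    Tile *tile = static_cast<Tile *>(arg);
    Triangle tri;

    // First pass: consume whatever the vertex tasks have produced so far.
    while (tile->queue.try_dequeue(tri)) {
        RasterizeTriangle(*tile, tri);
    }

    // Queue ran dry, but vertex tasks may still be producing: park this task
    // until the shared counter hits zero. With ~200 tiles, ~200 fibers can be
    // waiting here at once, which is what overflows NUM_WAITING_FIBER_SLOTS.
    scheduler->WaitForCounter(g_vertexCounter, 0);

    // Second pass: drain anything that arrived while we were waiting.
    while (tile->queue.try_dequeue(tri)) {
        RasterizeTriangle(*tile, tri);
    }
}
```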
As that ends up with around 200 tasks waiting on a single counter, of course it doesn't work.
To make it work I tried increasing NUM_WAITING_FIBER_SLOTS from 4 to 256, and the system actually works now. But while it works, it's a massive hack: it increases memory usage and doesn't give a speed improvement over the naive version before; in fact it goes a bit slower.
This is how a debug profile looks: you can see how the blue tasks start before the pink tasks have finished, and also that the white tasks (tile finish) appear only after the pink tasks end.
Comparing this profile with profiles from before the new system, it looks like there is extra overhead when switching tasks or waiting (but I'm not sure).
Is there a better way to do this?
Not sure if I understood everything, but did you try having one task blocked (waiting) on the vertex_shaders_finished counter, and having that one task spawn all your pixel shader tasks? Otherwise this profile looks close to optimal to me, given that you will have a lull after your vertex tasks, since you have a sync point.
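I mean something along these lines. Just a sketch with the same pre-2.0 ftl API assumptions as above; `TileGrid`, `Tile`, `TileRasterTask` and `g_vertexCounter` are placeholders for your own types:

```cpp
#include <vector>

#include "ftl/atomic_counter.h"
#include "ftl/task_scheduler.h"

struct Tile { /* triangle queue, tile bounds, ... */ };
struct TileGrid { std::vector<Tile> tiles; };

extern ftl::AtomicCounter *g_vertexCounter;         // counter the vertex tasks decrement
void TileRasterTask(ftl::TaskScheduler *, void *);  // your per-tile pixel shader task

// Only this one fiber ever waits on the shared vertex counter; once it wakes,
// it fans out one pixel-shader task per tile and waits on those instead.
void LaunchPixelShaders(ftl::TaskScheduler *scheduler, void *arg) {
    TileGrid *grid = static_cast<TileGrid *>(arg);

    scheduler->WaitForCounter(g_vertexCounter, 0);

    std::vector<ftl::Task> tasks(grid->tiles.size());
    for (size_t i = 0; i < grid->tiles.size(); ++i) {
        tasks[i] = {TileRasterTask, &grid->tiles[i]};
    }

    ftl::AtomicCounter tileCounter(scheduler);
    scheduler->AddTasks(static_cast<unsigned>(tasks.size()), tasks.data(), &tileCounter);
    scheduler->WaitForCounter(&tileCounter, 0);
}
```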
@martty If I wait for the vertex shader tasks to finish before launching the pixel shader tasks, then I have a sync point where all the threads are waiting for the last vertex shader task to finish, and I want to avoid that. In that profile I actually launch the tasks before the vertex shaders are finished (you can see how blue starts before pink finishes).
Then I don't understand :). I was under the impression that you can only start shading after you are done with the vertex shaders? If you don't need a global barrier (e.g. if this restriction is only per tile), then why not launch the corresponding PS task from the VS task?
Closed by mistake.
It could probably work if I add a single atomic counter per tile, so that only 1 task waits per counter (but then I have a lot of counters), instead of 200 tasks waiting on a single counter.
How about making the queuing and syncing data driven?
I assume the output of the vertex shaders is triangles in normalized device coordinate space, and they're bucketed into screenspace tiles. Create a single counter per tile. When inserting triangles into a bucket, if there are enough triangles for a pixel shader wave invocation, launch a task and increment the respective counter. When it finishes, decrement.
When you finish all the vertex shaders, "drain" the tile buckets by launching non-full pixel shader invocations. Increment per tile.
In a for loop, wait on all the tiles.
In this way, the only sync point is at the end of the draw call.
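Concretely, something like this. Again only a sketch under the same pre-2.0 `AtomicCounter` / `AddTask` / `WaitForCounter` API assumptions; `WAVE_SIZE`, `Triangle`, the queue type and `TilePixelShaderTask` are placeholders:

```cpp
#include <cstddef>
#include <memory>
#include <vector>

#include "concurrentqueue.h"            // moodycamel, or any MPMC queue
#include "ftl/atomic_counter.h"
#include "ftl/task_scheduler.h"

struct Triangle { /* ... */ };
constexpr size_t WAVE_SIZE = 256;       // triangles per pixel-shader invocation (arbitrary)

struct Tile {
    moodycamel::ConcurrentQueue<Triangle> queue;
    std::unique_ptr<ftl::AtomicCounter> counter;   // one counter per tile
};

void TilePixelShaderTask(ftl::TaskScheduler *, void *);  // drains up to one wave from a tile

// Called from a vertex-shader task after binning a triangle into a tile.
void PushTriangle(ftl::TaskScheduler *scheduler, Tile &tile, const Triangle &tri) {
    tile.queue.enqueue(tri);

    // Enough triangles queued for a full wave? Kick off a pixel-shader task
    // right away. AddTask increments the tile's counter, and the counter is
    // decremented automatically when the task completes. (size_approx() is
    // approximate; a real version would track exact batch boundaries.)
    if (tile.queue.size_approx() >= WAVE_SIZE) {
        scheduler->AddTask({TilePixelShaderTask, &tile}, tile.counter.get());
    }
}

// Once all vertex-shader tasks have finished:
void DrainAndSync(ftl::TaskScheduler *scheduler, std::vector<Tile> &tiles) {
    // Launch one final, possibly non-full, invocation per tile.
    for (Tile &tile : tiles) {
        scheduler->AddTask({TilePixelShaderTask, &tile}, tile.counter.get());
    }
    // The only sync point: wait on each tile's counter in turn. At most one
    // fiber ever waits on a given counter, so the small waiting-slot count is fine.
    for (Tile &tile : tiles) {
        scheduler->WaitForCounter(tile.counter.get(), 0);
    }
}
```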
Unrelated to the core syncing problem, we could have the counters allocate if too many things are waiting on them, so they aren't hard-limited and don't have to waste tons of space. The value that you set in the macro would just be the size it takes if it doesn't allocate.
Thanks @RichieSams, you are right on the design. Will have a look at it. That sounds interesting.
@cwfitzgerald How would you protect the allocation from multiple thread access?
@RichieSams Personally I've been thinking a lot about atomic stack allocators. They are fairly simple as atomic thingamajigs go, and fit the job well. The common access pattern is to wait for zero, so almost all waiters will clear at the same time.
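Just to illustrate the idea (this is not FTL code, purely a sketch of the concept): slots are handed out with a `fetch_add` on an atomic top index, and because nearly every waiter is released when the counter hits zero, the whole stack can be reset with a single store. A real version would chain to a newly allocated block instead of asserting when it runs out.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

template <typename T, size_t Capacity>
class AtomicStackAllocator {
public:
    // Thread-safe: each waiter grabs a unique slot with a single fetch_add.
    T *Allocate() {
        size_t const index = m_top.fetch_add(1, std::memory_order_relaxed);
        assert(index < Capacity && "out of waiter slots; a real version would allocate a new block here");
        return &m_slots[index];
    }

    // Safe only once every outstanding slot has been released, e.g. right
    // after the counter hit zero and all waiting fibers were resumed.
    void Reset() {
        m_top.store(0, std::memory_order_relaxed);
    }

private:
    std::atomic<size_t> m_top{0};
    T m_slots[Capacity];
};
```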
This should be solved with the new waiting system in v2.0.0