Proposal for parallelizing the image loading
Some image decoding can be slow. PNG is slow, and Crunch is a Russian doll of DXT-compressed data itself compressed with LZMA: we don't decode the DXT data, but we do decode the LZMA, and that isn't fast. Other codecs like WebP or JPEG are likely much faster than PNG, but they aren't cheap either.
With our current architecture, parallelizing the image loading isn't easy. In particular, I guess we cannot parallelize the upload of the images to the GPU.
Our code is very sequential: we parse a shader stage, load an image from the VFS, decompress what has to be decompressed, upload the data, and then jump to the next shader stage.
While working on parallelizing some model code we experimented with OpenMP:
- https://github.com/DaemonEngine/Daemon/pull/1838
That experiment showed that a simple loop that reads input from one array and writes output to another array at the same index is very easy to parallelize, just by adding a pragma; that it works with or without OpenMP support; and that the parallelization is very efficient once the OpenMP support is there.
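As a minimal sketch of that pattern (the function and data here are illustrative, not engine code): one input array, one output array, the same index on both sides, and a single pragma that is simply ignored by compilers built without OpenMP.

```cpp
#include <vector>

// The pattern from the experiment: read input[i], write output[i].
// Without OpenMP support the pragma is a no-op and the loop runs
// sequentially, so the code behaves identically either way.
std::vector<float> SquareAll(const std::vector<float>& input) {
    std::vector<float> output(input.size());
    #pragma omp parallel for
    for (int i = 0; i < (int)input.size(); i++) {
        output[i] = input[i] * input[i];
    }
    return output;
}
```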
So, to pick the low-hanging fruit, we should look for code that matches that pattern, or that can be transformed to match it.
Something we can parallelize is the decoding of images: put all the image names in an array as input, provide an array of memory areas suitable to receive pixmap data as output, write the loop, annotate it with an OpenMP pragma, and that's done.
For now we can't really do that for all images from a shader, or for all images from a map, because we iterate shader stages one by one, and once a shader stage is parsed we don't decode images anymore; it wouldn't be easy to change that.
What looks easier to achieve is to parallelize the loading of all images from a single shader stage. The good news is that we implemented pre-collapsed shaders a long time ago now:
- https://github.com/DaemonEngine/Daemon/pull/230
For now we load all textures from a single stage one by one, as soon as we encounter them.
But then, to be able to configure image upload based on the blend function, which requires configuring image loading after the whole shader stage is parsed, I implemented a prototype of delayed pre-collapsed image loading:
- https://github.com/DaemonEngine/Daemon/pull/1855
This delayed pre-collapsed image loading already works. Every time a texture such as a diffuse map or a normal map is found in a shader stage, its path is added to an array of texture paths, and once the whole shader stage is parsed, all the images are loaded in a loop.
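The mechanism can be sketched like this (all names here are illustrative stand-ins, not the prototype's actual types): record paths while parsing keywords, then load everything in one loop once the stage is fully parsed.

```cpp
#include <string>
#include <vector>

// Hypothetical sketch of delayed pre-collapsed image loading: during
// stage parsing we only record texture paths; loading happens later.
struct ShaderStage {
    std::vector<std::string> texturePaths; // filled while parsing keywords
    std::vector<std::string> loaded;       // stand-in for loaded images
};

// Called when a keyword like diffuseMap or normalMap is parsed.
void RecordTexture(ShaderStage& stage, const std::string& path) {
    stage.texturePaths.push_back(path);
}

// Called once the whole stage is parsed. This is the loop that could
// later be annotated with an OpenMP pragma.
void LoadStageTextures(ShaderStage& stage) {
    for (const std::string& path : stage.texturePaths) {
        stage.loaded.push_back("loaded:" + path); // stand-in for real loading
    }
}
```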
If we split the code decoding the image and the code uploading the image, we can probably split the loop and have a loop decoding the images and have a loop uploading the images, meaning we would be able to parallelize the decoding of the images.
If we can do that, then a pre-collapsed stage with, for example, 5 textures (a diffuse map, a normal map, a height map, a specular map and a glow map) could decode all those 5 textures at once, 1 texture per thread, in parallel, reducing the decoding time from the sum of the 5 decoding times to the longest single decoding time of the 5.
Then we would upload all the decoded images from that stage, sequentially.
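Such a split could look roughly like this (the `Decode` and `Upload` functions are hypothetical stubs standing in for the real decoder and the OpenGL upload): a parallel loop for the pure data work, then a sequential loop for the GPU uploads.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-ins for the real decode and GL-upload steps.
static std::vector<uint8_t> Decode(const std::string& name) {
    return std::vector<uint8_t>(name.size(), 0);
}
static int uploadCount = 0;
static void Upload(const std::vector<uint8_t>&) { uploadCount++; }

void LoadStageImages(const std::vector<std::string>& names) {
    std::vector<std::vector<uint8_t>> pixmaps(names.size());

    // Parallel decode: pure data work, one texture per thread.
    #pragma omp parallel for
    for (int i = 0; i < (int)names.size(); i++) {
        pixmaps[i] = Decode(names[i]);
    }

    // Sequential upload: the OpenGL calls stay on the current thread.
    for (const auto& pixmap : pixmaps) {
        Upload(pixmap);
    }
}
```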
This would get us some gains, but we could get more by delaying image loading to EndRegistration (or the end of the frame, etc.). Then we would get the full advantage of however much parallelism we have, unless/until we saturate the ability to upload images to the GPU.
The biggest complication I see is handling when/whether the shader is invalidated upon image loading failure. For example when a normal map fails to load, does that invalidate the whole shader or just disable normal mapping? I don't know the answer off the top of my head. It might be tricky to make things behave identically in the parallelized case. But the exact semantics of failure probably don't matter that much so we could just change things to make it easier if necessary. Also, a shader registered by the cgame that fails to load currently returns a 0 handle. With delayed parallel loading we would have to return a nonzero handle with the assumption that the images will successfully load. Not really a big deal though.
The OpenMP model doesn't seem like a perfect fit for this task; it's more of an async or thread pool-shaped workflow. I think the ideal pattern is the same one that's used for navmesh generation: main thread handles I/O (+ other work if I/O is done) and the others handle only data work. In this case OpenGL calls are the "I/O". But we could maybe hack it with OpenMP by doing the OpenGL uploading on any thread and just wrap it in a critical section directive. Though I wonder how exceptions work in OpenMP?
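The critical-section hack could be sketched as below, again with hypothetical stubs for the decoder and the GL upload. Whether uploading from a worker thread is even valid depends on the GL context setup, which this sketch ignores; without OpenMP both pragmas are no-ops and the loop degrades to the current sequential behavior.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-ins: Decode is pure data work, Upload represents
// the OpenGL call that must not run concurrently.
static std::vector<uint8_t> Decode(const std::string& name) {
    return std::vector<uint8_t>(name.size(), 0);
}
static int uploads = 0;
static void Upload(const std::vector<uint8_t>&) { uploads++; }

void LoadImagesHack(const std::vector<std::string>& names) {
    #pragma omp parallel for
    for (int i = 0; i < (int)names.size(); i++) {
        std::vector<uint8_t> pixmap = Decode(names[i]);
        // Serialize the "I/O" part: only one thread uploads at a time.
        #pragma omp critical
        Upload(pixmap);
    }
}
```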
I would start by implementing this for .crn only, since legacy map images tend to be low-resolution and the other decoders are probably better optimized.
I assume it wouldn't be that hard to parallelize per-stage images, and probably per-shader images too. For the latter, what is more complex is that the number of images per shader is less predictable than the number per collapsed stage.
As far as I know we do validation and set fallbacks once the shader is fully parsed, so if I'm right missing images wouldn't be a problem.
There are situations where we need to parse the image to configure the shader stage, like the Xonotic height-map detection in the normal map, but this doesn't affect our game. Similarly, in my delayed image loading prototype I haven't implemented delayed loading for the legacy Doom 3-like stages, since they are whole stages by themselves.
The idea is to parallelize the best scenario we recommend, and as far as I know, we don't do detection. And if we need detection for something complex, that can be an exception, allowing us to parallelize the rest.
As a side note, another benefit of delayed loading is that we can flag the image for the missing of alpha channel on the fact there is no blending operation, and skip the existing detection. That delayed image loading has actually three known benefit:
- makes it possible to configure naive/linear blending on a stage and have it applied to the texture even when the blend keyword is written after the texture keyword
- makes it possible to flag an image as RGB instead of RGBA without looking at the actual data, again by knowing there is no blending operation before the texture is actually loaded, whatever the ordering of the shader stage instructions
- maybe makes it possible to parallelize some texture decoding tasks, because that delay allows us to have a nice decoding loop at the end of the stage parsing
Edit: I remember having experienced the need for delayed loading before, so there may be other benefits to list.
> I would start by implementing this for .crn only, since legacy map images tend to be low-resolution and the other decoders are probably better optimized.
Well, if we split LoadMap() into DecodeMap() and UploadMap() that would benefit all formats out of the box.
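The split could look roughly like this. `DecodeMap()` and `UploadMap()` are the names proposed above; the types and bodies are illustrative stand-ins, not the engine's real API.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Sketch of the proposed split: LoadMap() becomes DecodeMap() plus
// UploadMap(), so any format's decoder can run off the main thread
// while the GL upload stays where the context lives.
struct Pixmap {
    int width = 0, height = 0;
    std::vector<uint8_t> pixels;
};

Pixmap DecodeMap(const std::string& name) {
    // Real code would dispatch to the CRN/PNG/WebP/... decoder.
    return { 2, 2, std::vector<uint8_t>(2 * 2 * 4, 0xFF) };
}

static int uploadedTexels = 0;
void UploadMap(const Pixmap& pixmap) {
    // Real code would call glTexImage2D and friends.
    uploadedTexels += pixmap.width * pixmap.height;
}

void LoadMap(const std::string& name) {
    // Old sequential behavior, now expressed as two separable steps.
    UploadMap(DecodeMap(name));
}
```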
> - makes it possible to configure naive/linear blending on a stage and have it applied to the texture even when the blend keyword is written after the texture keyword
Yeah that would be nice. I think even #1721 would benefit from that, although using different images would probably be rare.
> - makes it possible to flag an image as RGB instead of RGBA without looking at the actual data, again by knowing there is no blending operation before the texture is actually loaded, whatever the ordering of the shader stage instructions
This is probably a bad idea since an image can be used in multiple shaders, with different blend functions.
We likely can also parallelize animMap frame loading.
We likely can also parallelize the loading of the six faces of a skybox.