
Material Compiler Optimizations


Description

Use a greedy algorithm to determine register allocation and recycling.

Description of the related problem

Right now the register allocation has remnants of the old approach of allocating registers from different heaps.

~~It would be best if we reversed the order of the traversal passes: first textures, then bumpmaps, then remainder and pdf.~~ Forget what I said, the whole system requires that the last remainder_and_pdf instruction stores its output in register 0.

The main problem is that we decide to emit an instruction during a post-order traversal of the IR. Such a simple heuristic can lead to pathological register usage (think of the different ways we could codegen a mix of K BxDFs translated into a binary tree of depth K-1).
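
Below is a minimal sketch of what the greedy allocate-and-recycle scheme could look like during post-order emission; the `IRNode` and allocator types are made up for illustration and are not the actual material compiler classes.

```cpp
// Minimal sketch of greedy register allocation with recycling, assuming a
// hypothetical IR node type and post-order instruction emission order.
#include <cstdint>
#include <vector>

struct IRNode
{
    std::vector<IRNode*> children;
    uint32_t dstRegister = ~0u; // assigned during emission
};

class GreedyRegisterAllocator
{
    public:
        uint32_t allocate()
        {
            if (!freeList.empty())
            {
                const uint32_t reg = freeList.back();
                freeList.pop_back();
                return reg;
            }
            return nextRegister++; // grow the register file only when recycling fails
        }

        void release(uint32_t reg) { freeList.push_back(reg); }

        uint32_t peakUsage() const { return nextRegister; }

    private:
        std::vector<uint32_t> freeList;
        uint32_t nextRegister = 0u;
};

// Post-order emission: children are evaluated first, then their registers are
// recycled for the parent's result, so register pressure tracks the live set
// rather than the total instruction count.
void emitPostOrder(IRNode* node, GreedyRegisterAllocator& alloc)
{
    for (IRNode* child : node->children)
        emitPostOrder(child,alloc);
    for (IRNode* child : node->children)
        alloc.release(child->dstRegister); // a child is dead once the parent consumes it
    node->dstRegister = alloc.allocate();
    // ... encode the actual instruction referencing node->dstRegister here ...
}
```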

Sidequest 1

Make sure that derivative map textures get fetched as part of the texture prefetch stream.

Sidequest 2

Make sure identical textures only get fetched once (deduplication of IR texture fetch nodes).

But also make sure derivative map fetches don't stay around (fetch them last, and replace their registers with precomputed normal data), but only if the same texture isn't used for something else.
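
A rough sketch of how the fetch-node dedup could work; the node layout and key are assumptions, not the real IR.

```cpp
// Deduplicate IR texture fetch nodes, assuming a fetch is identified by its
// (texture, sampler, UV source) triple. Type names are illustrative only.
#include <cstdint>
#include <map>
#include <tuple>

struct TextureFetchNode
{
    uint64_t textureHandle;
    uint64_t samplerHandle;
    uint32_t uvSourceRegister;
    bool derivativeMapOnly = false; // true if the only consumer is the normal precomputation
};

using FetchKey = std::tuple<uint64_t,uint64_t,uint32_t>;

// Returns the canonical node for each distinct fetch; repeated fetches of the
// same texture collapse onto the already-emitted prefetch. `derivativeMapOnly`
// survives only if every use of the texture is a derivative map fetch, which is
// the condition for dropping the fetch result after the normal is precomputed.
TextureFetchNode* deduplicate(TextureFetchNode* fetch, std::map<FetchKey,TextureFetchNode*>& canonical)
{
    const FetchKey key{fetch->textureHandle,fetch->samplerHandle,fetch->uvSourceRegister};
    auto [it,inserted] = canonical.insert({key,fetch});
    if (!inserted)
        it->second->derivativeMapOnly = it->second->derivativeMapOnly && fetch->derivativeMapOnly;
    return it->second;
}
```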

Solution proposal

Remember that "registers" consumed by texture and derivative map prefetch need to stay persistent (because fetching the texture data is amortized over multiple raygen samples per pixel).

Additional context

It would probably be good to implement "common subexpression" elimination first, while translating the IR to the canonical 2-tree form.

It would be really nice to deduplicate all instruction streams' parameters.
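
For illustration, the CSE could be done by hash-consing nodes while building the canonical form, so identical subtrees (and the parameter blocks they reference) get interned exactly once. The types below are placeholders, not the real IR.

```cpp
// Hash-consing sketch: intern each canonical node; because children are
// interned before parents, structurally identical subexpressions collapse to
// a single index in the DAG and get emitted only once.
#include <cstdint>
#include <unordered_map>
#include <vector>

struct CanonicalNode
{
    uint32_t opcode;
    uint32_t paramBlockOffset;        // index into the deduplicated parameter stream
    uint32_t left = ~0u, right = ~0u; // children, already interned indices

    bool operator==(const CanonicalNode& other) const
    {
        return opcode==other.opcode && paramBlockOffset==other.paramBlockOffset &&
               left==other.left && right==other.right;
    }
};

struct NodeHash
{
    size_t operator()(const CanonicalNode& n) const
    {
        size_t h = n.opcode;
        h = h*0x9e3779b97f4a7c15ull ^ n.paramBlockOffset;
        h = h*0x9e3779b97f4a7c15ull ^ n.left;
        h = h*0x9e3779b97f4a7c15ull ^ n.right;
        return h;
    }
};

uint32_t intern(const CanonicalNode& node,
                std::unordered_map<CanonicalNode,uint32_t,NodeHash>& cache,
                std::vector<CanonicalNode>& dag)
{
    auto [it,inserted] = cache.insert({node,static_cast<uint32_t>(dag.size())});
    if (inserted)
        dag.push_back(node);
    return it->second;
}
```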

https://docs.google.com/presentation/d/1jT97cDCiIu9_AiDxKjlqHHC2n0nlHAev7RVt5Ncb9uY/edit?usp=sharing

Open Question

If the tree is unbalanced, we should make sure to either:

  • DFS post-order traverse, but first reorder the children such that the child with the "largest" subtree is "leftmost" in the child array (pushed first onto the DFS stack).
  • Adaptively DFS traverse and visit the children with the "largest" subtrees first.

How to measure "largest" (see the sketch after this list):

  • total nodes?
  • leaf nodes?
  • depth?
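
For example, a Sethi-Ullman style register need would make the "largest first" ordering minimize register pressure directly. This is just a sketch with hypothetical node types; total node count, leaf count or depth could be swapped in as the weight instead.

```cpp
// Compute a Sethi-Ullman style weight per subtree, then sort children so the
// heaviest subtree is evaluated (and its registers recycled) first.
#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

struct IRNode
{
    std::vector<IRNode*> children;
    uint32_t weight = 0u;
};

// A leaf needs 1 register; an inner node needs the max over its children
// (taken in decreasing order of need) of (child need + number of earlier
// sibling results that are still live).
uint32_t computeWeight(IRNode* node)
{
    if (node->children.empty())
        return node->weight = 1u;

    std::vector<uint32_t> needs;
    for (IRNode* child : node->children)
        needs.push_back(computeWeight(child));
    std::sort(needs.begin(),needs.end(),std::greater<uint32_t>());

    uint32_t need = 0u;
    for (size_t i=0u; i<needs.size(); i++)
        need = std::max(need,needs[i]+static_cast<uint32_t>(i));
    return node->weight = need;
}

void reorderChildren(IRNode* node)
{
    std::sort(node->children.begin(),node->children.end(),
        [](const IRNode* a, const IRNode* b){ return a->weight > b->weight; });
    for (IRNode* child : node->children)
        reorderChildren(child);
}
```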

Register Compression

Right now we use between 3 and 11 DWORDs for the return values of the GLSL VM's instructions.

We currently don't have access to profiling tools like NSight or AMD shader/pipeline statistics to see the impact on occupancy under OpenGL. Optimization only makes sense if we can profile the results, so this is postponed until the Vulkan port.
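
Just to pin down what "compression" could mean here, one option is packing pairs of normalized components into a single DWORD instead of spending one DWORD per float. Purely illustrative, not a claim about what the VM should actually do; whether it pays off is exactly the occupancy question that needs profiling.

```cpp
// Pack/unpack two [0,1] floats as 16-bit unorms in one DWORD.
#include <algorithm>
#include <cstdint>

uint32_t packUnorm2x16(float x, float y)
{
    const auto quantize = [](float v) -> uint32_t
    {
        v = std::clamp(v,0.f,1.f);
        return static_cast<uint32_t>(v*65535.f+0.5f);
    };
    return quantize(x) | (quantize(y)<<16);
}

void unpackUnorm2x16(uint32_t packed, float& x, float& y)
{
    x = static_cast<float>(packed&0xffffu)/65535.f;
    y = static_cast<float>(packed>>16)/65535.f;
}
```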

To embed BSDF data parameters in Instruction Stream or Not?

Embedding would necessitate either a padded or a variable-length instruction set; it removes a dependent load at the cost of reduced cache friendliness.

BSDF data parameters seem to be generated explicitly for each instruction anyway, because whenever they sample from a texture, the BSDF data struct points to registers rather than texture handles.
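
To make the trade-off concrete, here is an illustrative-only contrast of the two layouts being weighed; none of these structs are the real ones, and the field widths are guesses.

```cpp
#include <cstdint>

// Current style: a fixed-size instruction that indexes into a separate BSDF
// data buffer, which costs a dependent load per instruction.
struct IndirectInstruction
{
    uint64_t opcodeAndRegisters; // bitfields for opcode, src/dst registers
    uint32_t bsdfDataIndex;      // dependent load: bsdfData[bsdfDataIndex]
};

// Embedded style: parameters ride in the instruction stream itself, removing
// the dependent load but forcing a padded (or variable-length) instruction.
struct EmbeddedInstruction
{
    uint64_t opcodeAndRegisters;
    uint32_t params[8];          // padded to the worst-case parameter count
};
```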

Conclusion

Register compression and BSDF data embedding probably need a dedicated synthetic performance test of the material compiler.

Need to revisit.