
Enhance layout optimization cost model to include multiplicative depth

Open: bon-cdp opened this issue 2 months ago • 4 comments

Summary

Enhance the layout optimization cost model to track multiplicative depth in addition to rotation count, enabling more accurate kernel selection for layout optimization decisions.

Background

PR #2347 implemented a DAG-based cost model for the layout optimization pass that counts rotations using symbolic execution with CSE. The cost model returns a scalar Cost representing the number of rotations required by a kernel (e.g., O(√n) for baby-step giant-step matrix-vector kernels).

While rotation count is a critical metric, multiplicative depth is another important factor in FHE performance. As discussed in #2347, future enhancements should track both metrics to make better-informed kernel selection decisions.

Proposed Enhancement

  1. Track multiplicative depth alongside rotation count by analyzing the ArithmeticDag structure
  2. Return a multi-dimensional cost structure containing both rotation count and multiplicative depth (see the sketch after this list)
  3. Develop a strategy for comparing and combining these metrics when selecting optimal kernels in computeHoistingOptions()
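
A minimal sketch of what such a multi-dimensional cost could look like (the struct name, field names, and comparison strategy below are hypothetical illustrations, not existing HEIR code):

```cpp
// Hypothetical multi-dimensional kernel cost; names and the comparison
// strategy are placeholders, not existing HEIR code.
#include <cstdint>

struct KernelCost {
  int64_t rotationCount = 0;       // number of ciphertext rotations
  int64_t multiplicativeDepth = 0; // longest chain of ct-ct multiplications

  // One possible strategy for computeHoistingOptions(): prefer lower
  // multiplicative depth first (since depth drives parameter sizes), then
  // break ties on rotation count.
  bool operator<(const KernelCost &other) const {
    if (multiplicativeDepth != other.multiplicativeDepth)
      return multiplicativeDepth < other.multiplicativeDepth;
    return rotationCount < other.rotationCount;
  }
};
```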

Related

  • #2316: Original issue for adding proper cost model (rotation counting implemented in #2347)

bon-cdp avatar Oct 28 '25 14:10 bon-cdp

This issue has 1 outstanding TODO:

This comment was autogenerated by todo-backlinks

github-actions[bot] avatar Oct 31 '25 03:10 github-actions[bot]

This week I plan to read PIM a bit, along with some papers, to implement a smart solution here. Any advice would be greatly appreciated (I'll certainly take my time to read and learn as much as I can along the way, so thanks in advance for your patience).

bon-cdp avatar Nov 02 '25 03:11 bon-cdp

A perfect solution would be to have, for each backend target, a relative latency cost for each scheme operation, derived from measured microbenchmarks of that backend target: a cost for mul, mul_plain, add, rotate, etc., in each of lattigo, openfhe, jaxite, etc.

We would likely also need to factor in something besides just latency, since the memory overhead of holding many rotation keys matters, and reducing it by chaining together (say) power-of-2 rotation keys or using a key-management runtime would affect both memory usage and latency.

These latencies will also depend to some degree on configuration options provided to the backend (e.g., running OpenFHE with OpenMP, or the parallelism inherent in GPU backends). So we would likely want a cost model that doesn't merely sum per-operation costs (which would correspond to running a circuit serially on a single-threaded CPU), but is aware of parallelism and can simulate dispatching operations in parallel.
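
As a rough illustration of what such per-backend data could look like (the operation names come from the list above, but every number below is a made-up placeholder, not a measurement):

```cpp
// Hypothetical per-backend, per-op latency table. Values are illustrative
// placeholders only; real numbers would come from microbenchmarks run with a
// fixed parameter set and backend configuration.
#include <map>
#include <string>

using OpLatencyTable = std::map<std::string, double>; // op name -> latency (us)

const std::map<std::string, OpLatencyTable> kBackendLatencies = {
    {"lattigo", {{"mul", 1.0}, {"mul_plain", 0.3}, {"add", 0.05}, {"rotate", 0.9}}},
    {"openfhe", {{"mul", 1.2}, {"mul_plain", 0.4}, {"add", 0.06}, {"rotate", 1.1}}},
};
```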

That is a LOT to do, and I wouldn't expect it all to be doable in the near future. Such a cost model would also be useful beyond just layout optimization, so I expect this to be something that grows over time with HEIR. And I am not an expert in this topic; the classical compiler world surely has a wealth of knowledge to draw from here. So the question is: what is a more limited-scope enhancement that we can do today that will be flexible enough to expand as these additional factors come into play?

The first obstacle is getting reasonable measured op latencies for library API backends. I know some academic groups have done basic versions of this for some backends (the one I know of is for lattigo), but I don't know of any that are public. I can ask around. Whatever solution we have here, it would be prudent to put the measurement code in this repository so we can re-run the benchmarks on the same machine with the same config over time.

But supposing we had such numbers for each op, we could analyze the DAG by applying a classical scheduling heuristic. We could look around at what compilers use for multithreaded scheduling, but simulating a dynamic dispatch scheduler seems like an easy starting point (a rough code sketch follows the list):

  1. Maintain an assignment of K available threads to the operations they are processing, and a set of completed operations.
  2. Put all the ops into a queue (toposorted).
  3. While some op still needs to be computed:
     • Pop from the queue until you find an op whose operands have all been computed, and assign it to the first available thread.
     • When all threads are full, "finish" the thread whose operation has the lowest remaining latency: mark that op as complete, mark the thread as available, and decrement the remaining latencies of the other in-flight jobs accordingly.
  4. When all jobs are complete, return the total elapsed latency as the final cost of the circuit.
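
A rough sketch of that simulation, assuming a toposorted op list and a per-op latency lookup (all types and names below are illustrative placeholders, not existing HEIR code):

```cpp
// Sketch of a dynamic-dispatch scheduling simulation over a toposorted DAG.
// Types and names are illustrative; a real version would walk the
// ArithmeticDag and use measured per-op latencies.
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

struct SimOp {
  double latency;                // cost of this op on its own
  std::vector<int64_t> operands; // indices of ops this op depends on
};

// `ops` must be topologically sorted; returns the total simulated latency.
double simulateSchedule(const std::vector<SimOp> &ops, int numThreads) {
  struct Running { int64_t op; double remaining; };
  std::vector<Running> running;  // in-flight ops, at most numThreads of them
  std::set<int64_t> completed;   // ops that have finished
  std::vector<int64_t> queue(ops.size());
  for (int64_t i = 0; i < (int64_t)ops.size(); ++i) queue[i] = i;

  double elapsed = 0.0;
  while (completed.size() < ops.size()) {
    // Dispatch: scan the queue for ops whose operands are all computed and
    // assign them to available threads.
    while ((int)running.size() < numThreads) {
      auto it = std::find_if(queue.begin(), queue.end(), [&](int64_t op) {
        for (int64_t dep : ops[op].operands)
          if (!completed.count(dep)) return false;
        return true;
      });
      if (it == queue.end()) break;
      running.push_back({*it, ops[*it].latency});
      queue.erase(it);
    }
    assert(!running.empty() && "DAG must be acyclic and numThreads >= 1");

    // "Finish" the in-flight op with the lowest remaining latency, advance
    // time by that amount, and decrement the other jobs' remaining latencies.
    auto soonest = std::min_element(
        running.begin(), running.end(),
        [](const Running &a, const Running &b) { return a.remaining < b.remaining; });
    double step = soonest->remaining;
    elapsed += step;
    completed.insert(soonest->op);
    running.erase(soonest);
    for (auto &r : running) r.remaining -= step;
  }
  return elapsed;
}
```

With numThreads = 1 this degenerates to summing per-op latencies, i.e., the "serial single-threaded CPU" baseline mentioned above.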

With this in mind, here is a simple way to separate these concerns:

  1. An interface for the per-op costs themselves, mapping an operation name to a list of metrics, so that the scheduler doesn't need to know the specifics of how those costs are computed.
  2. An interface for the cost model as a whole, so that multiple types of cost models can be swapped interchangeably and the cost model's configuration is hidden from the evaluation of the cost (known only at the start of the pass, when the pass flags are converted to a concrete cost model instance + config and an associated scheduler). The cost model would have to become a member variable of LayoutOptimization, set during runOnOperation, and the current RotationCountVisitor would need to be wrapped in a class that implements the cost model interface if it is going to remain an option. A sketch of this split is below.
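
A minimal sketch of how that split might look (class and method names here are hypothetical, not existing HEIR APIs):

```cpp
// Hypothetical interface split; all names are placeholders, not existing
// HEIR code.
#include <string>
#include <vector>

class ArithmeticDag; // opaque handle for the kernel's DAG (from HEIR)

// 1. Per-op costs: map an operation name to a list of metrics (latency,
//    rotation count, key-memory footprint, ...), so the scheduler doesn't
//    need to know how the numbers were obtained.
class OpCostTable {
public:
  virtual ~OpCostTable() = default;
  virtual std::vector<double> getMetrics(const std::string &opName) const = 0;
};

// 2. The cost model as a whole: evaluates a kernel and returns its cost.
//    A concrete instance + config would be chosen from pass flags in
//    runOnOperation and stored on LayoutOptimization; the existing
//    RotationCountVisitor could be wrapped in one implementation, and a
//    scheduler-based model would combine an OpCostTable with the simulation
//    sketched above.
class CostModel {
public:
  virtual ~CostModel() = default;
  virtual double evaluate(const ArithmeticDag &kernel) const = 0;
};
```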

j2kun avatar Nov 02 '25 04:11 j2kun

This appears to be a script that can be used to generate a cost model for Lattigo on CPU: https://github.com/applexi/lattigo/blob/main/benchmarks/cost_model_gen.go

j2kun avatar Nov 03 '25 03:11 j2kun