wasmtime icon indicating copy to clipboard operation
wasmtime copied to clipboard

cranelift: Better heuristics for instruction scheduling

Open jameysharp opened this issue 1 year ago • 4 comments

Feature

When any optimization level is enabled, Cranelift runs e-graph equality saturation, which removes all pure instructions from the function layout and leaves only a data-flow graph representation. We then have to put those instructions back into a linear order before we can emit machine code.

Currently, we pick an arbitrary topo-sort of the instructions. Instead, I propose implementing list scheduling and, when there are multiple instructions whose operands are all ready, using a heuristic to decide between them.

Benefit

See #6159 for one in-depth analysis of a workload where our instruction scheduling has a measurable impact on performance.

Implementation

I suggest a combination of two heuristics, based on the results in this paper:

Shobaki, G., Sakka, L., Abu Rmaileh, N. E., and Al-Hamash, H. (2015) Experimental evaluation of various register-pressure-reduction heuristics. Softw. Pract. Exper., 45: 1497– 1517. doi: 10.1002/spe.2297

The first heuristic is called "Last Use Count" (LUC). It counts how many live-ranges will end if we place this instruction now. Put another way, for each operand, check whether there are any other instructions that will use the same value but haven't been scheduled yet. If there aren't any, then this instruction is the last user of that value. Count how many operands that applies to. We want to choose an instruction with maximum LUC at each point, because that tends to reduce register pressure.

LUC should have frequent ties. Since most instructions have at most two operands, they can end zero, one, or two live ranges. So we need a tie-breaker.

The tie-breaker heuristic is "Critical Path" (CP) length. It counts the maximum length of any data-dependent chain of instructions that must execute after this instruction before the end of the block. By itself, CP optimizes for instruction-level parallelism (ILP) by starting long chains of operations as soon as possible, but it tends to increase register pressure. However, when used as a tie-breaker for LUC, it can reduce register pressure more than LUC alone. We want to choose an instruction with maximum CP.

If both LUC and CP are tied, we need a last-resort tie-breaker. Prefer the instruction with the smallest instruction number. That will tend to keep instructions in the same order that they appeared in the input.

Computing CP for each instruction can be done in linear time with a breadth-first traversal starting from the block terminator. If there's an edge from instruction A to instruction B, then the CP length for B is whichever is higher of: 1) A's CP length plus 1, or 2) any CP length from any other path. Note that each instruction in the side-effecting skeleton should be treated as if it has an edge to the side-effecting instruction immediately before it.

While visiting all the instructions in the block we should also record the reverse edges: the set of instructions which use each value. Also initialize a side table recording, for each instruction, the number of values it uses which are defined by instructions inside the current block.

The main scheduling loop should proceed like this:

  • Pick the highest-priority instruction that's currently ready and append it to the current block's layout.
  • For each operand of this instruction, remove this instruction from the set of the value's users. If the value now has exactly one user left, find that user and increment its LUC.
  • For each result defined by this instruction, find all instructions which use the value and decrement their count of unready operands. If any now has zero unready operands, add it to the priority queue of ready instructions.
  • If this instruction was in the side-effecting skeleton, check the next side-effecting instruction to see if it's ready. Add it to the queue if so.

Because the LUC can increase while an instruction is waiting in the ready queue, we need a priority queue which supports an efficient "decrease-key" operation. (Textbooks and academic papers generally assume you want to find/extract the minimum element, so we need to reverse the order.) "Rank-pairing heaps" (by Haeupler, Sen, and Tarjan) are the newest option I found in a literature review, and they cite all the earlier papers that I found, so that paper is a good starting point for picking a suitable data structure. It should be possible to get O(1) time for all the operations we need except for extract-min, which should be O(log n); but depending on the data structure chosen these bounds will either be worst-case or amortized time.

I think the overall running time of this algorithm is then O(n log n) in the number of instructions in the block. (There should maybe be a factor for the number of edges as well, but we can treat that as bounded by a constant factor of the number of instructions. Again, for most instructions that bound is 2.)

Alternatives

@cfallin suggested a variation on this idea, where we use some other heuristic to decide whether it's worth spending extra CPU time on heuristic-guided list scheduling. We can always fall back to the current topo-sort. One heuristic for this could be some function of the critical path length of the block as a whole (that is, the maximum CP for any instruction) versus the number of instructions in the block, which is one way of estimating the maximum amount of parallelism available in the block. If there's very little parallelism then there aren't many choices of topo-sort and they're probably all pretty close to equivalent.

Critical path length could be defined in terms of an estimate of the number of CPU cycles needed to evaluate the instruction, rather than assuming that all CLIF instructions take the same amount of time. This is probably not worth bothering with but, given a suitable cost model, it's easy to integrate by using a priority queue instead of a FIFO queue when computing CP length.

There are no doubt other ordering heuristics we could try.

jameysharp avatar Apr 21 '23 01:04 jameysharp