
Outlining pass

Open taktoa opened this issue 2 years ago • 3 comments

For II > 1 we'll need the ability to merge together entire blocks of code, not just single nodes (as we do in the current mutual_exclusion_pass). To do this, we should outline (opposite of inlining) functions from the code before running the mutual_exclusion_pass, and then inline them afterward.

First, we'll define antiunification to be a process in which we match two nodes against each other as deeply as possible. For example, the antiunification of x + (5 * y) and 3 + (4 * (p + q)) would be a + (b * c). The technical definition is that the antiunification is the largest term that can be instantiated to yield each of the given terms. We can swap the operands of commutative ops and reassociate associative ops to make the antiunification bigger, though this is not essential. Antiunification is itself commutative and associative, so the antiunification of a set of nodes is the same no matter how you decompose it into pairwise applications of antiunification.
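The matching process above can be sketched with a toy expression tree (all names here are hypothetical, not XLS types). Note this simplified version introduces a fresh placeholder `_` at every mismatch rather than reusing one variable for repeated mismatching subterm pairs, and it does not exploit commutativity:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy expression tree standing in for dataflow nodes.
struct Expr {
  std::string label;  // e.g. "+", "*", "x", "5"
  std::vector<std::shared_ptr<Expr>> children;
};

std::shared_ptr<Expr> Leaf(const std::string& label) {
  return std::make_shared<Expr>(Expr{label, {}});
}

std::shared_ptr<Expr> Op(const std::string& label,
                         std::vector<std::shared_ptr<Expr>> children) {
  return std::make_shared<Expr>(Expr{label, std::move(children)});
}

// Match the two trees as deeply as possible; wherever labels or arities
// differ, emit a placeholder "_" (a variable of the antiunifier).
std::shared_ptr<Expr> Antiunify(const std::shared_ptr<Expr>& a,
                                const std::shared_ptr<Expr>& b) {
  if (a->label != b->label || a->children.size() != b->children.size()) {
    return Leaf("_");
  }
  std::vector<std::shared_ptr<Expr>> kids;
  for (size_t i = 0; i < a->children.size(); ++i) {
    kids.push_back(Antiunify(a->children[i], b->children[i]));
  }
  return Op(a->label, std::move(kids));
}

// S-expression printer for inspecting antiunifiers.
std::string ToString(const std::shared_ptr<Expr>& e) {
  if (e->children.empty()) return e->label;
  std::string s = "(" + e->label;
  for (const auto& c : e->children) s += " " + ToString(c);
  return s + ")";
}
```

On the example above, antiunifying x + (5 * y) with 3 + (4 * (p + q)) yields (+ _ (* _ _)), which is a + (b * c) with the placeholders left unnamed.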

The collision-avoiding antiunification of two nodes in a dataflow graph is the largest pair of matching subgraphs such that the two subgraphs are nonoverlapping. For example, if you had x = a + b * c; y = d + e * x; z = f + g * y;, then the antiunification of y with z would be _ + (_ * (_ + _ * _)), but the collision-avoiding antiunification would be _ + _ * _, because the two subgraphs that are shaped like _ + (_ * (_ + _ * _)) overlap. From here on, when we refer to antiunification, assume that it is collision-avoiding.

In our outlining pass, we will repeatedly (greedily) try to outline the best possible single function for a given dataflow graph. To do this, we will first compute the antiunification of every pair of nodes in the dataflow graph, cataloging the antiunifiers in a table of quadratic size. This table can be computed using dynamic programming. If two nodes have no antiunifier, that fact is also cataloged in the table. Then, our goal is to choose a subset of n nodes that, when antiunified, generates a term of size k, such that n * k is maximized. If we consider the table of antiunifiers to be a graph whose edges are weighted by the size of the pairwise antiunifier of the two endpoint nodes (when it exists), then we want to choose a clique that maximizes the product of its size and its minimum edge weight. Using the minimum pairwise edge weight in place of the size of the whole clique's antiunifier may be a simplification, though I can't come up with an example where it would perform worse.
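As a concrete sketch of the table (integer node indices and a caller-supplied size oracle, both hypothetical), the quadratic table can be held as a symmetric matrix with 0 recording the "no antiunifier" case:

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Build the quadratic table of pairwise antiunifier sizes. The
// antiunifier_size callback stands in for the dynamic program described
// above; it should return 0 when no antiunifier exists.
std::vector<std::vector<int64_t>> BuildAntiunifierTable(
    int64_t num_nodes,
    const std::function<int64_t(int64_t, int64_t)>& antiunifier_size) {
  std::vector<std::vector<int64_t>> table(num_nodes,
                                          std::vector<int64_t>(num_nodes, 0));
  for (int64_t i = 0; i < num_nodes; ++i) {
    for (int64_t j = i + 1; j < num_nodes; ++j) {
      // The relation is symmetric, so fill both entries.
      table[i][j] = table[j][i] = antiunifier_size(i, j);
    }
  }
  return table;
}
```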

This problem can be greedily solved in the following way. First, remove the lowest weight edges until the graph is a disjoint union of cliques (which can be recognized by checking if the complement of each connected component contains no edges). Then execute the following algorithm on each clique:

// Take all the minimum edge weight edges and remove one of their endpoints.
// The endpoint chosen is based on the following rule: if at least one of the endpoints has been
// removed already, skip that edge, otherwise remove the node that has the smaller minimum
// incident edge weight.
// This function returns the optimal value; returning the optimal clique only requires small changes.
int64_t AblateClique(absl::flat_hash_set<Node*> clique, std::function<int64_t(Node*, Node*)> edge_weight) {
  if (clique.size() <= 1) { return 0; }  // Base case: a single node saves nothing.
  // Number of nodes saved by outlining this clique as-is.
  int64_t number_of_nodes_saved = (static_cast<int64_t>(clique.size()) - 1) * Antiunify(clique).size();
  // Find all edges whose edge weight is equal to the minimum edge weight in the clique.
  absl::flat_hash_set<std::pair<Node*, Node*>> minimum_edge_weight_edges = ...;
  for (const auto& [a, b] : minimum_edge_weight_edges) {
    if (!clique.contains(a) || !clique.contains(b)) { continue; }
    int64_t minimum_incident_edge_weight_a = std::numeric_limits<int64_t>::max();
    int64_t minimum_incident_edge_weight_b = std::numeric_limits<int64_t>::max();
    for (Node* node : clique) {
      if ((node == a) || (node == b)) { continue; }
      minimum_incident_edge_weight_a = std::min(minimum_incident_edge_weight_a, edge_weight(node, a));
      minimum_incident_edge_weight_b = std::min(minimum_incident_edge_weight_b, edge_weight(node, b));
    }
    if (minimum_incident_edge_weight_a < minimum_incident_edge_weight_b) {
      clique.erase(a);
    } else {
      clique.erase(b);
    }
  }
  // Recurse on the ablated clique and keep whichever ablation level is better.
  return std::max(number_of_nodes_saved, AblateClique(clique, edge_weight));
}
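A self-contained version of the same greedy can be run on toy inputs. This sketch substitutes std containers for absl, integer node ids for Node*, and uses the minimum pairwise edge weight as a crude stand-in for Antiunify(clique).size() (it is an upper bound on the set antiunifier's size); all of those substitutions are assumptions for illustration:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>
#include <limits>
#include <set>
#include <utility>

// Greedy clique ablation: score the clique as-is, remove one endpoint of
// each minimum-weight edge, recurse, and keep the best score seen.
int64_t AblateClique(
    std::set<int64_t> clique,
    const std::function<int64_t(int64_t, int64_t)>& edge_weight) {
  if (clique.size() <= 1) return 0;  // A single node saves nothing.
  // Minimum edge weight over the clique, standing in for antiunifier size.
  int64_t min_w = std::numeric_limits<int64_t>::max();
  for (int64_t a : clique)
    for (int64_t b : clique)
      if (a < b) min_w = std::min(min_w, edge_weight(a, b));
  int64_t saved = (static_cast<int64_t>(clique.size()) - 1) * min_w;
  // Remove one endpoint of each minimum-weight edge: the endpoint with the
  // smaller minimum incident edge weight.
  std::set<int64_t> reduced = clique;
  for (int64_t a : clique) {
    for (int64_t b : clique) {
      if (a >= b || edge_weight(a, b) != min_w) continue;
      if (!reduced.count(a) || !reduced.count(b)) continue;
      int64_t min_a = std::numeric_limits<int64_t>::max();
      int64_t min_b = std::numeric_limits<int64_t>::max();
      for (int64_t c : clique) {
        if (c == a || c == b) continue;
        min_a = std::min(min_a, edge_weight(c, a));
        min_b = std::min(min_b, edge_weight(c, b));
      }
      reduced.erase(min_a < min_b ? a : b);
    }
  }
  return std::max(saved, AblateClique(std::move(reduced), edge_weight));
}
```

For example, on a 3-clique with one weight-1 edge and two weight-5 edges, keeping all three nodes saves 2 * 1 = 2, while ablating down to the weight-5 pair saves 1 * 5 = 5, so the recursion returns 5.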

cc @meheff

taktoa avatar Jul 20 '22 20:07 taktoa

Interesting! It'd be a good exercise to run such an algorithm on our benchmarks and examples where we might know there is benefit to outlining and see what pops out. I wonder if this approach could be used in a proc rerolling scheme where outlining finds the reuse opportunities and rerolling the proc reduces the number of outlined instances to one, with some state on the proc determining which inputs get muxed into the single outlined instance in each iteration. This effectively encodes the reuse state machine explicitly into the proc.

A few comments/questions:

  • For the collision-avoiding antiunification example, do you mean that the antiunification of y and z (not x and y) is _ + (_ * (_ + _ * _))? I don't see how you get that large a pattern from x.

  • Is anything special done for matching among commutative operations? I can imagine trying both orderings and picking the maximal match but that might be exponential (maybe not with dynamic programming?), or perhaps performing some canonicalization ahead of time.

  • Rather than using the antiunification expression size k in the objective function, we may want to weight the size by the cost (in area) of the circuit that computes the expression. To account for the cost of muxing inputs, you could subtract from that cost some value correlated with the widths of the variable inputs. This gives you a net benefit for outlining the expression and would focus the algorithm on large-area, small-input expressions, which provide the most benefit. I guess the unknown is whether this objective function has the right properties such that the algorithm (or one similar to it) finds you an optimal solution.
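One possible shape for such a net-benefit objective (all weights and names here are hypothetical, for illustration only): outlining n instances of an expression saves (n - 1) copies of its area, while each variable input now needs an n-way mux whose cost grows with the input's bit width.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Net benefit of outlining n instances of an expression with the given
// circuit area, charging a per-bit mux cost for each variable input.
int64_t NetBenefit(int64_t n, int64_t expr_area,
                   const std::vector<int64_t>& input_widths,
                   int64_t mux_cost_per_bit) {
  int64_t mux_cost = 0;
  for (int64_t w : input_widths) mux_cost += mux_cost_per_bit * n * w;
  return (n - 1) * expr_area - mux_cost;
}
```

Under this model, a large expression with narrow inputs scores well (e.g. NetBenefit(4, 100, {8, 8}, 1) = 300 - 64 = 236), while a small expression with one wide input scores negatively (NetBenefit(2, 10, {64}, 1) = 10 - 128 = -118), matching the intuition above.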

meheff avatar Aug 02 '22 16:08 meheff

Yep, that was a typo.

For associative/commutative operations, I think canonicalizing first will be our best bet. Something like "sort the children by lexicographic order of the preorder of the tree below them" would work decently. There is a body of literature on matching/unification modulo theories (specifically, I know AC-matching and AC-unification are well studied), but I believe even just matching reduces to solving systems of nonnegative linear Diophantine equations, which is NP-complete. There are probably some decent heuristics, though. I'd need to see whether any of this work exists for antiunification.
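The sorting-based canonicalization suggested above can be sketched as follows (toy types; a real pass would operate on the IR and know which ops are commutative):

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Toy term tree; `commutative` marks ops whose children may be reordered.
struct Term {
  std::string op;
  bool commutative;
  std::vector<std::shared_ptr<Term>> children;
};

// Preorder serialization used as the sort key.
std::string Preorder(const std::shared_ptr<Term>& t) {
  std::string s = t->op;
  for (const auto& c : t->children) s += " " + Preorder(c);
  return s;
}

// Canonicalize bottom-up: children are canonicalized first so that their
// preorder strings (and hence the parent's sort) are stable.
void Canonicalize(const std::shared_ptr<Term>& t) {
  for (const auto& c : t->children) Canonicalize(c);
  if (t->commutative) {
    std::sort(t->children.begin(), t->children.end(),
              [](const std::shared_ptr<Term>& a,
                 const std::shared_ptr<Term>& b) {
                return Preorder(a) < Preorder(b);
              });
  }
}
```

After canonicalization, y + x and x + y serialize identically, so syntactic antiunification catches matches that commutativity would otherwise hide.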

I thought about weighting the size by cost, but the connection between the edge weights and optimality is quite subtle so I think that will require a lot more thought/work.

taktoa avatar Aug 04 '22 20:08 taktoa

Mark pointed out that the "First, remove the lowest weight edges until the graph is a disjoint union of cliques (which can be recognized by checking if the complement of each connected component contains no edges)." step is unnecessary, because the property of having antiunifier size greater than 0 is transitive, so the graph is already a disjoint union of cliques.

Also, I realized that the problem AblateClique is trying to solve is repeatedly finding the maximum clique in the graph, then removing all minimum-weight edges from that graph, and taking the maximum product of (number of nodes) * (minimum edge weight) over the resulting set of cliques. So it may help to use a proper maximum clique finding algorithm instead of the greedy solution described here.
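One candidate for "a proper maximum clique finding algorithm" is Bron-Kerbosch; a minimal version without pivoting (adjacency matrix and integer node ids are assumptions of this sketch) looks like:

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Bron-Kerbosch without pivoting: r is the clique under construction,
// p the candidate nodes, x the already-processed nodes. Tracks the
// largest clique found in `best`. adj is a symmetric adjacency matrix.
void BronKerbosch(std::set<int64_t> r, std::set<int64_t> p,
                  std::set<int64_t> x,
                  const std::vector<std::vector<bool>>& adj,
                  std::set<int64_t>& best) {
  if (p.empty() && x.empty()) {
    // r is a maximal clique; keep it if it is the largest so far.
    if (r.size() > best.size()) best = r;
    return;
  }
  std::set<int64_t> p_copy = p;
  for (int64_t v : p_copy) {
    std::set<int64_t> nr = r;
    nr.insert(v);
    std::set<int64_t> np, nx;
    for (int64_t u : p) if (adj[v][u]) np.insert(u);
    for (int64_t u : x) if (adj[v][u]) nx.insert(u);
    BronKerbosch(nr, np, nx, adj, best);
    p.erase(v);
    x.insert(v);
  }
}
```

This is worst-case exponential, but the cliques arising here should be small; pivoting and degeneracy ordering are standard refinements if they are not.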

taktoa avatar Aug 12 '22 04:08 taktoa