
Suboptimal scaling for expressions of constant treewidth

Open yaroslavvb opened this issue 4 years ago • 12 comments

I'm seeing opt_einsum generate suboptimal schedules for einsum graphs of small treewidth.

For instance, the family of Apollonian networks consists of treewidth-3 graphs whose optimal decomposition is found by greedily triangulating the graph with the min-fill heuristic. The optimal scaling is n^4 regardless of graph size.
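
For concreteness, here is a minimal sketch of that min-fill idea (`min_fill_order` is a hypothetical helper, not part of opt_einsum): repeatedly eliminate the vertex whose elimination adds the fewest fill-in edges; the largest neighbourhood encountered along the way bounds the width of the resulting decomposition.

```python
def min_fill_order(adj):
    """Greedy min-fill elimination ordering.

    adj: dict mapping vertex -> set of neighbouring vertices.
    Returns (elimination order, width of the induced decomposition).
    """
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy

    def fill_in(v):
        # Number of edges needed to turn v's neighbourhood into a clique.
        ns = sorted(adj[v])
        return sum(1 for i, a in enumerate(ns) for b in ns[i + 1:]
                   if b not in adj[a])

    order, width = [], 0
    while adj:
        v = min(adj, key=fill_in)       # cheapest vertex to eliminate
        ns = adj.pop(v)
        width = max(width, len(ns))     # bag size minus one
        for a in ns:                    # connect the neighbourhood into a clique
            adj[a].discard(v)
            adj[a].update(ns - {a})
        order.append(v)
    return order, width
```

On a triangle this reports width 2, and on a star (a tree) width 1, matching the treewidths of those graphs.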

Here's an example graph and its tree decomposition (generated with this code):

[image: Apollonian-3 graph and its tree decomposition]

This problem corresponds to the following einsum: BC,BD,BE,BF,BG,BI,BJ,BL,BM,CD,CE,CF,CH,CI,CK,CO,CP,DE,DG,DH,DL,DN,DO,DQ,EF,EG,EH,EJ,EK,EM,EN,EP,EQ,FI,FJ,FK,GL,GM,GN,HO,HP,HQ->

The default optimizer gives n^9 scaling. optimize='dp' recovers n^4 but seems a bit slow, and it doesn't work for the next largest graph, where the default contraction order is O(n^20):
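
For reference, the numbers above can be reproduced along these lines (a sketch, assuming a recent opt_einsum where contract_path accepts shapes=True); swapping optimize='dp' into the loop recovers scaling 4, at the cost of a noticeably longer search:

```python
import opt_einsum as oe

# The Apollonian-3 expression from above; every index shares dimension n.
eq = ("BC,BD,BE,BF,BG,BI,BJ,BL,BM,CD,CE,CF,CH,CI,CK,CO,CP,DE,DG,DH,DL,DN,"
      "DO,DQ,EF,EG,EH,EJ,EK,EM,EN,EP,EQ,FI,FJ,FK,GL,GM,GN,HO,HP,HQ->")
n = 8
shapes = [(n, n)] * len(eq[:-2].split(","))

for opt in ("greedy", "auto"):
    path, info = oe.contract_path(eq, *shapes, shapes=True, optimize=opt)
    print(opt, "max scaling:", max(info.scale_list))
```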

[image: Apollonian-4 graph and its tree decomposition]

An even larger example is Apollonian-6, with 1095 edges, where opt_einsum's optimize='auto' gives an O(n^60) contraction order -- colab

yaroslavvb avatar Dec 01 '19 06:12 yaroslavvb

It does seem the current 'greedy' heuristic struggles a lot with this type of graph. Maybe because it is a very vertex-centric approach, whereas here, due to the size of the hyper-edges, there are far fewer of them overall.

I have some optimizers based on the linegraph tree decomposition, and they both find the optimal scaling quickly for the largest (Apollonian-6) graph:

  • QuickBB-based: scaling 4, FLOP count 8.5e10
  • FlowCutter-based: scaling 4, FLOP count 8.5e10

A hypergraph partitioner based optimizer, which usually outperforms both of those, still struggles a bit and gets:

  • scaling 5, FLOP count 8.8e11

Whereas a method that adaptively updates the usual greedy heuristic only gets:

  • scaling 22, FLOP count 2.0e44

jcmgray avatar Dec 01 '19 12:12 jcmgray

We have always focused on generalized expressions rather than specific network topologies. Are there other networks beyond Apollonian that we should consider adding to the test suite and optimizing for?

dgasmith avatar Dec 01 '19 14:12 dgasmith

Perhaps random subgraphs of k-trees?

Any einsum graph computable in O(n^{k+1}) time is a subgraph of a k-tree, so if you make the optimizer succeed with probability 1 for some k, you have achieved optimality for all problems of treewidth up to k. k <= 4 covers the bulk of real-world problems that are feasible to compute exactly.

I've added some k-trees to the colab, generated using the code here with the simplex "gluing" approach from https://en.wikipedia.org/wiki/K-tree
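
For anyone who wants to generate test cases without the colab, here is a sketch of that gluing construction (`random_k_tree` and `ktree_einsum` are hypothetical helpers; the only opt_einsum API assumed is `get_symbol`):

```python
import random
from itertools import combinations

import opt_einsum as oe

def random_k_tree(k, n, seed=0):
    """Grow a k-tree on n vertices: start from a (k+1)-clique, then
    repeatedly glue a new vertex onto a randomly chosen k-clique."""
    rng = random.Random(seed)
    edges = set(combinations(range(k + 1), 2))
    faces = list(combinations(range(k + 1), k))  # k-cliques available for gluing
    for v in range(k + 1, n):
        face = rng.choice(faces)
        edges.update((u, v) for u in face)
        # Each new vertex exposes k fresh k-cliques to glue onto later.
        faces.extend(tuple(sorted((set(face) - {u}) | {v})) for u in face)
    return edges

def ktree_einsum(edges):
    """Turn each edge into a rank-2 term of an einsum equation."""
    sym = {}
    def s(v):
        return sym.setdefault(v, oe.get_symbol(len(sym)))
    return ",".join(s(a) + s(b) for a, b in sorted(edges)) + "->"

edges = random_k_tree(3, 20)
eq = ktree_einsum(edges)  # contract with optimize='dp' to check scaling k+1
```

A k-tree on n vertices always has k*n - k*(k+1)/2 edges, which is a handy sanity check on the generator.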

[image: generated k-tree examples]

In the real world you have a fixed compute budget, which limits the largest size of intermediate factors. Under this constraint the problem of optimal decomposition is fixed-parameter tractable: finding the optimal decomposition is linear in the size of the graph. That particular algorithm doesn't seem practical, but even for heuristics, a "maximum size of intermediate factor" constraint should limit the search space significantly.
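
opt_einsum already exposes part of this idea through the memory_limit argument to contract_path, which caps the size of the intermediates the path finders will consider. A sketch (how strictly the limit is honoured varies by optimizer, so treat it as a hint rather than a guarantee):

```python
import opt_einsum as oe

# A chain of four matrices; cap every intermediate at n**2 elements.
n = 50
eq = "ab,bc,cd,de->"
path, info = oe.contract_path(eq, *[(n, n)] * 4, shapes=True,
                              optimize="greedy", memory_limit=n**2)
print(info.largest_intermediate)  # stays within the budget here
```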

Ideally, an einsum optimizer could just tell you that the computation is not possible within your scaling budget, so you would know to reformulate the problem. The algorithm in Section 10.5 of the Kleinberg/Tardos book runs in linear time and either produces a width-4w decomposition or a proof that the treewidth is more than w.

BTW, I'm curious what your workflow is for evaluating these methods.

  • How do you enable the FlowCutter method?
  • What is an easier way to evaluate decompositions that doesn't print a ton of stuff to stdout? (Currently I'm hitting Jupyter output limits on larger graphs.)

yaroslavvb avatar Dec 01 '19 19:12 yaroslavvb

What is an easier way to evaluate decompositions that doesn't print a ton of stuff to stdout? (Currently I'm hitting Jupyter output limits on larger graphs.)

Ah, the printed info is actually an object (with a verbose __repr__):

path, info = oe.contract_path(eq, *shapes, shapes=True)
info.opt_cost              # total FLOP count of the chosen path
info.largest_intermediate  # size of the largest intermediate tensor
max(info.scale_list)       # worst per-contraction scaling
# etc

jcmgray avatar Dec 02 '19 12:12 jcmgray

How do you enable the FlowCutter method?

  1. Create a line graph of the einsum equation.
  2. Feed it to FlowCutter, which finds a tree decomposition.
  3. Generate an edge elimination ordering from the tree decomposition.
  4. Contract the tensors based on that edge order, using 'auto-hq' whenever the subgraph induced by a (hyper)edge contains more than two tensors.
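
Step 1 can be sketched as follows (`line_graph` is a hypothetical helper, not opt_einsum API): the line graph has one node per einsum index, with an edge between two indices whenever they share a tensor, and it is this graph that gets handed to the tree-decomposition solver.

```python
from itertools import combinations

def line_graph(eq):
    """Line graph of an einsum equation: one node per index, with an
    edge between two indices whenever they appear in the same tensor."""
    lhs = eq.split("->")[0]
    edges = set()
    for term in lhs.split(","):
        edges.update(combinations(sorted(set(term)), 2))
    return edges

line_graph("ab,bc,ca->")  # the edges of a triangle on {'a', 'b', 'c'}
```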

jcmgray avatar Dec 02 '19 12:12 jcmgray

A couple more examples: I expected to always get an O(n^2) schedule from tree-structured graphs, but I get an O(n^4) schedule 66% of the time using auto. optimize='dp' recovers it correctly, though.

Here's an example that takes 100 random subgraphs of the tree-structured expression below and prints the scaling of the resulting einsum optimization:

[image: the tree-structured expression graph]

import opt_einsum as oe
import numpy as np
import re
graph='{{1,2},{1,3},{1,4},{1,5},{1,6},{2,7},{2,8},{2,9},{2,10},{2,11},{3,12},{3,13},{3,14},{3,15},{3,16},{4,17},{4,18},{4,19},{4,20},{4,21},{5,22},{5,23},{5,24},{5,25},{5,26},{6,27},{6,28},{6,29},{6,30},{6,31},{7,32},{7,33},{7,34},{7,35},{7,36},{8,37},{8,38},{8,39},{8,40},{8,41},{9,42},{9,43},{9,44},{9,45},{9,46},{10,47},{10,48},{10,49},{10,50},{10,51},{11,52},{11,53},{11,54},{11,55},{11,56},{12,57},{12,58},{12,59},{12,60},{12,61},{13,62},{13,63},{13,64},{13,65},{13,66},{14,67},{14,68},{14,69},{14,70},{14,71},{15,72},{15,73},{15,74},{15,75},{15,76},{16,77},{16,78},{16,79},{16,80},{16,81},{17,82},{17,83},{17,84},{17,85},{17,86},{18,87},{18,88},{18,89},{18,90},{18,91},{19,92},{19,93},{19,94},{19,95},{19,96},{20,97},{20,98},{20,99},{20,100},{20,101},{21,102},{21,103},{21,104},{21,105},{21,106},{22,107},{22,108},{22,109},{22,110},{22,111},{23,112},{23,113},{23,114},{23,115},{23,116},{24,117},{24,118},{24,119},{24,120},{24,121},{25,122},{25,123},{25,124},{25,125},{25,126},{26,127},{26,128},{26,129},{26,130},{26,131},{27,132},{27,133},{27,134},{27,135},{27,136},{28,137},{28,138},{28,139},{28,140},{28,141},{29,142},{29,143},{29,144},{29,145},{29,146},{30,147},{30,148},{30,149},{30,150},{30,151},{31,152},{31,153},{31,154},{31,155},{31,156}}'

def tc(num):
    # Map a vertex number to a single character usable as an einsum index.
    return chr(num + 91)

def factor(first, second):
    return tc(int(first)) + tc(int(second))

from collections import defaultdict

scaling = defaultdict(int)
for _ in range(100):
    factors = [factor(a, b) for a, b in re.findall(r"\{(\d+),(\d+)\}", graph)]
    # Keep a random half of the edges, i.e. a random subgraph of the tree.
    factors = np.random.choice(factors, replace=False, size=int(len(factors) * 0.5))
    expr = ','.join(factors) + '->'
    path, info = oe.contract_path(expr, *[(100, 100)] * len(factors),
                                  shapes=True, optimize='auto')
    scaling[max(info.scale_list)] += 1
print(scaling)  # e.g. {4: 67, 3: 33}

yaroslavvb avatar Dec 08 '19 00:12 yaroslavvb

@jcmgray Do you think we can fix this with the current algorithms, or should we look at adding FlowCutter/QuickBB for these highly structured graphs?

dgasmith avatar Feb 02 '20 14:02 dgasmith

The practical need is the following: if there is a schedule that computes an expression in reasonable time, discovering that schedule should also take reasonable time.

The assumption of "reasonable time" puts an upper bound on the treewidth of the expression graph. Perhaps the dp search strategy could be improved to take that into account? Currently it takes too long even for a tree-structured expression.

yaroslavvb avatar Feb 02 '20 22:02 yaroslavvb

BTW, I recently reran this comparison against np.einsum, and the issue still exists. However, opt_einsum's dp optimizer seems to be better than np.einsum's optimizer. NumPy appears to use this algorithm.

Here's the colab comparing the methods

What motivated this is that @rgommers was working on including np.einsum in the array API, which requires deciding which optimization method to include in the API standard.

yaroslavvb avatar Jan 09 '22 17:01 yaroslavvb

I added the greedy and optimal versions of the opt_einsum code directly to NumPy a number of years ago (oof, blame says 5!). Parsing through the blame, it hasn't been touched since. I'm happy to contribute more of opt_einsum to NumPy if we can define a feature set everyone is happy with. The diversity of use cases np.einsum serves made it a bit of a balancing act.

What we are looking for here is that the minimal scaling (though not necessarily the minimal FLOP count) is found within a reasonable time envelope for an arbitrary graph. Let me check whether a few facts are true:

  • dp: minimal scaling 👍 (except for some small-treewidth graphs), time envelope 👎
  • greedy: minimal scaling 👎, time envelope 👍
  • We need an algorithm that does not require dependencies beyond NumPy, such as NetworkX

There are a few ideas here which should be simple to implement if we can cover the use cases. Is there a working group we can join to find out more about the API standard?

dgasmith avatar Jan 15 '22 17:01 dgasmith

I learned about np.einsum from this discussion, @rgommers -- could you point us to where Python Array API discussions are happening?

yaroslavvb avatar Jan 15 '22 20:01 yaroslavvb

I learned about np.einsum from this discussion, @rgommers -- could you point us to where Python Array API discussions are happening?

Sure, this is the main repo: https://github.com/data-apis/array-api/issues

rgommers avatar Jan 17 '22 13:01 rgommers