opt_einsum.contract is slower than raw numpy.einsum
Dear developer,
I am considering substituting every numpy.einsum in my code with opt_einsum.contract. However, I found that in some cases contract is much slower than raw numpy.einsum without any path optimization.
I have three complex-valued tensors with the following shapes:
- A: (1166, 8000, 3, 3)
- B: (8000, 6, 1166, 3)
- C: (8000, 6, 1166, 3)
Then I used:
D = np.einsum("ijkl,jmik,jmil->jm", A, B, C)
and
from opt_einsum import contract
D = contract("ijkl,jmik,jmil->jm", A, B, C)
Using contract is about 3 to 4 times slower compared to np.einsum (e.g. ~58 s and ~16 s, respectively). According to the documentation of opt_einsum, if I don't specify the parameter optimize, the default is optimize='auto', which keeps the path-finding time below around 1 ms.
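For reference, here is a minimal sketch of my benchmark (note that at these shapes the three complex128 arrays alone take roughly 7 GB of RAM, so timings will of course vary by machine):

```python
import time
import numpy as np
from opt_einsum import contract

rng = np.random.default_rng(0)

def rand_c(shape):
    # complex-valued test tensor
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

A = rand_c((1166, 8000, 3, 3))
B = rand_c((8000, 6, 1166, 3))
C = rand_c((8000, 6, 1166, 3))

for name, fn in [("np.einsum", np.einsum), ("contract", contract)]:
    t0 = time.perf_counter()
    D = fn("ijkl,jmik,jmil->jm", A, B, C)
    print(name, time.perf_counter() - t0, "s")
```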
I am trying to understand why in this case opt_einsum performs worse than the raw einsum of numpy. Any suggestions or comments?
Best, Changpeng
Hi @lin-cp, thanks for the interesting example. For what it's worth, I get a much smaller performance difference (4.5 s vs 6.0 s), but contract is still slower nonetheless. If you check the optimal path:
Complete contraction: ijkl,jmik,jmil->jm
Naive scaling: 5
Optimized scaling: 5
Naive FLOP count: 1.511e+9
Optimized FLOP count: 1.343e+9
Theoretical speedup: 1.125e+0
Largest intermediate: 1.679e+8 elements
--------------------------------------------------------------------------------
scaling BLAS current remaining
--------------------------------------------------------------------------------
5 0 jmik,ijkl->jmil jmil,jmil->jm
4 0 jmil,jmil->jm jm->jm
you see that there is actually no scaling advantage to the optimized path, and only a very small absolute cost decrease. This cost decrease is probably largely offset by the time spent writing the intermediate to memory, thus leading to the worse overall performance.
In general, for these very low arithmetic-intensity, small contractions (i.e. 3 terms) where most indices appear on all the tensors, there is little to be gained from optimizing the order of contractions (which is what opt_einsum does), and performance mostly comes down to a variety of factors that are not purely FLOPs.
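(For reference, the report above comes from contract_path, assuming the A, B, C arrays from your post:)

```python
import opt_einsum as oe

# contract_path returns (path, PathInfo); printing the PathInfo
# object produces the report shown above
path, info = oe.contract_path("ijkl,jmik,jmil->jm", A, B, C)
print(info)
```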
@jcmgray Thanks for your inspiring explanation. Could you explain more about which intermediate steps require possibly intensive memory and worsen the performance of opt_einsum?
I have two further questions:
- Is there any empirical rule to judge whether using opt_einsum will help to save time?
- Does the order of indices matter? Will it help to save time? For example, will there be a large difference between "ij,jk->ik" and "ji,jk->ik" (i.e. first transposing the first tensor)?
The intermediate step is as listed above:
- "jmik,ijkl->jmil"
Is there any empirical rule to judge if using opt_einsum will help to save time?
- Generally the speedup listed by contract_path should be quite a bit greater than 1. The speedups from ordering grow much larger as the number of terms increases (exponentially so). The speedups from using matrix multiplication are much greater when the arithmetic intensity (~FLOPs / MOPs) is high, which in practice means more 'contracted' indices vs 'batch' indices. A toy illustration follows below.
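For instance (a toy example with hypothetical shapes), a 5-term matrix chain shows a huge theoretical speedup from ordering alone, in contrast to the ~1.1x above:

```python
import numpy as np
import opt_einsum as oe

d = 64
mats = [np.random.rand(d, d) for _ in range(5)]

# a 5-term matrix chain: ordering collapses the naive cost dramatically
_, info = oe.contract_path("ab,bc,cd,de,ef->af", *mats)
print(info.speedup)  # >> 1, so contract is clearly worthwhile here
```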
Does the order of indices matter?
- It does, but it's hard to predict, and opt_einsum does not take it into account - just the order of contractions. Essentially, if the indices/memory are already lined up to do the contraction via matrix multiplication, that is best; a rough sketch follows below.
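For instance (a rough sketch with made-up sizes; timings are machine-dependent), in the simple 2D case both orderings map onto a BLAS call, so the difference is small, while for higher-dimensional operands the required axis permutations can matter more:

```python
import time
import numpy as np

x = np.random.rand(2000, 2000)
y = np.random.rand(2000, 2000)

# "ij,jk->ik" is x @ y; "ji,jk->ik" is x.T @ y, where BLAS
# handles the transpose via a flag rather than a data copy
for eq in ["ij,jk->ik", "ji,jk->ik"]:
    t0 = time.perf_counter()
    np.einsum(eq, x, y, optimize=True)
    print(eq, time.perf_counter() - t0, "s")
```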
If you have a 3 term contraction and need to eke out the best performance, you probably need to benchmark things explicitly; the optimization opt_einsum is doing here is really just choosing among the 3 possibilities ((AB)C), ((AC)B) and (A(BC)), then dispatching these to matrix multiplication when possible (which is not the case in your example). A sketch of such a benchmark is below.
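As an illustration, here is a minimal sketch of those three orders written out as explicit two-step np.einsum calls (A, B and C as in the original post; the intermediate shapes in the comments are derived from those shapes):

```python
import numpy as np

# ((A B) C): the path chosen above, via the (8000, 6, 1166, 3) 'jmil' intermediate
D1 = np.einsum("jmil,jmil->jm", np.einsum("ijkl,jmik->jmil", A, B), C)

# ((A C) B): the symmetric alternative, via a 'jmik' intermediate of the same size
D2 = np.einsum("jmik,jmik->jm", np.einsum("ijkl,jmil->jmik", A, C), B)

# (A (B C)): materializes a (8000, 6, 1166, 3, 3) 'jmikl' intermediate
# (~5e8 elements, ~8 GB in complex128) - almost certainly the worst order here
D3 = np.einsum("ijkl,jmikl->jm", A, np.einsum("jmik,jmil->jmikl", B, C))
```

Timing each of these against the single-shot np.einsum call should make the best option on a given machine clear.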
Closing as inactive.