Optimize algorithm for the best contraction permutation.
This PR optimizes the algorithm to choose the best contraction permutation.
Summary
- [x] Corrected error in computation of
cpu_cost. - [ ] Modify the algorithm to consider permutation cost.
- [ ] Avoid avoidable permutation of intermediate tensors in contraction.
- [ ] Avoid permuting
PA,PB,PCin contraction if they are already the outer most indices. - [ ] Avoid double permutation of the same tensor in the batched algorithm.
- [ ] Taking care of permutation symmetry.
Detail and Examples
Corrected error in computation of cpu_cost.
Example: C2["ijrs"] += B["gar"] * B["gbs"] * T2["ijab"]
Dimensions: i(4), j(8), r(60), s(8), g(280), a(60), b(60)
The two ways of contraction permutations are:
-
C2[ijrs]+={$[arbs]=B[gar]*B[gbs]}*T2[ijab]Original code will givecpu_cost = g(ar+bs) + ab(ij+rs)= 2,985,600Corrected code will givecpu_cost = garbs + abijrs= 539,136,000 FLOPs -
C2[ijrs]+={$[ijags]=T2[ijab]*B[gbs]}*B[gar]Original code will givecpu_cost = b(ija+gs) + ga(ijs+r)= 5,558,400Corrected code will givecpu_cost = bijags + gaijsr= 516,096,000 FLOPs
In this situation, the original code will choose contraction permutation 1, but the corrected code will choose 2.
Note here that in the original code, cpu_cost actually give the memory cost of all the contracted tensors. (eg. in 1, the total number of elements in $[arbs], B[gar], B[gbs], T2[ijab] is 2,985,600)
Avoid avoidable permutation of intermediate tensors in contraction.
Example: D["efgh"] = A["abdc"] * B["cdef"] * C["bagh"]
Assuming the optimizer decide to contract A and B first, then the result with C.
According to the current implementation, the intermediated tensor will be decided to be $[abef], and the contraction will be performed in two steps:
(1) $[abef] = A["abdc"] * B["cdef"]
(2) D["efgh"] = $[abef] * C["bagh"]
In case the size of A["abdc"] < B["cdef"], $[abef] < C["bagh"]:
In (1), A["abdc"] will be permuted to A["abcd"], then in (2), $[abef] will be permuted to $[baef].
Actually, the latter can be avoided if the intermediate tensor is decided to be $[baef]:
(1) $[baef] = A["abdc"] * B["cdef"]
(2) D["efgh"] = $[baef] * C["bagh"]
The computational cost of the permutation from A["abdc"] to A["abcd"] is the same as permutation to A["bacd"].
The computational cost of the intermediate permutation can be avoided by a wiser decision on the index order of the intermediate tensor.
Avoid permuting PA, PB, PC in contraction if they are already the outer most indices.
Example: C["abij"] = A["baik"] * B["abkj"]
Since a and b are Hadamard indices and are in different orders in C["abij"] and A["baik"], the current implementation will permute A["baik"] to A["abik"] first.
However, this permutation can be avoided by locating the correct ik batch in A when looping over Hadamard indices rather than explicitly permute them.
NOTE: Currently, the decision of permuting regarding P is after the decision of i,j,k, but this should be moved before the decision of i,j,k permutation.
Avoid double permutation of the same tensor in the batched algorithm.
Example: C["abcd"] = batched("a", A["eadf"] * B["efbc"])
In the current implementation, A["eadf"] will be first permute to A["aedf"]. Then in the contraction, each batch A["edf"] need to be permuted again to A["def"].
The latter can actually be avoided if in the first permutation A["eadf"] is permuted to A["adef"].
Status
- [ ] Ready to go.
Do you have any timings that support this? I'm just curious.
@jturney I have a little. I will definitely add some benchmarks in the PR. This PR is intended to be one investigating more ways to promote the performance. For example, some permutations may have the same CPU_cost and memory_cost by the current code, but they may experience different numbers of index permutations which is not taken into account in the current algorithm.