This PR optimizes the algorithm to choose the best contraction permutation.

Summary

[x] Corrected error in computation of cpu_cost.
[ ] Modify the algorithm to consider permutation cost.
[ ] Avoid avoidable permutation of intermediate tensors in contraction.
[ ] Avoid permuting PA, PB, PC in contraction if they are already the outer most indices.
[ ] Avoid double permutation of the same tensor in the batched algorithm.
[ ] Taking care of permutation symmetry.

Detail and Examples

Corrected error in computation of `cpu_cost`.

Example: C2["ijrs"] += B["gar"] * B["gbs"] * T2["ijab"] Dimensions: i(4), j(8), r(60), s(8), g(280), a(60), b(60) The two ways of contraction permutations are:

C2[ijrs]+={$[arbs]=B[gar]*B[gbs]}*T2[ijab] Original code will give cpu_cost = g(ar+bs) + ab(ij+rs)= 2,985,600 Corrected code will give cpu_cost = garbs + abijrs= 539,136,000 FLOPs
C2[ijrs]+={$[ijags]=T2[ijab]*B[gbs]}*B[gar] Original code will give cpu_cost = b(ija+gs) + ga(ijs+r)= 5,558,400 Corrected code will give cpu_cost = bijags + gaijsr= 516,096,000 FLOPs

In this situation, the original code will choose contraction permutation 1, but the corrected code will choose 2. Note here that in the original code, cpu_cost actually give the memory cost of all the contracted tensors. (eg. in 1, the total number of elements in $[arbs], B[gar], B[gbs], T2[ijab] is 2,985,600)

Avoid avoidable permutation of intermediate tensors in contraction.

Example: D["efgh"] = A["abdc"] * B["cdef"] * C["bagh"] Assuming the optimizer decide to contract A and B first, then the result with C. According to the current implementation, the intermediated tensor will be decided to be $[abef], and the contraction will be performed in two steps: (1) $[abef] = A["abdc"] * B["cdef"] (2) D["efgh"] = $[abef] * C["bagh"] In case the size of A["abdc"] < B["cdef"], $[abef] < C["bagh"]: In (1), A["abdc"] will be permuted to A["abcd"], then in (2), $[abef] will be permuted to $[baef]. Actually, the latter can be avoided if the intermediate tensor is decided to be $[baef]: (1) $[baef] = A["abdc"] * B["cdef"] (2) D["efgh"] = $[baef] * C["bagh"] The computational cost of the permutation from A["abdc"] to A["abcd"] is the same as permutation to A["bacd"]. The computational cost of the intermediate permutation can be avoided by a wiser decision on the index order of the intermediate tensor.

Avoid permuting `PA`, `PB`, `PC` in contraction if they are already the outer most indices.

Example: C["abij"] = A["baik"] * B["abkj"] Since a and b are Hadamard indices and are in different orders in C["abij"] and A["baik"], the current implementation will permute A["baik"] to A["abik"] first. However, this permutation can be avoided by locating the correct ik batch in A when looping over Hadamard indices rather than explicitly permute them.

NOTE: Currently, the decision of permuting regarding P is after the decision of i,j,k, but this should be moved before the decision of i,j,k permutation.

Avoid double permutation of the same tensor in the batched algorithm.

Example: C["abcd"] = batched("a", A["eadf"] * B["efbc"]) In the current implementation, A["eadf"] will be first permute to A["aedf"]. Then in the contraction, each batch A["edf"] need to be permuted again to A["def"]. The latter can actually be avoided if in the first permutation A["eadf"] is permuted to A["adef"].

Status

[ ] Ready to go.

Sep 10 '18 20:09 tyzhang1993

Do you have any timings that support this? I'm just curious.

Sep 11 '18 14:09 jturney

@jturney I have a little. I will definitely add some benchmarks in the PR. This PR is intended to be one investigating more ways to promote the performance. For example, some permutations may have the same CPU_cost and memory_cost by the current code, but they may experience different numbers of index permutations which is not taken into account in the current algorithm.

Sep 11 '18 14:09 tyzhang1993

Optimize algorithm for the best contraction permutation.

Summary

Detail and Examples

Corrected error in computation of cpu_cost.

Avoid avoidable permutation of intermediate tensors in contraction.

Avoid permuting PA, PB, PC in contraction if they are already the outer most indices.

Avoid double permutation of the same tensor in the batched algorithm.

Status

Corrected error in computation of `cpu_cost`.

Avoid permuting `PA`, `PB`, `PC` in contraction if they are already the outer most indices.