
Performance difference between essentially the same einsum with dummy index interchanging

Open refraction-ray opened this issue 5 years ago • 3 comments

Hi there.

A related issue discussing this observation is https://github.com/google/TensorNetwork/issues/650.

In simple code,

import timeit

import numpy as np
import opt_einsum as oe

D = 800
D1 = 12
D2 = 10
a = np.random.rand(D1, D, D)
b = np.random.rand(D, D, D2)
print(timeit.timeit(lambda: np.einsum("ijk, jkl->il", a, b, optimize=True), number=100)) # 2.5s
print(timeit.timeit(lambda: np.einsum("ikj, kjl->il", a, b, optimize=True), number=100)) # 19s
print(timeit.timeit(lambda: oe.contract("ijk, jkl->il", a, b), number=100)) # 19s
print(timeit.timeit(lambda: oe.contract("ikj, kjl->il", a, b), number=100)) # 2.5s

This huge difference is possibly due to np.tensordot.

print(timeit.timeit(lambda: np.tensordot(a,b, [[1,2], [0,1]]), number=100)) #takes ~ 2.7 seconds
print(timeit.timeit(lambda: np.tensordot(a,b, [[2,1], [1,0]]), number=100)) #takes ~ 8.5 seconds
np.allclose(np.tensordot(a,b, [[1,2], [0,1]]),np.tensordot(a,b, [[2,1], [1,0]]) ) # True

The slower one likely incurs extra transpose operations on top of the matrix multiplication.
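To make the suspected transpose cost concrete, here is a sketch (with smaller dimensions than the benchmark above, purely for illustration) showing that both axis orders reduce to the same GEMM, but the second order first requires transposed copies of both operands, which is roughly what np.tensordot does internally:

```python
import numpy as np

D, D1, D2 = 50, 12, 10  # smaller sizes than the benchmark, for a quick check
a = np.random.rand(D1, D, D)
b = np.random.rand(D, D, D2)

# Fast case: the contracted axes are already the trailing axes of `a` and the
# leading axes of `b`, so only reshapes (views) plus one dot are needed.
fast = a.reshape(D1, D * D) @ b.reshape(D * D, D2)

# Slow case: axes [[2, 1], [1, 0]] list a's contracted axes in reversed order,
# so both operands must be transposed (copied) before the dot.
slow_equiv = (a.transpose(0, 2, 1).reshape(D1, D * D)
              @ b.transpose(1, 0, 2).reshape(D * D, D2))

assert np.allclose(fast, np.tensordot(a, b, [[1, 2], [0, 1]]))
assert np.allclose(slow_equiv, np.tensordot(a, b, [[2, 1], [1, 0]]))
assert np.allclose(fast, slow_equiv)
```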

Any thoughts on this? And is there a consistent ordering of the axes arguments that makes tensordot fast?

refraction-ray avatar Jun 09 '20 03:06 refraction-ray

I would have assumed that tensordot is fastest when the contracted axes are the contiguous last dimensions of the first array and the contiguous first dimensions of the second array, so that matrix multiplication can be used directly (though I'm slightly guessing here).

If you use snakeviz you can see that it is indeed a reshape within tensordot that is costly in the first one: [snakeviz profile screenshot: oe-1]

The other lowest block is dot, and the time spent on that alone is similar to the second contraction: [snakeviz profile screenshot: oe-2]

I guess the question for opt_einsum is whether there is a better default way to sort the axes supplied to tensordot at that point in the code.

jcmgray avatar Jun 09 '20 05:06 jcmgray

Just to be clear, one is obviously free to apply the same permutation to both left_pos and right_pos (a.k.a. axes1 and axes2).

One canonical choice would be

left_pos, right_pos = zip(*sorted(zip(left_pos, right_pos)))

which should make the two contractions above identical, but is it always the fastest ordering?
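As a quick check of that one-liner on the axes from the timings above (the helper name `canonicalize` is just for illustration, not an opt_einsum function), sorting the pairs by the left operand's axis index maps both specifications to the same canonical form:

```python
import numpy as np

def canonicalize(left_pos, right_pos):
    """Sort the contracted-axis pairs by the left operand's axis index,
    keeping each (left, right) pairing intact."""
    left_pos, right_pos = zip(*sorted(zip(left_pos, right_pos)))
    return list(left_pos), list(right_pos)

# The two equivalent axis specifications from the timings above:
print(canonicalize([1, 2], [0, 1]))  # ([1, 2], [0, 1]) -- already canonical
print(canonicalize([2, 1], [1, 0]))  # ([1, 2], [0, 1]) -- slow order normalized

# Both now dispatch to the same (fast) tensordot call:
D, D1, D2 = 50, 12, 10
a = np.random.rand(D1, D, D)
b = np.random.rand(D, D, D2)
assert np.allclose(
    np.tensordot(a, b, canonicalize([2, 1], [1, 0])),
    np.tensordot(a, b, [[1, 2], [0, 1]]),
)
```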

jcmgray avatar Jun 09 '20 05:06 jcmgray

We discuss this a bit in #103. We can optimize the indices for a single contraction by assuming it is in C order, but doing so may cause issues down the line. For a binary contraction like this example, however, it doesn't matter.

Another way to solve this is a custom tensor_blas function; I don't recall why we apparently deprecated it in favor of pure np.tensordot. I have also thought about pushing this into NumPy's tensordot implementation, but getting these things right for every edge case is a bit finicky.
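For reference, a minimal sketch of what a tensor_blas-style helper could look like (my own illustration, not the actual deprecated opt_einsum routine): canonicalize the contracted axes, move them to the trailing positions of the first operand and the leading positions of the second, and do the contraction as a single reshape + GEMM.

```python
import numpy as np

def tensor_blas(a, b, axes_a, axes_b):
    """Illustrative sketch: contract `a` and `b` over the given axes by
    moving the contracted axes to the end of `a` / front of `b`,
    reshaping to matrices, and calling a single GEMM via np.dot."""
    # Free (uncontracted) axes of each operand, in their original order.
    free_a = [ax for ax in range(a.ndim) if ax not in axes_a]
    free_b = [ax for ax in range(b.ndim) if ax not in axes_b]

    # Sort the contracted pairs so both permutations stay consistent.
    axes_a, axes_b = map(list, zip(*sorted(zip(axes_a, axes_b))))

    # Move contracted axes to the trailing positions of `a` and the leading
    # positions of `b`; if they are already there, these transposes are
    # no-op views and the subsequent reshapes make no copy.
    at = np.transpose(a, free_a + axes_a)
    bt = np.transpose(b, axes_b + free_b)

    K = int(np.prod([a.shape[ax] for ax in axes_a]))
    out_shape = [a.shape[ax] for ax in free_a] + [b.shape[ax] for ax in free_b]
    return np.dot(at.reshape(-1, K), bt.reshape(K, -1)).reshape(out_shape)
```

With the axes from the example above, tensor_blas(a, b, [2, 1], [1, 0]) first normalizes the axes to [1, 2] / [0, 1], so the transposes become views and only one GEMM is executed.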

dgasmith avatar Jun 09 '20 12:06 dgasmith