Contiguity loss after opt_einsum that may affect the efficiency of subsequent operations
After a tensor contraction, the resulting tensor from numpy.einsum is C-contiguous, but with opt_einsum this property may be lost. If a further tensor addition follows, the non-contiguous array can make it slower. Is there any way to make the resulting tensor from opt_einsum C-contiguous? If not, is there an option to specify that the resulting tensor be C-contiguous, while letting the path search take this constraint into account?
For example, the following code
import numpy as np
import opt_einsum as oe
import time

na = 5
nb = 5
nc = 8
nd = 8
ne = 2
nf = 1
ng = 8
nh = 8
ni = 8
nj = 8
dim_a = (na, nb, nc, nd, ne, nf)
dim_b = (ne, nf, ng, nh, ni, nj)
dim_c = (na, nb, nc, nd, ng, nh, ni, nj)
n_times = 20
contraction = 'abcdef,efghij -> abcdghij'

a = np.random.random(dim_a)
b = np.random.random(dim_b)
c = np.einsum(contraction, a, b)
d = oe.contract(contraction, a, b)
sum_array = np.zeros(dim_c)
print(a.flags)
print(b.flags)
print(c.flags)
print(d.flags)

print('list')
list_a = []
list_b = []
list_c = []
for n in range(n_times):
    a = np.random.random(dim_a)
    b = np.random.random(dim_b)
    list_a.append(a)
    list_b.append(b)

print('np')
start = time.time()
for n in range(n_times):
    list_c.append(np.einsum(contraction, list_a[n], list_b[n]))
print(time.time() - start)

sum_array = np.zeros(dim_c)
start = time.time()
for n in range(n_times):
    np.add(sum_array, list_c[n], out=sum_array)
print(time.time() - start)

print('oe')
opt_path = oe.contract_expression(contraction, a.shape, b.shape, optimize="optimal")
list_a = []
list_b = []
list_c = []
for n in range(n_times):
    a = np.random.random(dim_a)
    b = np.random.random(dim_b)
    list_a.append(a)
    list_b.append(b)

start = time.time()
for n in range(n_times):
    list_c.append(opt_path(list_a[n], list_b[n]))
print(time.time() - start)

#start = time.time()
#for n in range(n_times):
#    list_c[n] = np.ascontiguousarray(list_c[n])
#print(time.time() - start)

sum_array = np.zeros(dim_c)
start = time.time()
for n in range(n_times):
    # list_c[n] = np.ascontiguousarray(list_c[n])
    np.add(sum_array, list_c[n], out=sum_array)
print(time.time() - start)
The result is
C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : True
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : True
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : True
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
C_CONTIGUOUS : False
F_CONTIGUOUS : False
OWNDATA : False
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
list
np
1.9423425197601318
0.32601356506347656
oe
1.6317214965820312
1.2006006240844727
Here the np.add part is much slower with opt_einsum than with np.einsum. If I use np.ascontiguousarray to make the resulting array C-contiguous, that step is itself slow.
My initial guess is that opt_einsum is deciding to use BLAS rather than einsum for this contraction. This choice is only marginally better given that your tensors are relatively small.
You can call contract(..., order="C") to force the resulting tensor to be C-contiguous; however, note that this will be slightly slower, as the final data will need to be copied into this layout since we do not optimize intermediate orders.
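As an aside, one way to check the BLAS guess is a sketch like the following, using opt_einsum's documented contract_path API (shapes simply mirror the benchmark above); the printed path info includes a BLAS column showing whether each pairwise step dispatches to GEMM/TDOT or falls back to einsum:
import numpy as np
import opt_einsum as oe
a = np.random.random((5, 5, 8, 8, 2, 1))
b = np.random.random((2, 1, 8, 8, 8, 8))
# PathInfo's printout lists, per pairwise contraction, whether BLAS is used.
path, info = oe.contract_path('abcdef,efghij -> abcdghij', a, b)
print(info)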
Thanks for the suggestion. Modifying the code above as
d = oe.contract(contraction, a, b, order="C")
then
print(d.flags)
still leads to
C_CONTIGUOUS : False
F_CONTIGUOUS : False
OWNDATA : False
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
Revising as
opt_path = oe.contract_expression(contraction, a.shape, b.shape, optimize = "optimal", order="C")
leads to
list
np
1.5694563388824463
0.24045681953430176
oe
1.5942034721374512
1.1450412273406982
or
list_c.append( oe.contract(contraction, list_a[n], list_b[n], order = "C") )
is also similar.
I don't know how to check the version of opt_einsum, but conda install -c conda-forge opt_einsum gives
The following packages will be SUPERSEDED by a higher-priority channel:
opt_einsum intel::opt_einsum-3.3.0-pyhd3eb1b0_1 --> conda-forge::opt_einsum-3.3.0-pyhd8ed1ab_1
which suggests I am using 3.3.0.
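(For what it's worth, the installed version can also be checked directly from Python, since opt_einsum exposes the standard __version__ attribute:)
import opt_einsum
print(opt_einsum.__version__)  # e.g. 3.3.0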
I need to confirm, but this appears to be a bug.
I searched the source code of opt_einsum and did not find many references to order, e.g., https://github.com/dgasmith/opt_einsum/blob/6a30cf5b2852e54824e188cb6b294890c5cf1d9c/opt_einsum/tests/test_edge_cases.py, line 51, and
https://github.com/dgasmith/opt_einsum/blob/6a30cf5b2852e54824e188cb6b294890c5cf1d9c/opt_einsum/blas.py, lines 199 and 204
We should call np.asanyarray near the end of the computation, which we currently do not do. Officially bumping this to bug status.
Do you recall this being dropped, @jcmgray?
Currently, order is just passed (as part of einsum_kwargs) to any contractions that themselves use einsum. Changing the path optimizer can therefore change whether the last contraction uses einsum and thus obeys order; otherwise, yes, it will often be ignored, which is a bug...
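(A minimal sketch of that pass-through behavior, assuming the documented use_blas flag: with BLAS dispatch disabled, every pairwise step goes through np.einsum, so order reaches the contraction and is honored, at the cost of speed:)
import numpy as np
import opt_einsum as oe
a = np.random.random((5, 5, 8, 8, 2, 1))
b = np.random.random((2, 1, 8, 8, 8, 8))
# With use_blas=False the contraction runs through np.einsum, which
# receives order='C' via einsum_kwargs.
d = oe.contract('abcdef,efghij -> abcdghij', a, b, use_blas=False, order='C')
print(d.flags['C_CONTIGUOUS'])  # expected: True on the einsum path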
In my opinion, it would be best to deprecate the order kwarg fully. As I think @dgasmith mentioned, the speedups all come from dispatching to pairwise contractions, which, when performed using essentially matrix multiplication, impose a certain ordering on the indices. This means only a single transpose is called at the end (which for numpy is just stride jiggling and not an actual operation - thus generally producing non-contiguous results).
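(A minimal numpy illustration of that last point: a transpose only permutes strides, so it is essentially free, but the resulting view is no longer C-contiguous:)
import numpy as np
x = np.ones((4, 5))             # C-contiguous
y = x.T                         # just stride jiggling; no data is moved
print(np.shares_memory(x, y))   # True
print(y.flags['C_CONTIGUOUS'])  # False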
There's no obvious and efficient way to build ordering in that doesn't rely on details of the underlying pairwise contraction (e.g. cutensor, as mentioned in #209), and opt_einsum tries to stay agnostic of that. One could just call ascontiguousarray or torch_tensor.contiguous(), but that would be better done explicitly by the user in the rare cases it's needed.
One could also force the last contraction to be np.einsum, but that would often be much slower. Note that there are libraries such as hptt for fast transposing (i.e. making arrays contiguous).
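(A sketch of the hptt route, using hptt.ascontiguousarray, the same helper the modified benchmark further down relies on; hptt performs the transpose copy with optimized multi-threaded kernels:)
import numpy as np
import hptt
x = np.random.random((8, 8, 8)).transpose(2, 0, 1)  # non-contiguous view
y = hptt.ascontiguousarray(x)                       # fast copy into C order
print(y.flags['C_CONTIGUOUS'])                      # True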
Side note: can one search over or control how the indices are sorted?
Sorting how indices appear can be an important micro-optimization, both for via-BLAS and in fact for direct (einsum/cutensor) contraction implementations. You can modify and visualize it a bit with cotengra:
By default (as with opt_einsum), all intermediates are contiguous (blue then green indices in the cotengra visualization), but a final permutation is needed.
One can try to make the final contraction 'as contiguous as possible', but even then it's not always possible.
Thanks for the insights and suggestions. I am not sure I understand your remarks fully, especially about the order kwarg :(
I think
(i) it would be great if opt_einsum had a kind of order='C' flag when using einsum as the backend. (In the adefmopr,efgk,bijmnoq,ghjr,klnpq,chil -> cabd example, I tried numpy.einsum, and it still produces a C-contiguous result with or without order='C'; see the sketch after point (ii).)
(ii) np.ascontiguousarray or hptt.ascontiguousarray may still be slower than (i).
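(A quick sketch to check point (i); the index string comes from above, while the tiny dimension sizes are made up purely for illustration:)
import numpy as np
inputs = ['adefmopr', 'efgk', 'bijmnoq', 'ghjr', 'klnpq', 'chil']
dims = {ix: 2 for term in inputs for ix in term}
ops = [np.random.random([dims[ix] for ix in term]) for term in inputs]
out = np.einsum(','.join(inputs) + '->cabd', *ops, optimize=True)
print(out.flags['C_CONTIGUOUS'])  # reported True here, with or without order='C'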
If I modify the example code to include ascontiguousarray as follows (sorry for the length)
import numpy as np
import opt_einsum as oe
import time
import hptt

na = 5
nb = 5
nc = 8
nd = 8
ne = 2
nf = 1
ng = 8
nh = 8
ni = 8
nj = 8
dim_a = (na, nb, nc, nd, ne, nf)
dim_b = (ne, nf, ng, nh, ni, nj)
dim_c = (na, nb, nc, nd, ng, nh, ni, nj)
n_times = 20
contraction = 'abcdef,efghij -> abcdghij'

a = np.random.random(dim_a)
b = np.random.random(dim_b)
c = np.einsum(contraction, a, b)
d = oe.contract(contraction, a, b)
sum_array = np.zeros(dim_c)
print(a.flags)
print(b.flags)
print(c.flags)
print(d.flags)

print('list')
list_a = []
list_b = []
list_c = []
for n in range(n_times):
    a = np.random.random(dim_a)
    b = np.random.random(dim_b)
    list_a.append(a)
    list_b.append(b)

print('np')
start = time.time()
for n in range(n_times):
    list_c.append(np.einsum(contraction, list_a[n], list_b[n]))
print(time.time() - start)

sum_array = np.zeros(dim_c)
start = time.time()
for n in range(n_times):
    np.add(sum_array, list_c[n], out=sum_array)
print(time.time() - start)

print('oe')
opt_path = oe.contract_expression(contraction, a.shape, b.shape, optimize="optimal")
list_a = []
list_b = []
list_c = []
for n in range(n_times):
    a = np.random.random(dim_a)
    b = np.random.random(dim_b)
    list_a.append(a)
    list_b.append(b)

start = time.time()
for n in range(n_times):
    list_c.append(opt_path(list_a[n], list_b[n]))
print(time.time() - start)

sum_array = np.zeros(dim_c)
start = time.time()
for n in range(n_times):
    #list_c[n] = np.ascontiguousarray(list_c[n])
    list_c[n] = hptt.ascontiguousarray(list_c[n])
    np.add(sum_array, list_c[n], out=sum_array)
print(time.time() - start)
the result is
list
np
1.2805311679840088
0.23432183265686035
oe
1.4543828964233398
1.785987138748169
If I use list_c[n] = np.ascontiguousarray(list_c[n]) instead, I get (so hptt does help to some extent)
oe
1.5851459503173828
3.122382640838623
I think generally the optimizations opt_einsum performs relate to having 3+ terms to contract (and to a lesser extent, using GEMM where possible), where any final transpose call to make something contiguous is a relatively small cost.
In the cases here with 2 terms and thus a single pairwise contraction, I don't think there is any general strategy to guarantee the order efficiently. In many contractions, doing a GEMM + transpose will be better; in this case, doing einsum directly is quicker.
But at this low level (a single pairwise contraction, or as another example, very small tensors) the responsibility for this kind of optimization is more on the user, IMHO.
Having said that, if you have some longer contraction expressions where it still makes a big difference, that might be interesting.
Also, I don't know if this holds for the actual contractions you are interested in, but if you are summing over einsum expressions, then the best thing might be to incorporate a batch dimension into the contraction like so:
contraction = 'Xabcdef,Xefghij -> abcdghij'
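(A sketch of what that looks like with the lists from the benchmark above, assuming all pairs share the same shapes: np.stack adds the batch axis X, and the single contraction then replaces both the per-pair contraction loop and the np.add accumulation, since X is summed over:)
import numpy as np
import opt_einsum as oe
stacked_a = np.stack(list_a)  # shape (n_times,) + dim_a
stacked_b = np.stack(list_b)  # shape (n_times,) + dim_b
# X appears in the inputs but not the output, so it is summed over,
# folding the accumulation into the contraction itself.
sum_array = oe.contract('Xabcdef,Xefghij -> abcdghij', stacked_a, stacked_b)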