complex number support and basic methods
Sorry in advance if this is a dumb question. I am wondering whether Taichi could support complex numbers as a data type, along with the relevant basic arithmetic and methods (like conjugate).
I'm trying to compute a partial Jacobian matrix using Taichi (I already have a parallel Numba version to compare against), but I cannot find any related example for complex vector operations.
Hi @mzy2240, this would be a useful feature :-) For now you can take a look at https://github.com/taichi-dev/taichi_glsl/blob/master/taichi_glsl/classes.py for how a custom Complex class can be implemented. Ideally, we'd like to support this in our C++ type system (CC @strongoier)
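(Not from the thread, just for reference: a minimal sketch of the same idea as taichi_glsl's Complex class, representing z = re + i*im as a 2-component vector. The names cmul/cconj/demo are made up for illustration.)

import taichi as ti

ti.init(arch=ti.cpu)

@ti.func
def cmul(a, b):
    # (a[0] + i*a[1]) * (b[0] + i*b[1])
    return ti.Vector([a[0] * b[0] - a[1] * b[1],
                      a[0] * b[1] + a[1] * b[0]])

@ti.func
def cconj(a):
    # complex conjugate: negate the imaginary component
    return ti.Vector([a[0], -a[1]])

z = ti.Vector.field(2, dtype=ti.f64, shape=4)

@ti.kernel
def demo():
    w = ti.Vector([0.0, 1.0])  # the imaginary unit i
    for k in z:
        z[k] = cmul(cconj(z[k]), w)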
One update on this issue, also partially related to https://github.com/taichi-dev/taichi/issues/1206. I did a quick, but probably unfair, comparison between Numba and Taichi on a partial Jacobian matrix generation process (I'm a power system researcher, and the Jacobian matrix is a big thing in steady-state power system analysis). You may say [ref]:

"In taichi you write the loops and operate on individual elements of a matrix"

and yes, you could absolutely write it in the form of matrix multiplication, like below:
from numpy import conj, complex128, float64, int32
from numpy import zeros, empty, array
from scipy.sparse import issparse, csr_matrix as sparse
from numba import jit

def dSbus_dV(Ybus, V):
    Ibus = Ybus * V  # Ybus is a 2000x2000 sparse matrix and V is a 1-D array
    ib = range(len(V))
    diagV = sparse((V, (ib, ib)))
    diagIbus = sparse((Ibus, (ib, ib)))
    diagVnorm = sparse((V / abs(V), (ib, ib)))
    dS_dVm = diagV * conj(Ybus * diagVnorm) + conj(diagIbus) * diagVnorm
    dS_dVa = 1j * diagV * conj(diagIbus - Ybus * diagV)
    return dS_dVm, dS_dVa
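(A hypothetical usage sketch of the routine above, just to fix the shapes; Ybus_small and V_small are made-up stand-ins for the real 2000x2000 complex Ybus and its voltage vector.)

from numpy import eye, ones
Ybus_small = sparse(eye(4) * (2.0 - 1.0j))      # tiny complex CSR matrix
V_small = ones(4, dtype=complex128)             # flat-start voltages
dS_dVm, dS_dVa = dSbus_dV(Ybus_small, V_small)  # two 4x4 sparse matrices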
But the reality turns out to be counter-intuitive: using Numba for the element-wise operations is about five times faster than the matrix-level multiplication with NumPy and SciPy (the Numba version finishes 1000 executions in less than 0.65 seconds on an Intel i9-9880H). Here is the Numba code I am using:
@jit(nopython=True, cache=False)
def dSbus_dV_numba_sparse(Yx, Yp, Yj, V, Vnorm, Ibus):
    # init buffer vector
    buffer = zeros(len(V), dtype=complex128)
    dS_dVm = Yx.copy()
    dS_dVa = Yx.copy()
    # iterate through sparse matrix
    for r in range(len(Yp) - 1):
        for k in range(Yp[r], Yp[r + 1]):
            buffer[r] += Yx[k] * V[Yj[k]]
            dS_dVm[k] *= Vnorm[Yj[k]]
            dS_dVa[k] *= V[Yj[k]]
        Ibus[r] += buffer[r]
        buffer[r] = conj(buffer[r]) * Vnorm[r]
    for r in range(len(Yp) - 1):
        for k in range(Yp[r], Yp[r + 1]):
            dS_dVm[k] = conj(dS_dVm[k]) * V[r]
            if r == Yj[k]:
                # diagonal elements
                dS_dVa[k] = -Ibus[r] + dS_dVa[k]
                dS_dVm[k] += buffer[r]
            dS_dVa[k] = conj(-dS_dVa[k]) * (1j * V[r])
    return dS_dVm, dS_dVa

def dSbus_dV_numba(Ybus, V, I=None):
    if issparse(Ybus):
        I = zeros(len(V), dtype=complex128) if I is None else -I
        dS_dVm, dS_dVa = dSbus_dV_numba_sparse(Ybus.data, Ybus.indptr, Ybus.indices, V, V / abs(V), I)
        return sparse((dS_dVm, Ybus.indices, Ybus.indptr)), sparse((dS_dVa, Ybus.indices, Ybus.indptr))
    else:
        I = zeros(len(V), dtype=complex128) if I is None else I
        return dSbus_dV_dense(Ybus, V, I)
Since Numba does element-wise operations on the matrix, I was thinking I could give Taichi a try as well, and here is my script (this is literally my second Taichi program, so forgive me if it is not optimized):
ti.init(arch=ti.cpu, packed=True, debug=False)

@ti.func
def first_inner_loop(r, Yp, Yx, V, Yj, Vnorm, buffer, dS_dVm, dS_dVa):
    for k in range(Yp[r], Yp[r + 1]):
        buffer[r] += Yx[k] * V[Yj[k]]
        dS_dVm[k] *= Vnorm[Yj[k]]
        dS_dVa[k] *= V[Yj[k]]

@ti.func
def second_inner_loop(r, Yp, Yx, V, Yj, Vnorm, buffer, dS_dVm, dS_dVa, Ibus):
    for k in range(Yp[r], Yp[r + 1]):
        dS_dVm[k] *= V[r]
        if r == Yj[k]:
            dS_dVa[k] -= Ibus[r]
            dS_dVm[k] += buffer[r]
        dS_dVa[k] = -dS_dVa[k] * V[r]

@ti.kernel
def first_loop(Yp: ti.template(), Yx: ti.template(), V: ti.template(), Yj: ti.template(), Vnorm: ti.template(), buffer: ti.template(), dS_dVm: ti.template(), dS_dVa: ti.template(), Ibus: ti.template()):
    for r in range(Yp.shape[0] - 1):
        first_inner_loop(r, Yp, Yx, V, Yj, Vnorm, buffer, dS_dVm, dS_dVa)
        Ibus[r] += buffer[r]
        buffer[r] *= Vnorm[r]

@ti.kernel
def second_loop(Yp: ti.template(), Yx: ti.template(), V: ti.template(), Yj: ti.template(), Vnorm: ti.template(), buffer: ti.template(), dS_dVm: ti.template(), dS_dVa: ti.template(), Ibus: ti.template()):
    for r in range(Yp.shape[0] - 1):
        second_inner_loop(r, Yp, Yx, V, Yj, Vnorm, buffer, dS_dVm, dS_dVa, Ibus)

def dSbus_dV_taichi(Ybus, V):
    I = ti.field(dtype=float, shape=len(V))
    data = Ybus.data.astype(float64)
    Yp = ti.field(dtype=int, shape=len(Ybus.indptr))
    Yp.from_numpy(Ybus.indptr)
    Yj = ti.field(dtype=int, shape=len(Ybus.indices))
    Yj.from_numpy(Ybus.indices)
    Vt = ti.field(dtype=float, shape=len(V))
    Vt.from_numpy(V)
    Vnorm = ti.field(dtype=float, shape=len(V))
    Vnorm.from_numpy(V / abs(V))
    buffer = ti.field(dtype=float, shape=len(V))
    dS_dVm = ti.field(dtype=float, shape=len(data))
    dS_dVa = ti.field(dtype=float, shape=len(data))
    Yx = ti.field(dtype=float, shape=len(data))
    dS_dVm.from_numpy(data)
    dS_dVa.from_numpy(data)
    Yx.from_numpy(data)
    first_loop(Yp, Yx, Vt, Yj, Vnorm, buffer, dS_dVm, dS_dVa, I)
    second_loop(Yp, Yx, Vt, Yj, Vnorm, buffer, dS_dVm, dS_dVa, I)
    return sparse((dS_dVm.to_numpy(), Ybus.indices, Ybus.indptr)), sparse((dS_dVa.to_numpy(), Ybus.indices, Ybus.indptr))
Surprisingly, the development process went smoothly; however, the performance is not good by any means. The Taichi code is about 1000 times slower than Numba (I removed the imaginary part, so this should be in Taichi's favor, I guess). I am not saying Taichi is really 1000 times slower than Numba or NumPy, and even though I attended all the Taichi classes about advanced data layout, sparse matrices, and performance optimization, I still have no idea how to optimize the code for this use case, especially how to efficiently operate on each element exactly the way the Numba code does. So my questions are: do you think this is a fair comparison (in this specific scope)? And how should I optimize the Taichi script to achieve better results than Numba?
Thanks and sorry if it is not relevant to the issue.
Update: I forgot to mention a few "optimizations" I did for the Taichi code:
1. TI_CACHE_RUNTIME_BITCODE is enabled.
2. Padding is disabled (I did not see a reason why padding would be needed in this scope).
3. Only one iteration runs in a single kernel/function.
All the tests and profiling were carried out in Python 3.8 with the latest Taichi 0.8.6.
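(For reference, a sketch of how I understand the first two settings are applied; the environment variable has to be set before taichi is imported.)

import os
os.environ['TI_CACHE_RUNTIME_BITCODE'] = '1'  # (1) cache the runtime bitcode

import taichi as ti
ti.init(arch=ti.cpu, packed=True)             # (2) packed mode disables padding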
What's the typical shape of Ybus and V?
Hi, would you mind updating the full code so that we can also run some tests locally?
Or, a sample data set or a data generator for the Ybus and V values would also be good.
Here is the zipped npz file for the Ybus matrix; you can load it with scipy.sparse.load_npz after decompression: Ybus.zip. The V array can be generated like this:
V = np.ones(2000)
Ybus is a 2000x2000 sparse matrix and V is a 1-D array of length 2000. 2000 buses is definitely not a small size for a power system, but it is also not too large to be practical, considering that the largest case we have exceeds 100,000 buses (which means Ybus is 100,000 x 100,000). The sparsity is about 0.5%.
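(For readers unfamiliar with CSR storage, a small illustrative sketch of the three arrays the kernels in this thread iterate over; the names mirror the code: Yx is scipy's data, Yp is indptr, Yj is indices. The random matrix here is just a float stand-in for the complex Ybus.)

import numpy as np
from scipy import sparse

# random stand-in with roughly the same shape/sparsity as Ybus
A = sparse.random(2000, 2000, density=0.005, format='csr', dtype=np.float64)
Yx, Yp, Yj = A.data, A.indptr, A.indices

for r in range(A.shape[0]):            # row r of the matrix
    for k in range(Yp[r], Yp[r + 1]):  # stored entries of row r
        col, val = Yj[k], Yx[k]        # A[r, col] == val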
In case you guys want to test, I just updated the original post with all the imports I am using in the code.
Hi @mzy2240, there are a few findings I can share regarding the 1000x performance diff ;/ I believe you were comparing the performance of an after-compilation Numba version against a with-compilation Taichi version. Here are the details.
First, the Numba code I measured (with some small modifications to the code above, just to make it executable):
from numba import jit
import numpy as np
import time
from scipy import sparse

@jit(nopython=True, cache=False)
def dSbus_dV_numba_sparse(Yx, Yp, Yj, V, Vnorm, Ibus):
    # init buffer vector
    buffer = np.zeros(len(V), dtype=np.complex128)
    dS_dVm = Yx.copy()
    dS_dVa = Yx.copy()
    # iterate through sparse matrix
    for r in range(len(Yp) - 1):
        for k in range(Yp[r], Yp[r + 1]):
            buffer[r] += Yx[k] * V[Yj[k]]
            dS_dVm[k] *= Vnorm[Yj[k]]
            dS_dVa[k] *= V[Yj[k]]
        Ibus[r] += buffer[r]
        buffer[r] = np.conj(buffer[r]) * Vnorm[r]
    for r in range(len(Yp) - 1):
        for k in range(Yp[r], Yp[r + 1]):
            dS_dVm[k] = np.conj(dS_dVm[k]) * V[r]
            if r == Yj[k]:
                # diagonal elements
                dS_dVa[k] = -Ibus[r] + dS_dVa[k]
                dS_dVm[k] += buffer[r]
            dS_dVa[k] = np.conj(-dS_dVa[k]) * (1j * V[r])
    return dS_dVm, dS_dVa

def dSbus_dV_numba(Ybus, V, I=None):
    if sparse.issparse(Ybus):
        I = np.zeros(len(V), dtype=np.complex128) if I is None else -I
        dS_dVm, dS_dVa = dSbus_dV_numba_sparse(Ybus.data, Ybus.indptr, Ybus.indices, V, V / abs(V), I)
        return sparse.csr_matrix((dS_dVm, Ybus.indices, Ybus.indptr)), sparse.csr_matrix((dS_dVa, Ybus.indices, Ybus.indptr))
    else:
        I = np.zeros(len(V), dtype=np.complex128) if I is None else I
        return dSbus_dV_dense(Ybus, V, I)

Ybus = sparse.load_npz('Ybus.npz')
V = np.ones(2000)

start = time.time()
dSbus_dV_numba(Ybus, V)
end = time.time()
print("Elapsed (with compilation) = %s" % (end - start))

start = time.time()
dSbus_dV_numba(Ybus, V)
end = time.time()
print("Elapsed (after compilation) = %s" % (end - start))
The timing approach follows "How to measure the performance of Numba?". On my local PC with an i9-11900K, this code gives:
Elapsed (with compilation) = 0.3006782531738281
Elapsed (after compilation) = 0.0001761913299560547
Now the Taichi code you posted:
import taichi as ti
import numpy as np
import time
from scipy import sparse

ti.init(arch=ti.cpu, debug=False)

@ti.func
def first_inner_loop(r, Yp, Yx, V, Yj, Vnorm, buffer, dS_dVm, dS_dVa):
    for k in range(Yp[r], Yp[r + 1]):
        buffer[r] += Yx[k] * V[Yj[k]]
        dS_dVm[k] *= Vnorm[Yj[k]]
        dS_dVa[k] *= V[Yj[k]]

@ti.func
def second_inner_loop(r, Yp, Yx, V, Yj, Vnorm, buffer, dS_dVm, dS_dVa, Ibus):
    for k in range(Yp[r], Yp[r + 1]):
        dS_dVm[k] *= V[r]
        if r == Yj[k]:
            dS_dVa[k] -= Ibus[r]
            dS_dVm[k] += buffer[r]
        dS_dVa[k] = -dS_dVa[k] * V[r]

@ti.kernel
def first_loop(Yp: ti.template(), Yx: ti.template(), V: ti.template(), Yj: ti.template(), Vnorm: ti.template(), buffer: ti.template(), dS_dVm: ti.template(), dS_dVa: ti.template(), Ibus: ti.template()):
    for r in range(Yp.shape[0] - 1):
        first_inner_loop(r, Yp, Yx, V, Yj, Vnorm, buffer, dS_dVm, dS_dVa)
        Ibus[r] += buffer[r]
        buffer[r] *= Vnorm[r]

@ti.kernel
def second_loop(Yp: ti.template(), Yx: ti.template(), V: ti.template(), Yj: ti.template(), Vnorm: ti.template(), buffer: ti.template(), dS_dVm: ti.template(), dS_dVa: ti.template(), Ibus: ti.template()):
    for r in range(Yp.shape[0] - 1):
        second_inner_loop(r, Yp, Yx, V, Yj, Vnorm, buffer, dS_dVm, dS_dVa, Ibus)

def dSbus_dV_taichi(Ybus, V):
    I = ti.field(dtype=float, shape=len(V))
    data = Ybus.data.astype(np.float64)
    Yp = ti.field(dtype=int, shape=len(Ybus.indptr))
    Yp.from_numpy(Ybus.indptr)
    Yj = ti.field(dtype=int, shape=len(Ybus.indices))
    Yj.from_numpy(Ybus.indices)
    Vt = ti.field(dtype=float, shape=len(V))
    Vt.from_numpy(V)
    Vnorm = ti.field(dtype=float, shape=len(V))
    Vnorm.from_numpy(V / abs(V))
    buffer = ti.field(dtype=float, shape=len(V))
    dS_dVm = ti.field(dtype=float, shape=len(data))
    dS_dVa = ti.field(dtype=float, shape=len(data))
    Yx = ti.field(dtype=float, shape=len(data))
    dS_dVm.from_numpy(data)
    dS_dVa.from_numpy(data)
    Yx.from_numpy(data)
    first_loop(Yp, Yx, Vt, Yj, Vnorm, buffer, dS_dVm, dS_dVa, I)
    second_loop(Yp, Yx, Vt, Yj, Vnorm, buffer, dS_dVm, dS_dVa, I)
    return sparse.csr_matrix((dS_dVm.to_numpy(), Ybus.indices, Ybus.indptr)), sparse.csr_matrix((dS_dVa.to_numpy(), Ybus.indices, Ybus.indptr))

Ybus = sparse.load_npz('Ybus.npz')
V = np.ones(2000)

start = time.time()
dSbus_dV_taichi(Ybus, V)
end = time.time()
print("Elapsed (with compilation) = %s" % (end - start))

start = time.time()
dSbus_dV_taichi(Ybus, V)
end = time.time()
print("Elapsed (after compilation) = %s" % (end - start))
If we just replicate the above measurement, I got
Elapsed (with compilation) = 0.30458903312683105
Elapsed (after compilation) = 0.3131248950958252
At this point, if I compare the two after-compilation versions, Numba is 1777 times faster ;/ However, as can be seen from the two execution times in the Taichi measurement, the second run is not cached at all.
Why not cached?
Taichi's JIT compiler and Numba's JIT are similar in that both cache compiled functions based on the argument types, so that when a function is called again, the cached version can be reused instead of recompiling. However, in Taichi, when ti.template() is used as the parameter type, we cache by value instead of by type. Looking back at the Taichi code, all fields are re-declared inside the dSbus_dV_taichi(Ybus, V) function, hence you never get a cached kernel back.
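(A minimal sketch of that caching behavior, with hypothetical kernel/field names:)

import taichi as ti

ti.init(arch=ti.cpu)

@ti.kernel
def fill(f: ti.template()):
    for i in f:
        f[i] = i

a = ti.field(dtype=ti.f32, shape=8)
fill(a)  # compiles a kernel instantiated for the specific field `a`
fill(a)  # same field object -> the cached kernel is reused

b = ti.field(dtype=ti.f32, shape=8)  # same type and shape, different object
fill(b)  # ti.template() caches by value, so this triggers a recompile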
How to optimize? It is quite simple: we can make all field declarations global, so that the kernels are seen to have the same signatures when called again. Here is the code:
import taichi as ti
import time
import numpy as np
from scipy import sparse

ti.init(arch=ti.cpu, debug=False)

@ti.func
def first_inner_loop(r, Yp, Yx, V, Yj, Vnorm, buffer, dS_dVm, dS_dVa):
    for k in range(Yp[r], Yp[r + 1]):
        buffer[r] += Yx[k] * V[Yj[k]]
        dS_dVm[k] *= Vnorm[Yj[k]]
        dS_dVa[k] *= V[Yj[k]]

@ti.func
def second_inner_loop(r, Yp, Yx, V, Yj, Vnorm, buffer, dS_dVm, dS_dVa, Ibus):
    for k in range(Yp[r], Yp[r + 1]):
        dS_dVm[k] *= V[r]
        if r == Yj[k]:
            dS_dVa[k] -= Ibus[r]
            dS_dVm[k] += buffer[r]
        dS_dVa[k] = -dS_dVa[k] * V[r]

@ti.kernel
def first_loop(Yp: ti.template(), Yx: ti.template(), V: ti.template(), Yj: ti.template(), Vnorm: ti.template(), buffer: ti.template(), dS_dVm: ti.template(), dS_dVa: ti.template(), Ibus: ti.template()):
    for r in range(Yp.shape[0] - 1):
        first_inner_loop(r, Yp, Yx, V, Yj, Vnorm, buffer, dS_dVm, dS_dVa)
        Ibus[r] += buffer[r]
        buffer[r] *= Vnorm[r]

@ti.kernel
def second_loop(Yp: ti.template(), Yx: ti.template(), V: ti.template(), Yj: ti.template(), Vnorm: ti.template(), buffer: ti.template(), dS_dVm: ti.template(), dS_dVa: ti.template(), Ibus: ti.template()):
    for r in range(Yp.shape[0] - 1):
        second_inner_loop(r, Yp, Yx, V, Yj, Vnorm, buffer, dS_dVm, dS_dVa, Ibus)

# load
Ybus = sparse.load_npz('Ybus.npz')
data = Ybus.data.astype(np.float64)
V = np.ones(2000)

# decl
I = ti.field(dtype=float, shape=len(V))
Yp = ti.field(dtype=int, shape=len(Ybus.indptr))
Yj = ti.field(dtype=int, shape=len(Ybus.indices))
Vt = ti.field(dtype=float, shape=len(V))
Vnorm = ti.field(dtype=float, shape=len(V))
buffer = ti.field(dtype=float, shape=len(V))
dS_dVm = ti.field(dtype=float, shape=len(data))
dS_dVa = ti.field(dtype=float, shape=len(data))
Yx = ti.field(dtype=float, shape=len(data))

def dSbus_dV_taichi():
    # inits
    Yp.from_numpy(Ybus.indptr)
    Yj.from_numpy(Ybus.indices)
    Vt.from_numpy(V)
    Vnorm.from_numpy(V / abs(V))
    dS_dVm.from_numpy(data)
    dS_dVa.from_numpy(data)
    Yx.from_numpy(data)
    first_loop(Yp, Yx, Vt, Yj, Vnorm, buffer, dS_dVm, dS_dVa, I)
    second_loop(Yp, Yx, Vt, Yj, Vnorm, buffer, dS_dVm, dS_dVa, I)
    return sparse.csr_matrix((dS_dVm.to_numpy(), Ybus.indices, Ybus.indptr)), sparse.csr_matrix((dS_dVa.to_numpy(), Ybus.indices, Ybus.indptr))

start = time.time()
dSbus_dV_taichi()
end = time.time()
print("Elapsed (with compilation) = %s" % (end - start))

start = time.time()
dSbus_dV_taichi()
end = time.time()
print("Elapsed (after compilation) = %s" % (end - start))
Now, the results are more comparable:
Elapsed (with compilation) = 0.2619659900665283
Elapsed (after compilation) = 0.0010476112365722656
But Taichi is still 6 times slower. Actually, the from_numpy conversions are not necessary in this case, since Taichi supports operating directly on external arrays such as NumPy or torch tensors. Here is a quickly hand-crafted version using only NumPy arrays:
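(A tiny sketch of that mechanism first, with hypothetical names; the full version follows below.)

import numpy as np
import taichi as ti

ti.init(arch=ti.cpu)

@ti.kernel
def scale(x: ti.any_arr()):
    for i in range(x.shape[0]):
        x[i] *= 2.0

arr = np.arange(4, dtype=np.float64)
scale(arr)   # the NumPy array is passed in directly; no from_numpy copy
print(arr)   # [0. 2. 4. 6.]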
import taichi as ti
import numpy as np
import time
from scipy import sparse

ti.init(arch=ti.cpu, debug=False)

@ti.func
def first_inner_loop(r, Yp: ti.template(), Yx: ti.template(), V: ti.template(), Yj: ti.template(), Vnorm: ti.template(), buffer: ti.template(), dS_dVm: ti.template(), dS_dVa: ti.template()):
    for k in range(Yp[r], Yp[r + 1]):
        buffer[r] += Yx[k] * V[Yj[k]]
        dS_dVm[k] *= Vnorm[Yj[k]]
        dS_dVa[k] *= V[Yj[k]]

@ti.func
def second_inner_loop(r, Yp: ti.template(), Yx: ti.template(), V: ti.template(), Yj: ti.template(), Vnorm: ti.template(), buffer: ti.template(), dS_dVm: ti.template(), dS_dVa: ti.template(), Ibus: ti.template()):
    for k in range(Yp[r], Yp[r + 1]):
        dS_dVm[k] *= V[r]
        if r == Yj[k]:
            dS_dVa[k] -= Ibus[r]
            dS_dVm[k] += buffer[r]
        dS_dVa[k] = -dS_dVa[k] * V[r]

@ti.kernel
def two_loops(Yp: ti.any_arr(), Yx: ti.any_arr(), V: ti.any_arr(), Yj: ti.any_arr(), Vnorm: ti.any_arr(), buffer: ti.any_arr(), dS_dVm: ti.any_arr(), dS_dVa: ti.any_arr(), Ibus: ti.any_arr()):
    for r in range(Yp.shape[0] - 1):
        first_inner_loop(r, Yp, Yx, V, Yj, Vnorm, buffer, dS_dVm, dS_dVa)
        Ibus[r] += buffer[r]
        buffer[r] *= Vnorm[r]
    for r in range(Yp.shape[0] - 1):
        second_inner_loop(r, Yp, Yx, V, Yj, Vnorm, buffer, dS_dVm, dS_dVa, Ibus)

def dSbus_dV_taichi(Ybus, V):
    dS_dVm = Ybus.data.astype(np.float64)
    dS_dVa = Ybus.data.astype(np.float64)
    buffer = np.zeros(len(V), dtype=float)  # accumulators, so start from zero
    I = np.zeros(len(V), dtype=float)
    two_loops(Ybus.indptr, data, V, Ybus.indices, Vnorm, buffer, dS_dVm, dS_dVa, I)
    return sparse.csr_matrix((dS_dVm, Ybus.indices, Ybus.indptr)), sparse.csr_matrix((dS_dVa, Ybus.indices, Ybus.indptr))

Ybus = sparse.load_npz('Ybus.npz')
V = np.ones(2000)
Vnorm = V / abs(V)
data = Ybus.data.astype(np.float64)

start = time.time()
dSbus_dV_taichi(Ybus, V)
end = time.time()
print("Elapsed (with compilation) = %s" % (end - start))

start = time.time()
dSbus_dV_taichi(Ybus, V)
end = time.time()
print("Elapsed (after compilation) = %s" % (end - start))
The above code gives:
Elapsed (with compilation) = 0.04335379600524902
Elapsed (after compilation) = 0.0003981590270996094
We are still around ~2x slower than Numba ;) but hopefully this gives you an idea of how to proceed. Let us know if there are further questions. Comments/suggestions are welcome.
Wow, this is definitely a tutorial-level answer and should be reused in the documentation to maximize its value! Thank you for the detailed explanation @qiao-bo!
Now it seems the major missing piece is official support for complex numbers and their basic arithmetic.
In case someone is interested in testing the complex-array version (a big step toward applying Taichi to power system analysis!), here is updated code, modified from the code above (thanks to @qiao-bo), that supports complex numbers. Surprisingly, this version is faster than Numba both with compilation and after compilation in my test environment (Intel i9-9980H, Win10, Taichi 0.8.8). The code looks a little verbose due to the lack of native support for complex numbers and arrays, but at least it works and performs well. Of course, please let me know if you find anything that could be optimized.
import taichi as ti
import numpy as np
import time
from scipy import sparse
from copy import deepcopy

ti.init(arch=ti.cpu, debug=False)

@ti.func
def first_inner_loop(r, Yp: ti.template(), Yx_real: ti.template(), Yx_imag: ti.template(), V_real: ti.template(), V_imag: ti.template(), Yj: ti.template(), Vnorm_real: ti.template(), Vnorm_imag: ti.template(), buffer_real: ti.template(), buffer_imag: ti.template(), dS_dVm_real: ti.template(), dS_dVm_imag: ti.template(), dS_dVa_real: ti.template(), dS_dVa_imag: ti.template()):
    for k in range(Yp[r], Yp[r + 1]):
        buffer_real[r] += Yx_real[k] * V_real[Yj[k]] - Yx_imag[k] * V_imag[Yj[k]]
        buffer_imag[r] += Yx_real[k] * V_imag[Yj[k]] + Yx_imag[k] * V_real[Yj[k]]
        dS_dVm_real[k] = dS_dVm_real[k] * Vnorm_real[Yj[k]] - dS_dVm_imag[k] * Vnorm_imag[Yj[k]]
        dS_dVm_imag[k] = dS_dVm_real[k] * Vnorm_imag[Yj[k]] + dS_dVm_imag[k] * Vnorm_real[Yj[k]]
        dS_dVa_real[k] = dS_dVa_real[k] * V_real[Yj[k]] - dS_dVa_imag[k] * V_imag[Yj[k]]
        dS_dVa_imag[k] = dS_dVa_real[k] * V_imag[Yj[k]] + dS_dVa_imag[k] * V_real[Yj[k]]

@ti.func
def second_inner_loop(r, Yp: ti.template(), V_real: ti.template(), V_imag: ti.template(), Yj: ti.template(), buffer_real: ti.template(), buffer_imag: ti.template(), dS_dVm_real: ti.template(), dS_dVm_imag: ti.template(), dS_dVa_real: ti.template(), dS_dVa_imag: ti.template(), Ibus_real: ti.template(), Ibus_imag: ti.template()):
    for k in range(Yp[r], Yp[r + 1]):
        dS_dVm_real[k] = dS_dVm_real[k] * V_real[r] + dS_dVm_imag[k] * V_imag[r]
        dS_dVm_imag[k] = dS_dVm_real[k] * V_imag[r] - dS_dVm_imag[k] * V_real[r]
        if r == Yj[k]:
            dS_dVa_real[k] = -Ibus_real[r] + dS_dVa_real[k]
            dS_dVa_imag[k] = -Ibus_imag[r] + dS_dVa_imag[k]
            dS_dVm_real[k] += buffer_real[r]
            dS_dVm_imag[k] += buffer_imag[r]
        dS_dVa_real_temp = dS_dVa_real[k]
        dS_dVa_imag_temp = dS_dVa_imag[k]
        dS_dVa_imag[k] = -dS_dVa_real_temp * V_real[r] - dS_dVa_imag_temp * V_imag[r]
        dS_dVa_real[k] = dS_dVa_real_temp * V_imag[r] - dS_dVa_imag_temp * V_real[r]

@ti.kernel
def two_loops(Yp: ti.any_arr(), Yx_real: ti.any_arr(), Yx_imag: ti.any_arr(), V_real: ti.any_arr(), V_imag: ti.any_arr(), Yj: ti.any_arr(), Vnorm_real: ti.any_arr(), Vnorm_imag: ti.any_arr(), buffer_real: ti.any_arr(), buffer_imag: ti.any_arr(), dS_dVm_real: ti.any_arr(), dS_dVm_imag: ti.any_arr(), dS_dVa_real: ti.any_arr(), dS_dVa_imag: ti.any_arr(), Ibus_real: ti.any_arr(), Ibus_imag: ti.any_arr()):
    for r in range(Yp.shape[0] - 1):
        first_inner_loop(r, Yp, Yx_real, Yx_imag, V_real, V_imag, Yj, Vnorm_real, Vnorm_imag, buffer_real, buffer_imag, dS_dVm_real, dS_dVm_imag, dS_dVa_real, dS_dVa_imag)
        Ibus_real[r] += buffer_real[r]
        Ibus_imag[r] += buffer_imag[r]
        buffer_real[r] = buffer_real[r] * Vnorm_real[r] + buffer_imag[r] * Vnorm_imag[r]
        buffer_imag[r] = buffer_real[r] * Vnorm_imag[r] - buffer_imag[r] * Vnorm_real[r]
    for r in range(Yp.shape[0] - 1):
        second_inner_loop(r, Yp, V_real, V_imag, Yj, buffer_real, buffer_imag, dS_dVm_real, dS_dVm_imag, dS_dVa_real, dS_dVa_imag, Ibus_real, Ibus_imag)

def dSbus_dV_taichi(Ybus, V):
    dS_dVm_real = deepcopy(Ybus.data.real)
    dS_dVm_imag = deepcopy(Ybus.data.imag)
    dS_dVa_real = deepcopy(Ybus.data.real)
    dS_dVa_imag = deepcopy(Ybus.data.imag)
    buffer_real = np.zeros(len(V), dtype=float)
    buffer_imag = np.zeros(len(V), dtype=float)
    I_real = np.zeros(len(V), dtype=float)
    I_imag = np.zeros(len(V), dtype=float)
    Vnorm = V / abs(V)
    Vnorm_real = Vnorm.real
    Vnorm_imag = Vnorm.imag
    V_real = V.real
    V_imag = V.imag
    data_real = deepcopy(Ybus.data.real)
    data_imag = deepcopy(Ybus.data.imag)
    two_loops(Ybus.indptr, data_real, data_imag, V_real, V_imag, Ybus.indices, Vnorm_real, Vnorm_imag, buffer_real, buffer_imag, dS_dVm_real, dS_dVm_imag, dS_dVa_real, dS_dVa_imag, I_real, I_imag)
    dS_dVm = np.empty(len(dS_dVm_real), dtype=np.complex128)
    dS_dVm.real = dS_dVm_real
    dS_dVm.imag = dS_dVm_imag
    dS_dVa = np.empty(len(dS_dVm_real), dtype=np.complex128)
    dS_dVa.real = dS_dVa_real
    dS_dVa.imag = dS_dVa_imag
    return sparse.csr_matrix((dS_dVm, Ybus.indices, Ybus.indptr)), sparse.csr_matrix((dS_dVa, Ybus.indices, Ybus.indptr))

Ybus = sparse.load_npz('Ybus.npz')  # Ybus is a complex128 sparse matrix
V = np.ones(2000)

start = time.time()
dSbus_dV_taichi(Ybus, V)
end = time.time()
print("Elapsed (with compilation) = %s" % (end - start))

start = time.time()
dSbus_dV_taichi(Ybus, V)
end = time.time()
print("Elapsed (after compilation) = %s" % (end - start))
Waiting for official support for complex numbers as well. For now I resort to 2D vector fields, since I don't need complex multiplication.
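(A minimal sketch of that workaround, with hypothetical names: store z = re + i*im as a 2-component vector field; addition and real-scalar multiplication then work element-wise, without any custom complex routines.)

import taichi as ti

ti.init(arch=ti.cpu)

z = ti.Vector.field(2, dtype=ti.f64, shape=2000)  # z[i] = (real, imag)

@ti.kernel
def axpb(alpha: ti.f64):
    for i in z:
        # alpha*z + (1 + 0i), using only vector add/scale
        z[i] = alpha * z[i] + ti.Vector([1.0, 0.0])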
Update:
I found a serious bug in the example I gave above. Since I am computing the real and imaginary parts of each variable separately, it is important to freeze (copy into a local) a value before updating it whenever the result depends on both the real and imaginary parts. I am not sure whether this is the most elegant and performant way to do it, but it works, and the result exactly matches the NumPy and Numba versions of the complex Jacobian matrix.
Original:
# As an example; the other functions have the same problem
@ti.func
def first_inner_loop(r, Yp: ti.template(), Yx_real: ti.template(), Yx_imag: ti.template(), V_real: ti.template(), V_imag: ti.template(), Yj: ti.template(), Vnorm_real: ti.template(), Vnorm_imag: ti.template(), buffer_real: ti.template(), buffer_imag: ti.template(), dS_dVm_real: ti.template(), dS_dVm_imag: ti.template(), dS_dVa_real: ti.template(), dS_dVa_imag: ti.template()):
    for k in range(Yp[r], Yp[r + 1]):
        buffer_real[r] += Yx_real[k] * V_real[Yj[k]] - Yx_imag[k] * V_imag[Yj[k]]
        buffer_imag[r] += Yx_real[k] * V_imag[Yj[k]] + Yx_imag[k] * V_real[Yj[k]]
        dS_dVm_real[k] = dS_dVm_real[k] * Vnorm_real[Yj[k]] - dS_dVm_imag[k] * Vnorm_imag[Yj[k]]
        dS_dVm_imag[k] = dS_dVm_real[k] * Vnorm_imag[Yj[k]] + dS_dVm_imag[k] * Vnorm_real[Yj[k]]
        dS_dVa_real[k] = dS_dVa_real[k] * V_real[Yj[k]] - dS_dVa_imag[k] * V_imag[Yj[k]]
        dS_dVa_imag[k] = dS_dVa_real[k] * V_imag[Yj[k]] + dS_dVa_imag[k] * V_real[Yj[k]]
Corrected:
@ti.func
def first_inner_loop(r, Yp: ti.template(), Yx_real: ti.template(), Yx_imag: ti.template(), V_real: ti.template(), V_imag: ti.template(), Yj: ti.template(), Vnorm_real: ti.template(), Vnorm_imag: ti.template(), buffer_real: ti.template(), buffer_imag: ti.template(), dS_dVm_real: ti.template(), dS_dVm_imag: ti.template(), dS_dVa_real: ti.template(), dS_dVa_imag: ti.template()):
    for k in range(Yp[r], Yp[r + 1]):
        buffer_real[r] += Yx_real[k] * V_real[Yj[k]] - Yx_imag[k] * V_imag[Yj[k]]
        buffer_imag[r] += Yx_real[k] * V_imag[Yj[k]] + Yx_imag[k] * V_real[Yj[k]]
        dS_dVm_real_temp = dS_dVm_real[k]  # freeze dS_dVm[k]
        dS_dVm_imag_temp = dS_dVm_imag[k]
        dS_dVm_real[k] = dS_dVm_real_temp * Vnorm_real[Yj[k]] - dS_dVm_imag_temp * Vnorm_imag[Yj[k]]
        dS_dVm_imag[k] = dS_dVm_real_temp * Vnorm_imag[Yj[k]] + dS_dVm_imag_temp * Vnorm_real[Yj[k]]
        dS_dVa_real_temp = dS_dVa_real[k]  # freeze dS_dVa[k]
        dS_dVa_imag_temp = dS_dVa_imag[k]
        dS_dVa_real[k] = dS_dVa_real_temp * V_real[Yj[k]] - dS_dVa_imag_temp * V_imag[Yj[k]]
        dS_dVa_imag[k] = dS_dVa_real_temp * V_imag[Yj[k]] + dS_dVa_imag_temp * V_real[Yj[k]]
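(Distilled to a two-statement example with hypothetical names z and w, the bug is a read-after-write on the real part:)

# plain-Python demo: compute (1 + 2i) * (3 + 4i) = -5 + 10i in place
z_real, z_imag = [1.0], [2.0]
w_real, w_imag = 3.0, 4.0
k = 0

# Buggy order: the imag update would read z_real[k] after it was overwritten:
#   z_real[k] = z_real[k] * w_real - z_imag[k] * w_imag
#   z_imag[k] = z_real[k] * w_imag + z_imag[k] * w_real  # sees the NEW real part

# Fixed: freeze the old real part in a local first
zr = z_real[k]
z_real[k] = zr * w_real - z_imag[k] * w_imag  # -5.0
z_imag[k] = zr * w_imag + z_imag[k] * w_real  # 10.0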
Another update: unfortunately, the correction seems to make the performance noticeably worse. I wonder if there is a better way to do it.
Update: the corrected version above can be further improved, since only the real part needs to be frozen:
@ti.func
def first_inner_loop(r, Yp: ti.template(), Yx_real: ti.template(), Yx_imag: ti.template(), V_real: ti.template(), V_imag: ti.template(), Yj: ti.template(), Vnorm_real: ti.template(), Vnorm_imag: ti.template(), buffer_real: ti.template(), buffer_imag: ti.template(), dS_dVm_real: ti.template(), dS_dVm_imag: ti.template(), dS_dVa_real: ti.template(), dS_dVa_imag: ti.template()):
    for k in range(Yp[r], Yp[r + 1]):
        buffer_real[r] += Yx_real[k] * V_real[Yj[k]] - Yx_imag[k] * V_imag[Yj[k]]
        buffer_imag[r] += Yx_real[k] * V_imag[Yj[k]] + Yx_imag[k] * V_real[Yj[k]]
        dS_dVm_real_temp = dS_dVm_real[k]  # freeze dS_dVm[k]
        dS_dVm_real[k] = dS_dVm_real_temp * Vnorm_real[Yj[k]] - dS_dVm_imag[k] * Vnorm_imag[Yj[k]]
        dS_dVm_imag[k] = dS_dVm_real_temp * Vnorm_imag[Yj[k]] + dS_dVm_imag[k] * Vnorm_real[Yj[k]]
        dS_dVa_real_temp = dS_dVa_real[k]  # freeze dS_dVa[k]
        dS_dVa_real[k] = dS_dVa_real_temp * V_real[Yj[k]] - dS_dVa_imag[k] * V_imag[Yj[k]]
        dS_dVa_imag[k] = dS_dVa_real_temp * V_imag[Yj[k]] + dS_dVa_imag[k] * V_real[Yj[k]]
@qiao-bo In the example you gave above, the external NumPy arrays are passed directly to the Taichi kernel, and I assume they are laid out in a struct-of-arrays (SoA) pattern. What if I want to pass those NumPy arrays in an array-of-structs (AoS) pattern (say they all have the same length) so that I can better utilize the cache? What would be the most efficient (best?) way to convert the NumPy arrays into AoS inside a Taichi kernel, and how much performance gain should we expect? Thanks!
@mzy2240, how much performance degradation did you observe for the corrected code? (BTW, the imaginary part does not seem to need any freeze; only the real part does?)
Regarding the AoS/SoA pattern: since these arrays are passed as independent parameters, we can assume neither AoS nor SoA. If you want to explicitly control the memory layout, you have to use fields. Taichi fields are designed to allow flexible data layout composition; check out the docs here.
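(A minimal sketch of the two layouts via field placement, with hypothetical field names; whether AoS actually wins depends on the access pattern.)

import taichi as ti

ti.init(arch=ti.cpu)

n = 1024
x = ti.field(dtype=ti.f64)
y = ti.field(dtype=ti.f64)

# AoS: placing both fields in one dense node interleaves x[i] and y[i]
ti.root.dense(ti.i, n).place(x, y)

# SoA alternative: place each field under its own dense node instead, e.g.
#   ti.root.dense(ti.i, n).place(x)
#   ti.root.dense(ti.i, n).place(y)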
@mzy2240, how much performance degradation did you observe for the corrected code? (BTW, the imaginary part does not seem to need any freeze; only the real part does?)
I observed a 10x degradation, and you are right, only the real part needs the freeze. (So I guess the actual performance difference between Taichi and Numba for the same function is somewhere between 1x and 10x.)
Regarding the AoS/SoA pattern: since these arrays are passed as independent parameters, we can assume neither AoS nor SoA. If you want to explicitly control the memory layout, you have to use fields. Taichi fields are designed to allow flexible data layout composition; check out the docs here.
Does that mean that in order to use fields, I have to do an explicit conversion first, which will unfortunately introduce overhead again? And if we neglect the conversion overhead, how much performance gain could be expected from using AoS over SoA in the above example?
I observed a 10x degradation, and you are right, only the real part needs the freeze. (So I guess the actual performance difference between Taichi and Numba for the same function is somewhere between 1x and 10x.)
That is a bit strange. I will have a look at this and come back. The freeze is the only modification compared to the previous code, right?
Does that mean that in order to use fields, I have to do an explicit conversion first, which will unfortunately introduce overhead again? And if we neglect the conversion overhead, how much performance gain could be expected from using AoS over SoA in the above example?
Yep, fields require a copy from NumPy, and that is indeed some overhead. It's really hard to give a number here ;/ since the speedup depends on the cache size, the array length, etc. I guess we will have to try and see ;/
That is a bit strange. I will have a look at this and come back. The freeze is the only modification compared to the previous code, right?
Yes, the freeze is the only modification, but the modified function above is just one example; I made the same modification in all the functions. And yes, I was surprised as well when I saw the results.
Yep, fields require a copy from NumPy, and that is indeed some overhead. It's really hard to give a number here ;/ since the speedup depends on the cache size, the array length, etc. I guess we will have to try and see ;/
Yeah, cache size and array length will definitely have an impact on performance. The reason I ask is that I feel the functions above might still be compute-bound, so savings on data access may not be that noticeable.
Hi, first, sorry for the late reply. Regarding the performance difference, I just ran the previous code with and without the "freeze dS_dVm[k]" change. On my local PC with an i9-11900K there does not seem to be much difference. I got something like:
Without the freeze:
[Taichi] version 0.9.1, llvm 10.0.0, commit e9e15196, linux, python 3.8.12
[Taichi] Starting on arch=x64
Elapsed (with compilation) = 0.05987668037414551
Elapsed (after compilation) = 0.0004096031188964844
With the freeze:
[Taichi] version 0.9.1, llvm 10.0.0, commit e9e15196, linux, python 3.8.12
[Taichi] Starting on arch=x64
Elapsed (with compilation) = 0.05990171432495117
Elapsed (after compilation) = 0.0004048347473144531
Can you try again and check whether the performance difference still exists?
This issue seems to have been ignored for a long time. Do we have any updates?
@ifsheldon I can see they are adding some support for basic complex operations; not sure when complex number types will be supported, though.