taichi icon indicating copy to clipboard operation
taichi copied to clipboard

complex number support and basic methods

Open mzy2240 opened this issue 3 years ago • 21 comments

Sorry in case being dumb. I just wonder could Taichi support complex number as a data type, and support some relevant basic arithmetic and methods(like conjugate).

I'm trying to compute a partial jacobian matrix using Taichi (already have a parallel numba version to compare), but cannot find related example for complex vector op.

mzy2240 avatar Nov 21 '21 05:11 mzy2240

Hi @mzy2240 , this will be a useful feature :-) For now you can take a look into https://github.com/taichi-dev/taichi_glsl/blob/master/taichi_glsl/classes.py for how a custom Complex can be implemented. Ideally, we'd like to support this in our C++ type system (CC @strongoier)

k-ye avatar Nov 22 '21 07:11 k-ye

One update on this issue, also partially related to https://github.com/taichi-dev/taichi/issues/1206, I did a quick but probably unfair comparison between numba and taichi on a partial Jacobian matrix generation process (Im a power system researcher and the Jacobian matrix is a big thing in steady state power system analysis). You may say [ref]

In taichi you write the loops and operate on individual elements of a matrix

and yes you could absolutely write in a form of matrix multiplication, like below

from numpy import conj, diag, asarray
from scipy.sparse import issparse, csr_matrix as sparse
from numba import jit
from numpy import complex128, float64, int32
from numpy.core.multiarray import zeros, empty, array

def dSbus_dV(Ybus, V):
    Ibus = Ybus * V  # Ybus is a 2000*2000 sparse matrix and V is a 1-D array
    ib = range(len(V))
    diagV = sparse((V, (ib, ib)))
    diagIbus = sparse((Ibus, (ib, ib)))
    diagVnorm = sparse((V / abs(V), (ib, ib)))
    dS_dVm = diagV * conj(Ybus * diagVnorm) + conj(diagIbus) * diagVnorm
    dS_dVa = 1j * diagV * conj(diagIbus - Ybus * diagV)
    return dS_dVm, dS_dVa

but actually the fact is counter-intuitive: using numba to do element-wise operation is 5 times fast than doing matrix-level multiplication using numpy and scipy (the numba version could finish 1000 times execution in less than 0.65 seconds, with Intel i9-9880H). Here is the numba code I am using:

@jit(nopython=True, cache=False)
def dSbus_dV_numba_sparse(Yx, Yp, Yj, V, Vnorm, Ibus): 
    # init buffer vector
    buffer = zeros(len(V), dtype=complex128)
    dS_dVm = Yx.copy()
    dS_dVa = Yx.copy()

    # iterate through sparse matrix
    for r in range(len(Yp) - 1):
        for k in range(Yp[r], Yp[r + 1]):
            buffer[r] += Yx[k] * V[Yj[k]]
            dS_dVm[k] *= Vnorm[Yj[k]]
            dS_dVa[k] *= V[Yj[k]]

        Ibus[r] += buffer[r]
        buffer[r] = conj(buffer[r]) * Vnorm[r]

    for r in range(len(Yp) - 1):
        for k in range(Yp[r], Yp[r + 1]):
            dS_dVm[k] = conj(dS_dVm[k]) * V[r]

            if r == Yj[k]:
                # diagonal elements
                dS_dVa[k] = -Ibus[r] + dS_dVa[k]
                dS_dVm[k] += buffer[r]

            dS_dVa[k] = conj(-dS_dVa[k]) * (1j * V[r])

    return dS_dVm, dS_dVa


def dSbus_dV_numba(Ybus, V, I=None):

    if issparse(Ybus):
        I = zeros(len(V), dtype=complex128) if I is None else -I
        dS_dVm, dS_dVa = dSbus_dV_numba_sparse(Ybus.data, Ybus.indptr, Ybus.indices, V, V / abs(V), I)
        return sparse((dS_dVm, Ybus.indices, Ybus.indptr)), sparse((dS_dVa, Ybus.indices, Ybus.indptr))
    else:
        I = zeros(len(V), dtype=complex128) if I is None else I
        return dSbus_dV_dense(Ybus, V, I)

Since numba is doing the element-wise operation on matrix, I were thinking maybe I could also give a try using Taichi, and here is my script: (this is literally my second Taichi program so forgive me if it is not optimized)

ti.init(arch=ti.cpu, packed=True, debug=False)

@ti.func
def first_inner_loop(r, Yp, Yx, V, Yj, Vnorm, buffer, dS_dVm, dS_dVa):
    for k in range(Yp[r], Yp[r+1]):
        buffer[r] += Yx[k] * V[Yj[k]]
        dS_dVm[k] *= Vnorm[Yj[k]]
        dS_dVa[k] *= V[Yj[k]]
        
@ti.func
def second_inner_loop(r, Yp, Yx, V, Yj, Vnorm, buffer, dS_dVm, dS_dVa, Ibus):
    for k in range(Yp[r], Yp[r+1]):
        dS_dVm[k] *= V[r]
        if r == Yj[k]:
            dS_dVa[k] -= Ibus[r]
            dS_dVm[k] += buffer[r]
        dS_dVa[k] = -dS_dVa[k] * V[r]
        

@ti.kernel
def first_loop(Yp: ti.template(), Yx: ti.template(), V: ti.template(), Yj: ti.template(), Vnorm: ti.template(), buffer: ti.template(), dS_dVm: ti.template(), dS_dVa: ti.template(), Ibus: ti.template()):
    for r in range(Yp.shape[0]-1):
        first_inner_loop(r, Yp, Yx, V, Yj, Vnorm, buffer, dS_dVm, dS_dVa)
        Ibus[r] += buffer[r]
        buffer[r] *= Vnorm[r]
        
@ti.kernel
def second_loop(Yp: ti.template(), Yx: ti.template(), V: ti.template(), Yj: ti.template(), Vnorm: ti.template(), buffer: ti.template(), dS_dVm: ti.template(), dS_dVa: ti.template(), Ibus: ti.template()):
    for r in range(Yp.shape[0]-1):
        second_inner_loop(r, Yp, Yx, V, Yj, Vnorm, buffer, dS_dVm, dS_dVa, Ibus)
        

def dSbus_dV_taichi(Ybus, V):
    I = ti.field(dtype=float, shape=len(V))
    data = Ybus.data.astype(float64)
    Yp = ti.field(dtype=int, shape=len(Ybus.indptr))
    Yp.from_numpy(Ybus.indptr)
    Yj = ti.field(dtype=int, shape=len(Ybus.indices))
    Yj.from_numpy(Ybus.indices)
    Vt = ti.field(dtype=float, shape=len(V))
    Vt.from_numpy(V)
    Vnorm = ti.field(dtype=float, shape=len(V))
    Vnorm.from_numpy(V / abs(V))
    buffer = ti.field(dtype=float, shape=len(V))
    dS_dVm = ti.field(dtype=float, shape=len(data))
    dS_dVa = ti.field(dtype=float, shape=len(data))
    Yx = ti.field(dtype=float, shape=len(data))
    dS_dVm.from_numpy(data)
    dS_dVa.from_numpy(data)
    Yx.from_numpy(data)
    first_loop(Yp, Yx, Vt, Yj, Vnorm, buffer, dS_dVm, dS_dVa, I)
    second_loop(Yp, Yx, Vt, Yj, Vnorm, buffer, dS_dVm, dS_dVa, I)
    return sparse((dS_dVm.to_numpy(), Ybus.indices, Ybus.indptr)), sparse((dS_dVa.to_numpy(), Ybus.indices, Ybus.indptr))

Surprisingly the development process went smooth, however the performance is not good by all means. The taichi code is about 1000 times slower than numba (I removed the imaginary part so should be in the favor of Taichi I guess). I am not saying Taichi is really 1000 times slower than numba or numpy, and even though I attended all the Taichi classes about advanced data layout, sparse matrix and performance optimization, I still have no idea how to optimize the code in this usage, especially how to efficiently operate each element, exactly like how numba code does. So my question is, do you think it is a fair comparison (on this specific scope)? And how should I optimize the taichi script to achieve better results than numba?

Thanks and sorry if it is not relevant to the issue.

Update: Forgot to mention a few "optimization" I did for the taichi code: 1 TI_CACHE_RUNTIME_BITCODE is enabled 2 padding is disabled (did not see a reason why in the scope padding is needed) 3 only one iteration in a single kernel/function All the tests and profiling are carried in Python 3.8 with the latest Taichi 0.8.6.

mzy2240 avatar Nov 24 '21 05:11 mzy2240

What's the typical shape of Ybus and V?

bobcao3 avatar Nov 24 '21 05:11 bobcao3

Hi, would you mind update the full code s.t. we can also run some tests locally?

qiao-bo avatar Nov 24 '21 05:11 qiao-bo

Hi, would you mind update the full code s.t. we can also run some tests locally?

Or I think a sample data or data generator of Ybus and V values will be good

bobcao3 avatar Nov 24 '21 05:11 bobcao3

Here is the zipped npz file for the Ybus matrix. You could load using scipy.sparse.load_npz after decompression. Ybus.zip For the V array, you could generate like this:

V = np.ones(2000)

Ybus is a 2000*2000 sparse matrix and V is a 1D array with length 2000. 2000 is definitely not a small size for power system in terms of buses, but it is also not something too large to be practical considering the largest case we have exceeds 100,000 buses (which means the Ybus is 100,000 * 100,000). For the sparsity, it is about 0.5%.

mzy2240 avatar Nov 24 '21 06:11 mzy2240

In case you guys want to test, I just updated the original post with all the imports I am using in the code.

mzy2240 avatar Nov 24 '21 06:11 mzy2240

Hi @mzy2240, there are a few findings I can share regarding to the 1000x performance diff ;/. I believe you were comparing the performance of an after-compilation Numba version with a with-compilation Taichi version. Here are some details:

First, the Numba code i measured (some small modifications based on the above code just to make it executable):

from numba import jit
import numpy as np
import time
from scipy import sparse

@jit(nopython=True, cache=False)
def dSbus_dV_numba_sparse(Yx, Yp, Yj, V, Vnorm, Ibus): 
    # init buffer vector
    buffer = np.zeros(len(V), dtype=np.complex128)
    dS_dVm = Yx.copy()
    dS_dVa = Yx.copy()

    # iterate through sparse matrix
    for r in range(len(Yp) - 1):
        for k in range(Yp[r], Yp[r + 1]):
            buffer[r] += Yx[k] * V[Yj[k]]
            dS_dVm[k] *= Vnorm[Yj[k]]
            dS_dVa[k] *= V[Yj[k]]

        Ibus[r] += buffer[r]
        buffer[r] = np.conj(buffer[r]) * Vnorm[r]

    for r in range(len(Yp) - 1):
        for k in range(Yp[r], Yp[r + 1]):
            dS_dVm[k] = np.conj(dS_dVm[k]) * V[r]

            if r == Yj[k]:
                # diagonal elements
                dS_dVa[k] = -Ibus[r] + dS_dVa[k]
                dS_dVm[k] += buffer[r]

            dS_dVa[k] = np.conj(-dS_dVa[k]) * (1j * V[r])

    return dS_dVm, dS_dVa


def dSbus_dV_numba(Ybus, V, I=None):
    if sparse.issparse(Ybus):
        I = np.zeros(len(V), dtype=np.complex128) if I is None else -I
        dS_dVm, dS_dVa = dSbus_dV_numba_sparse(Ybus.data, Ybus.indptr, Ybus.indices, V, V / abs(V), I)
        return sparse.csr_matrix((dS_dVm, Ybus.indices, Ybus.indptr)), sparse.csr_matrix((dS_dVa, Ybus.indices, Ybus.indptr))
    else:
        I = np.zeros(len(V), dtype=np.complex128) if I is None else I
        return dSbus_dV_dense(Ybus, V, I)


Ybus = sparse.load_npz('Ybus.npz')
V = np.ones(2000)

start = time.time()
dSbus_dV_numba(Ybus, V)
end = time.time()
print("Elapsed (with compilation) = %s" % (end - start))

start = time.time()
dSbus_dV_numba(Ybus, V)
end = time.time()
print("Elapsed (after compilation) = %s" % (end - start))

As suggested by How to measure the performance of Numba? On my local PC with i9-11900K. This code takes

Elapsed (with compilation) = 0.3006782531738281
Elapsed (after compilation) = 0.0001761913299560547

Now the Taichi code you posted:

import taichi as ti
import numpy as np
import time
from scipy import sparse

ti.init(arch=ti.cpu, debug=False)

@ti.func
def first_inner_loop(r, Yp, Yx, V, Yj, Vnorm, buffer, dS_dVm, dS_dVa):
    for k in range(Yp[r], Yp[r+1]):
        buffer[r] += Yx[k] * V[Yj[k]]
        dS_dVm[k] *= Vnorm[Yj[k]]
        dS_dVa[k] *= V[Yj[k]]
        
@ti.func
def second_inner_loop(r, Yp, Yx, V, Yj, Vnorm, buffer, dS_dVm, dS_dVa, Ibus):
    for k in range(Yp[r], Yp[r+1]):
        dS_dVm[k] *= V[r]
        if r == Yj[k]:
            dS_dVa[k] -= Ibus[r]
            dS_dVm[k] += buffer[r]
        dS_dVa[k] = -dS_dVa[k] * V[r]
        

@ti.kernel
def first_loop(Yp: ti.template(), Yx: ti.template(), V: ti.template(), Yj: ti.template(), Vnorm: ti.template(), buffer: ti.template(), dS_dVm: ti.template(), dS_dVa: ti.template(), Ibus: ti.template()):
    for r in range(Yp.shape[0]-1):
        first_inner_loop(r, Yp, Yx, V, Yj, Vnorm, buffer, dS_dVm, dS_dVa)
        Ibus[r] += buffer[r]
        buffer[r] *= Vnorm[r]
        
@ti.kernel
def second_loop(Yp: ti.template(), Yx: ti.template(), V: ti.template(), Yj: ti.template(), Vnorm: ti.template(), buffer: ti.template(), dS_dVm: ti.template(), dS_dVa: ti.template(), Ibus: ti.template()):
    for r in range(Yp.shape[0]-1):
        second_inner_loop(r, Yp, Yx, V, Yj, Vnorm, buffer, dS_dVm, dS_dVa, Ibus)
        


def dSbus_dV_taichi(Ybus, V):
    I = ti.field(dtype=float, shape=len(V))
    data = Ybus.data.astype(np.float64)
    Yp = ti.field(dtype=int, shape=len(Ybus.indptr))
    Yp.from_numpy(Ybus.indptr)
    Yj = ti.field(dtype=int, shape=len(Ybus.indices))
    Yj.from_numpy(Ybus.indices)
    Vt = ti.field(dtype=float, shape=len(V))
    Vt.from_numpy(V)
    Vnorm = ti.field(dtype=float, shape=len(V))
    Vnorm.from_numpy(V / abs(V))
    buffer = ti.field(dtype=float, shape=len(V))
    dS_dVm = ti.field(dtype=float, shape=len(data))
    dS_dVa = ti.field(dtype=float, shape=len(data))
    Yx = ti.field(dtype=float, shape=len(data))
    dS_dVm.from_numpy(data)
    dS_dVa.from_numpy(data)
    Yx.from_numpy(data)
    first_loop(Yp, Yx, Vt, Yj, Vnorm, buffer, dS_dVm, dS_dVa, I)
    second_loop(Yp, Yx, Vt, Yj, Vnorm, buffer, dS_dVm, dS_dVa, I)
    return sparse.csr_matrix((dS_dVm.to_numpy(), Ybus.indices, Ybus.indptr)), sparse.csr_matrix((dS_dVa.to_numpy(), Ybus.indices, Ybus.indptr))


Ybus = sparse.load_npz('Ybus.npz')
V = np.ones(2000)


start = time.time()
dSbus_dV_taichi(Ybus, V)
end = time.time()
print("Elapsed (with compilation) = %s" % (end - start))

start = time.time()
dSbus_dV_taichi(Ybus, V)
end = time.time()
print("Elapsed (after compilation) = %s" % (end - start))

If we just replicate the above measurement, I got

Elapsed (with compilation) = 0.30458903312683105
Elapsed (after compilation) = 0.3131248950958252

At this point, If i compare the two after-compilation version, Numba is 1777 times faster ;/ However, as can be seen from the two execution times in Taichi measurement, the second run is not cached at all.

Why not cached? Actually, Taichi JIT compiler and Numba JIT has similarities that we all cache the function based on the arguments type. So if it is called again, the cached version can be reused instead of having to recompile. However, in Taichi, when ti.template() is used as parameter type, we cache by value instead of by type. Now looking back to the Taichi code, in the def dSbus_dV_taichi(Ybus, V): function, all fields are re-declared. Hence, you won't get a cached kernel returned.

How to optimize? It is quite simple, we can make all field decls global, such that kernels can be seen to have same signatures when being recompiled. Here is the code:

import taichi as ti
import time
import numpy as np
from scipy import sparse

ti.init(arch=ti.cpu, debug=False)

@ti.func
def first_inner_loop(r, Yp, Yx, V, Yj, Vnorm, buffer, dS_dVm, dS_dVa):
    for k in range(Yp[r], Yp[r+1]):
        buffer[r] += Yx[k] * V[Yj[k]]
        dS_dVm[k] *= Vnorm[Yj[k]]
        dS_dVa[k] *= V[Yj[k]]
        
@ti.func
def second_inner_loop(r, Yp, Yx, V, Yj, Vnorm, buffer, dS_dVm, dS_dVa, Ibus):
    for k in range(Yp[r], Yp[r+1]):
        dS_dVm[k] *= V[r]
        if r == Yj[k]:
            dS_dVa[k] -= Ibus[r]
            dS_dVm[k] += buffer[r]
        dS_dVa[k] = -dS_dVa[k] * V[r]
        

@ti.kernel
def first_loop(Yp: ti.template(), Yx: ti.template(), V: ti.template(), Yj: ti.template(), Vnorm: ti.template(), buffer: ti.template(), dS_dVm: ti.template(), dS_dVa: ti.template(), Ibus: ti.template()):
    for r in range(Yp.shape[0]-1):
        first_inner_loop(r, Yp, Yx, V, Yj, Vnorm, buffer, dS_dVm, dS_dVa)
        Ibus[r] += buffer[r]
        buffer[r] *= Vnorm[r]
        
@ti.kernel
def second_loop(Yp: ti.template(), Yx: ti.template(), V: ti.template(), Yj: ti.template(), Vnorm: ti.template(), buffer: ti.template(), dS_dVm: ti.template(), dS_dVa: ti.template(), Ibus: ti.template()):
    for r in range(Yp.shape[0]-1):
        second_inner_loop(r, Yp, Yx, V, Yj, Vnorm, buffer, dS_dVm, dS_dVa, Ibus)
        
# load
Ybus = sparse.load_npz('Ybus.npz')
data = Ybus.data.astype(np.float64)
V = np.ones(2000)

# decl
I = ti.field(dtype=float, shape=len(V))
Yp = ti.field(dtype=int, shape=len(Ybus.indptr))
Yj = ti.field(dtype=int, shape=len(Ybus.indices))
Vt = ti.field(dtype=float, shape=len(V))
Vnorm = ti.field(dtype=float, shape=len(V))
buffer = ti.field(dtype=float, shape=len(V))
dS_dVm = ti.field(dtype=float, shape=len(data))
dS_dVa = ti.field(dtype=float, shape=len(data))
Yx = ti.field(dtype=float, shape=len(data))

def dSbus_dV_taichi():
    # inits
    Yp.from_numpy(Ybus.indptr)
    Yj.from_numpy(Ybus.indices)
    Vt.from_numpy(V)
    Vnorm.from_numpy(V / abs(V))
    dS_dVm.from_numpy(data)
    dS_dVa.from_numpy(data)
    Yx.from_numpy(data)

    first_loop(Yp, Yx, Vt, Yj, Vnorm, buffer, dS_dVm, dS_dVa, I)
    second_loop(Yp, Yx, Vt, Yj, Vnorm, buffer, dS_dVm, dS_dVa, I)
    return sparse.csr_matrix((dS_dVm.to_numpy(), Ybus.indices, Ybus.indptr)), sparse.csr_matrix((dS_dVa.to_numpy(), Ybus.indices, Ybus.indptr))


start = time.time()
dSbus_dV_taichi()
end = time.time()
print("Elapsed (with compilation) = %s" % (end - start))


start = time.time()
dSbus_dV_taichi()
end = time.time()
print("Elapsed (after compilation) = %s" % (end - start))

Now, the results are more comparable:

Elapsed (with compilation) = 0.2619659900665283
Elapsed (after compilation) = 0.0010476112365722656

But, Taichi is still 6 times slower. Actually, the from_numpy conversions are not necessary in this case. Since in Taichi, we support direct operations using external arrays such as numpy or torch. This is a quickly hand-crafted version only using numpy arrays:

import taichi as ti
import numpy as np
import time
from scipy import sparse

ti.init(arch=ti.cpu, debug=False)

@ti.func
def first_inner_loop(r, Yp: ti.template(), Yx: ti.template(), V: ti.template(), Yj: ti.template(), Vnorm: ti.template(), buffer: ti.template(), dS_dVm: ti.template(), dS_dVa: ti.template()):
    for k in range(Yp[r], Yp[r+1]):
        buffer[r] += Yx[k] * V[Yj[k]]
        dS_dVm[k] *= Vnorm[Yj[k]]
        dS_dVa[k] *= V[Yj[k]]
        
@ti.func
def second_inner_loop(r, Yp: ti.template(), Yx: ti.template(), V: ti.template(), Yj: ti.template(), Vnorm: ti.template(), buffer: ti.template(), dS_dVm: ti.template(), dS_dVa: ti.template(), Ibus: ti.template()):
    for k in range(Yp[r], Yp[r+1]):
        dS_dVm[k] *= V[r]
        if r == Yj[k]:
            dS_dVa[k] -= Ibus[r]
            dS_dVm[k] += buffer[r]
        dS_dVa[k] = -dS_dVa[k] * V[r]
        
@ti.kernel
def two_loops(Yp: ti.any_arr(), Yx: ti.any_arr(), V: ti.any_arr(), Yj: ti.any_arr(), Vnorm: ti.any_arr(), buffer: ti.any_arr(), dS_dVm: ti.any_arr(), dS_dVa: ti.any_arr(), Ibus: ti.any_arr()):
    for r in range(Yp.shape[0]-1):
        first_inner_loop(r, Yp, Yx, V, Yj, Vnorm, buffer, dS_dVm, dS_dVa)
        Ibus[r] += buffer[r]
        buffer[r] *= Vnorm[r]
    for r in range(Yp.shape[0]-1):
        second_inner_loop(r, Yp, Yx, V, Yj, Vnorm, buffer, dS_dVm, dS_dVa, Ibus)
        
def dSbus_dV_taichi(Ybus, V):
    dS_dVm = Ybus.data.astype(np.float64)
    dS_dVa = Ybus.data.astype(np.float64)
    buffer = np.ones(dtype=float, shape=len(V))
    I = np.ones(dtype=float, shape=len(V))

    two_loops(Ybus.indptr, data, V, Ybus.indices, Vnorm, buffer, dS_dVm, dS_dVa, I)
    #second_loop(Ybus.indptr, data, V, Ybus.indices, Vnorm, buffer, dS_dVm, dS_dVa, I)
    return sparse.csr_matrix((dS_dVm, Ybus.indices, Ybus.indptr)), sparse.csr_matrix((dS_dVa, Ybus.indices, Ybus.indptr))


Ybus = sparse.load_npz('Ybus.npz')
V = np.ones(2000)
Vnorm = V / abs(V)
data = Ybus.data.astype(np.float64)

start = time.time()
dSbus_dV_taichi(Ybus, V)
end = time.time()
print("Elapsed (with compilation) = %s" % (end - start))

start = time.time()
dSbus_dV_taichi(Ybus, V)
end = time.time()
print("Elapsed (after compilation) = %s" % (end - start))

Above code gives

Elapsed (with compilation) = 0.04335379600524902
Elapsed (after compilation) = 0.0003981590270996094

We are still around ~2x slower than Numba ;), but hopefully this gives you idea of how to proceed. Let us know if there are further questions. Comments/Suggestions are welcome.

qiao-bo avatar Dec 17 '21 04:12 qiao-bo

Wow this is definitely a tutorial-level answer and should be re-used in the document to maximize its value! Thank you for the detailed explanation @qiao-bo !

mzy2240 avatar Dec 17 '21 05:12 mzy2240

Now it seems the major part left is the official support for complex number and its basic arithmetic.

mzy2240 avatar Dec 17 '21 05:12 mzy2240

In case someone is interested to test the complex array version (a big step toward applying taichi in power system analysis!), here is the updated code that is modified based on the above code (thanks to @qiao-bo ) and supports complex numbers. Surprisingly, this version is faster than numba no matter with compilation or after compilation in the test environment (Intel i9-9980H Win10 Taichi 0.88). The code looks a little bit verbose due to the lack of native support of complex numbers and arrays, but at least it works and performs well. Of course, please let me know if you find anything that could use an optimization.

 import taichi as ti
import numpy as np
from scipy import sparse
from copy import deepcopy

ti.init(arch=ti.cpu, debug=False)

@ti.func
def first_inner_loop(r, Yp: ti.template(), Yx_real: ti.template(), Yx_imag: ti.template(), V_real: ti.template(), V_imag: ti.template(), Yj: ti.template(), Vnorm_real: ti.template(), Vnorm_imag: ti.template(), buffer_real: ti.template(), buffer_imag: ti.template(), dS_dVm_real: ti.template(), dS_dVm_imag: ti.template(), dS_dVa_real: ti.template(), dS_dVa_imag: ti.template()):
    for k in range(Yp[r], Yp[r+1]):
        buffer_real[r] += Yx_real[k] * V_real[Yj[k]] - Yx_imag[k] * V_imag[Yj[k]]
        buffer_imag[r] += Yx_real[k] * V_imag[Yj[k]] + Yx_imag[k] * V_real[Yj[k]]
        dS_dVm_real[k] = dS_dVm_real[k] * Vnorm_real[Yj[k]] - dS_dVm_imag[k] * Vnorm_imag[Yj[k]]
        dS_dVm_imag[k] = dS_dVm_real[k] * Vnorm_imag[Yj[k]] + dS_dVm_imag[k] * Vnorm_real[Yj[k]]
        dS_dVa_real[k] = dS_dVa_real[k] * V_real[Yj[k]] - dS_dVa_imag[k] * V_imag[Yj[k]]
        dS_dVa_imag[k] = dS_dVa_real[k] * V_imag[Yj[k]] + dS_dVa_imag[k] * V_real[Yj[k]]

@ti.func
def second_inner_loop(r, Yp: ti.template(), V_real: ti.template(), V_imag: ti.template(), Yj: ti.template(), buffer_real: ti.template(), buffer_imag: ti.template(), dS_dVm_real: ti.template(), dS_dVm_imag: ti.template(), dS_dVa_real: ti.template(), dS_dVa_imag: ti.template(), Ibus_real: ti.template(), Ibus_imag: ti.template()):
    for k in range(Yp[r], Yp[r+1]):
        dS_dVm_real[k] = dS_dVm_real[k] * V_real[r] + dS_dVm_imag[k] * V_imag[r]
        dS_dVm_imag[k] = dS_dVm_real[k] * V_imag[r] - dS_dVm_imag[k] * V_real[r]
        if r == Yj[k]:
            dS_dVa_real[k] = - Ibus_real[r] + dS_dVa_real[k]
            dS_dVa_imag[k] = - Ibus_imag[r] + dS_dVa_imag[k]
            dS_dVm_real[k] += buffer_real[r]
            dS_dVm_imag[k] += buffer_imag[r]
        dS_dVa_real_temp = dS_dVa_real[k]
        dS_dVa_imag_temp = dS_dVa_imag[k]
        dS_dVa_imag[k] = -dS_dVa_real_temp * V_real[r] - dS_dVa_imag_temp * V_imag[r]
        dS_dVa_real[k] = dS_dVa_real_temp * V_imag[r] - dS_dVa_imag_temp * V_real[r]


@ti.kernel
def two_loops(Yp: ti.any_arr(), Yx_real: ti.any_arr(), Yx_imag: ti.any_arr(), V_real: ti.any_arr(), V_imag: ti.any_arr(), Yj: ti.any_arr(), Vnorm_real: ti.any_arr(), Vnorm_imag: ti.any_arr(), buffer_real: ti.any_arr(), buffer_imag: ti.any_arr(), dS_dVm_real: ti.any_arr(), dS_dVm_imag: ti.any_arr(), dS_dVa_real: ti.any_arr(), dS_dVa_imag: ti.any_arr(), Ibus_real: ti.any_arr(), Ibus_imag: ti.any_arr()):
    for r in range(Yp.shape[0]-1):
        first_inner_loop(r, Yp, Yx_real, Yx_imag, V_real, V_imag, Yj, Vnorm_real, Vnorm_imag, buffer_real, buffer_imag, dS_dVm_real, dS_dVm_imag, dS_dVa_real, dS_dVa_imag)
        Ibus_real[r] += buffer_real[r]
        Ibus_imag[r] += buffer_imag[r]
        buffer_real[r] = buffer_real[r] * Vnorm_real[r] + buffer_imag[r] * Vnorm_imag[r]
        buffer_imag[r] = buffer_real[r] * Vnorm_imag[r] - buffer_imag[r] * Vnorm_real[r]
    for r in range(Yp.shape[0]-1):
        second_inner_loop(r, Yp, V_real, V_imag, Yj, buffer_real, buffer_imag, dS_dVm_real, dS_dVm_imag, dS_dVa_real, dS_dVa_imag, Ibus_real, Ibus_imag)

def dSbus_dV_taichi(Ybus, V):
    dS_dVm_real = deepcopy(Ybus.data.real)
    dS_dVm_imag = deepcopy(Ybus.data.imag)
    dS_dVa_real = deepcopy(Ybus.data.real)
    dS_dVa_imag = deepcopy(Ybus.data.imag)
    buffer_real = np.zeros(dtype=float, shape=len(V))
    buffer_imag = np.zeros(dtype=float, shape=len(V))
    I_real = np.zeros(dtype=float, shape=len(V))
    I_imag = np.zeros(dtype=float, shape=len(V))
    # V = np.ones(2000)
    Vnorm = V / abs(V)
    Vnorm_real = Vnorm.real
    Vnorm_imag = Vnorm.imag
    V_real = V.real
    V_imag = V.imag
    data_real = deepcopy(Ybus.data.real)
    data_imag = deepcopy(Ybus.data.imag)

    two_loops(Ybus.indptr, data_real, data_imag, V_real, V_imag, Ybus.indices, Vnorm_real, Vnorm_imag, buffer_real, buffer_imag, dS_dVm_real, dS_dVm_imag, dS_dVa_real, dS_dVa_imag, I_real, I_imag)
    #second_loop(Ybus.indptr, data, V, Ybus.indices, Vnorm, buffer, dS_dVm, dS_dVa, I)
    dS_dVm = np.empty(len(dS_dVm_real), dtype=np.complex128)
    dS_dVm.real = dS_dVm_real
    dS_dVm.imag = dS_dVm_imag
    dS_dVa = np.empty(len(dS_dVm_real), dtype=np.complex128)
    dS_dVa.real = dS_dVa_real
    dS_dVa.imag = dS_dVa_imag
    return sparse.csr_matrix((dS_dVm, Ybus.indices, Ybus.indptr)), sparse.csr_matrix((dS_dVa, Ybus.indices, Ybus.indptr))

Ybus = sparse.load_npz('Ybus.npz')  # Ybus is a complex128 sparse matrix
V = np.ones(2000)

import time
start = time.time()
for i in range(1):
    dSbus_dV_taichi(Ybus, V)
end = time.time()
print("Elapsed (with compilation) = %s" % (end - start))

start = time.time()
dSbus_dV_taichi(Ybus, V)
end = time.time()
print("Elapsed (after compilation) = %s" % (end - start))

mzy2240 avatar Dec 25 '21 06:12 mzy2240

Waiting for official support on complex numbers as well. Now I resort to 2D vec fields since I don't need complex number multiplication.

ifsheldon avatar Dec 30 '21 13:12 ifsheldon

Update:

I found out a serious bug in the example I gave above. Since I am computing the real and imaginary parts of each variable separately, it is important to freeze the values before computing if the value depends on both real and imaginary parts. I am not sure whether it is the most elegant and performant way to do it, but it works and the result exactly matches the numpy and numba version for complex jacobian matrix.

Original:

# As an example, other functions have same problem
@ti.func
def first_inner_loop(r, Yp: ti.template(), Yx_real: ti.template(), Yx_imag: ti.template(), V_real: ti.template(), V_imag: ti.template(), Yj: ti.template(), Vnorm_real: ti.template(), Vnorm_imag: ti.template(), buffer_real: ti.template(), buffer_imag: ti.template(), dS_dVm_real: ti.template(), dS_dVm_imag: ti.template(), dS_dVa_real: ti.template(), dS_dVa_imag: ti.template()):
    for k in range(Yp[r], Yp[r+1]):
        buffer_real[r] += Yx_real[k] * V_real[Yj[k]] - Yx_imag[k] * V_imag[Yj[k]]
        buffer_imag[r] += Yx_real[k] * V_imag[Yj[k]] + Yx_imag[k] * V_real[Yj[k]]
        dS_dVm_real[k] = dS_dVm_real[k] * Vnorm_real[Yj[k]] - dS_dVm_imag[k] * Vnorm_imag[Yj[k]]
        dS_dVm_imag[k] = dS_dVm_real[k] * Vnorm_imag[Yj[k]] + dS_dVm_imag[k] * Vnorm_real[Yj[k]]
        dS_dVa_real[k] = dS_dVa_real[k] * V_real[Yj[k]] - dS_dVa_imag[k] * V_imag[Yj[k]]
        dS_dVa_imag[k] = dS_dVa_real[k] * V_imag[Yj[k]] + dS_dVa_imag[k] * V_real[Yj[k]]

Corrected:

@ti.func
def first_inner_loop(r, Yp: ti.template(), Yx_real: ti.template(), Yx_imag: ti.template(), V_real: ti.template(), V_imag: ti.template(), Yj: ti.template(), Vnorm_real: ti.template(), Vnorm_imag: ti.template(), buffer_real: ti.template(), buffer_imag: ti.template(), dS_dVm_real: ti.template(), dS_dVm_imag: ti.template(), dS_dVa_real: ti.template(), dS_dVa_imag: ti.template()):
    for k in range(Yp[r], Yp[r+1]):
        buffer_real[r] += Yx_real[k] * V_real[Yj[k]] - Yx_imag[k] * V_imag[Yj[k]]
        buffer_imag[r] += Yx_real[k] * V_imag[Yj[k]] + Yx_imag[k] * V_real[Yj[k]]
        dS_dVm_real_temp = dS_dVm_real[k]  # freeze dS_dVm[k]
        dS_dVm_imag_temp = dS_dVm_imag[k]
        dS_dVm_real[k] = dS_dVm_real_temp * Vnorm_real[Yj[k]] - dS_dVm_imag_temp * Vnorm_imag[Yj[k]]
        dS_dVm_imag[k] = dS_dVm_real_temp * Vnorm_imag[Yj[k]] + dS_dVm_imag_temp * Vnorm_real[Yj[k]]
        dS_dVa_real_temp = dS_dVa_real[k]  # freeze dS_dVa[k]
        dS_dVa_imag_temp = dS_dVa_imag[k]
        dS_dVa_real[k] = dS_dVa_real_temp * V_real[Yj[k]] - dS_dVa_imag_temp * V_imag[Yj[k]]
        dS_dVa_imag[k] = dS_dVa_real_temp * V_imag[Yj[k]] + dS_dVa_imag_temp * V_real[Yj[k]]

Another update: The correction seems to make the performance noticeably worse, unfortunately. Wondering if there is a better way to do that.

------------------------------------------------------Update----------------------------------------------------------- The above corrected version could be further improved, cause only the real part needs freeze.

@ti.func
def first_inner_loop(r, Yp: ti.template(), Yx_real: ti.template(), Yx_imag: ti.template(), V_real: ti.template(), V_imag: ti.template(), Yj: ti.template(), Vnorm_real: ti.template(), Vnorm_imag: ti.template(), buffer_real: ti.template(), buffer_imag: ti.template(), dS_dVm_real: ti.template(), dS_dVm_imag: ti.template(), dS_dVa_real: ti.template(), dS_dVa_imag: ti.template()):
    for k in range(Yp[r], Yp[r+1]):
        buffer_real[r] += Yx_real[k] * V_real[Yj[k]] - Yx_imag[k] * V_imag[Yj[k]]
        buffer_imag[r] += Yx_real[k] * V_imag[Yj[k]] + Yx_imag[k] * V_real[Yj[k]]
        dS_dVm_real_temp = dS_dVm_real[k]  # freeze dS_dVm[k]
        dS_dVm_real[k] = dS_dVm_real_temp * Vnorm_real[Yj[k]] - dS_dVm_imag[k] * Vnorm_imag[Yj[k]]
        dS_dVm_imag[k] = dS_dVm_real_temp * Vnorm_imag[Yj[k]] + dS_dVm_imag[k] * Vnorm_real[Yj[k]]
        dS_dVa_real_temp = dS_dVa_real[k]  # freeze dS_dVa[k]
        dS_dVa_real[k] = dS_dVa_real_temp * V_real[Yj[k]] - dS_dVa_imag[k] * V_imag[Yj[k]]
        dS_dVa_imag[k] = dS_dVa_real_temp * V_imag[Yj[k]] + dS_dVa_imag[k] * V_real[Yj[k]]

mzy2240 avatar Jan 12 '22 17:01 mzy2240

@qiao-bo In the above example you gave, the external numpy array are directly passed to the taichi kernel, and I assume they are in soa pattern. What if I want to pass those numpy arrays in array of structs (aos) pattern (let's say they all have the same length) so that I could better utilize cache for better efficiency, what might be the most efficient (best?) way to convert the numpy array into aos in taichi kernel? How much performance gain should we expect? Thanks!

mzy2240 avatar Jan 29 '22 23:01 mzy2240

@mzy2240, how much performance degradation did you observe for the corrected code? (BTW, the imag part seems do not need any freeze, only the real part is needed?).

Regarding to the AOS/SOA pattern, since these arrays are passed as independent parameters, we can assume neither AOS nor SOA. If you want to explicitly control the memory layout, you have to use field. Taichi field is designed to allow flexible data layout composition. check out the docs here.

qiao-bo avatar Feb 07 '22 03:02 qiao-bo

@mzy2240, how much performance degradation did you observe for the corrected code? (BTW, the imag part seems do not need any freeze, only the real part is needed?).

I observed a 10X degradation and you are right, only the real part needs freeze. (so the actual performance difference for the same function between taichi and numba will be somewhere between 1 and 10 I guess)

Regarding to the AOS/SOA pattern, since these arrays are passed as independent parameters, we can assume neither AOS nor SOA. If you want to explicitly control the memory layout, you have to use field. Taichi field is designed to allow flexible data layout composition. check out the docs here.

Does that mean in order to use the field, I have to do the explicit conversion first which will unfortunately introduce overhead again? If we choose to neglect the conversion overhead, how much performance gain could be expected by using AOS over SOA in the above example?

mzy2240 avatar Feb 07 '22 03:02 mzy2240

I observed a 10X degradation and you are right, only the real part needs freeze. (so the actual performance difference for the same function between taichi and numba will be somewhere between 1 and 10 I guess)

that is a bit strange. I will have a look at this and come back. The freeze part is the only modification compared to the previous code, right?

Does that mean in order to use the field, I have to do the explicit conversion first which will unfortunately introduce overhead again? If we choose to neglect the conversion overhead, how much performance gain could be expected by using AOS over SOA in the above example?

yep, Field requires copy from Numpy, that is indeed some overhead. It's really hard to give a number here ;/ since the speedup depends on the cache size, array length, etc. i guess we will have to try and see ;/

qiao-bo avatar Feb 07 '22 03:02 qiao-bo

that is a bit strange. I will have a look at this and come back. The freeze part is the only modification compared to the previous code, right?

Yes freeze part is the only modification, but the above modified single function is just an example, I did the same modification for all the functions. And yes I feel surprised as well when I had the results.

yep, Field requires copy from Numpy, that is indeed some overhead. It's really hard to give a number here ;/ since the speedup depends on the cache size, array length, etc. i guess we will have to try and see ;/

Yeah cache size and array length definitely will have an impact on the performance. The reason why I ask is that I feel the above functions might still be computation-bounded, so savings on data access may not be that noticeable.

mzy2240 avatar Feb 07 '22 03:02 mzy2240

Hi, first sorry for the late reply. Regarding to the performance difference, I just ran the previous code with and without "freeze dS_dVm[k]". On my local PC with i9-11900K there seems not much differences. I got something like

without

[Taichi] version 0.9.1, llvm 10.0.0, commit e9e15196, linux, python 3.8.12                                                     
[Taichi] Starting on arch=x64                                                                                                  
Elapsed (with compilation) = 0.05987668037414551                                                                               
Elapsed (after compilation) = 0.0004096031188964844                                                                                                                                                  

with

[Taichi] version 0.9.1, llvm 10.0.0, commit e9e15196, linux, python 3.8.12                                                     
[Taichi] Starting on arch=x64                                                                                                  
Elapsed (with compilation) = 0.05990171432495117                                                                               
Elapsed (after compilation) = 0.0004048347473144531                                                                            

can you maybe try again if the performance differences still exit?

qiao-bo avatar Feb 25 '22 01:02 qiao-bo

This issue seems to be ignored for a long time. Do we have any updates?

ifsheldon avatar Sep 20 '22 18:09 ifsheldon

@ifsheldon I can see they are adding some support on basic complex operations, not sure when the complex number types will be supported tho

mzy2240 avatar Sep 20 '22 18:09 mzy2240