
Why is it so much slower with the GPU turned on than not?

xzlinux opened this issue 1 year ago · 2 comments

```python
import taichi as ti
import numpy as np

ti.init(arch=ti.gpu)

benchmark = True
N = 15000
if benchmark:
    a_numpy = np.random.randint(0, 100, N, dtype=np.int32)
    b_numpy = np.random.randint(0, 100, N, dtype=np.int32)
else:
    a_numpy = np.array([0, 1, 0, 2, 4, 3, 1, 2, 1], dtype=np.int32)
    b_numpy = np.array([4, 0, 1, 4, 5, 3, 1, 2], dtype=np.int32)

# DP table for the longest common subsequence
f = ti.field(dtype=ti.i32, shape=(N + 1, N + 1))

@ti.kernel
def compute_lcs(a: ti.types.ndarray(), b: ti.types.ndarray()) -> ti.i32:
    len_a, len_b = a.shape[0], b.shape[0]
    ti.loop_config(serialize=True)
    for i in range(1, len_a + 1):
        for j in range(1, len_b + 1):
            f[i, j] = ti.max(f[i - 1, j - 1] + (a[i - 1] == b[j - 1]),
                             ti.max(f[i - 1, j], f[i, j - 1]))
    return f[len_a, len_b]

print(compute_lcs(a_numpy, b_numpy))
```

The following is the timing when the GPU is not started (CPU only):

[screenshot: timing results]

xzlinux · Sep 11 '24

Part of the problem seems to be that you're trying to run a serial algorithm on the GPU: your outer loop has `ti.loop_config(serialize=True)`. For the kernel to be fast on the GPU, the outer loop needs to be the parallel one (see the sketch below).
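For illustration, here is a minimal sketch of one standard way to expose parallelism in this particular DP: a wavefront sweep over anti-diagonals. Cells with the same `i + j` depend only on earlier diagonals, so each diagonal can be filled by a parallel top-level loop. The names `lcs_diagonal` and `compute_lcs_wavefront` are made up for the example, and the per-diagonal kernel launches add their own overhead, so treat this as a sketch of the pattern rather than a tuned solution:

```python
import taichi as ti
import numpy as np

ti.init(arch=ti.gpu)

N = 15000
f = ti.field(dtype=ti.i32, shape=(N + 1, N + 1))

@ti.kernel
def lcs_diagonal(d: ti.i32, a: ti.types.ndarray(), b: ti.types.ndarray()):
    len_a, len_b = a.shape[0], b.shape[0]
    # Top-level loop: Taichi parallelizes this over the cells of diagonal d.
    for i in range(1, len_a + 1):
        j = d - i
        if j >= 1 and j <= len_b:
            # Only reads diagonals d-1 and d-2, so cells on diagonal d
            # are independent of each other.
            f[i, j] = ti.max(f[i - 1, j - 1] + (a[i - 1] == b[j - 1]),
                             ti.max(f[i - 1, j], f[i, j - 1]))

def compute_lcs_wavefront(a_np: np.ndarray, b_np: np.ndarray) -> int:
    len_a, len_b = a_np.shape[0], b_np.shape[0]
    # Diagonals still run one after another; the parallelism is within each.
    for d in range(2, len_a + len_b + 1):
        lcs_diagonal(d, a_np, b_np)
    return f[len_a, len_b]
```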

Another thing that can help (though I don't think it's relevant here) is to use data that already resides on the GPU to avoid transfers, e.g. a torch tensor on the device avoids the copy.
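As a rough illustration of that point (a sketch assuming a CUDA build of torch; the `scale` kernel is just a placeholder):

```python
import torch
import taichi as ti

ti.init(arch=ti.gpu)

@ti.kernel
def scale(x: ti.types.ndarray()):
    # Top-level loop over the ndarray's elements, parallelized by Taichi.
    for i in x:
        x[i] *= 2.0

# The tensor is allocated on the GPU, so passing it into the kernel
# avoids a host-to-device copy.
x = torch.rand(1_000_000, device="cuda")
scale(x)
```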

oliver-batchelor · Nov 07 '24

Hi, just to add to @oliver-batchelor's comment: the benefits of the GPU are only really felt at large data sizes. The CPU may actually be faster for small amounts of data because it has a faster clock. Basically, the GPU is faster at doing massive numbers of small tasks in parallel, but each individual task will likely be no faster than on the CPU (and may be slower). Try increasing your data size by a lot and see how the performance changes :) (A rough way to measure this is sketched below.)
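For example, a timing harness along these lines (the `bench` and `saxpy` names and the sizes are made up for illustration; `ti.sync()` makes sure GPU kernels have actually finished before the clock stops):

```python
import time
import taichi as ti

def bench(arch, n, iters=100):
    ti.init(arch=arch)
    x = ti.field(dtype=ti.f32, shape=n)

    @ti.kernel
    def saxpy():
        for i in x:  # parallel top-level loop
            x[i] = 2.0 * x[i] + 1.0

    saxpy()    # warm-up run, so JIT compilation isn't timed
    ti.sync()
    t0 = time.perf_counter()
    for _ in range(iters):
        saxpy()
    ti.sync()  # wait for all launched kernels to complete
    return (time.perf_counter() - t0) / iters

# The GPU should pull ahead as n grows; the CPU often wins at small n.
for n in (10_000, 1_000_000, 10_000_000):
    print(n, bench(ti.cpu, n), bench(ti.gpu, n))
```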

HJWoods · Jan 28 '25