ndrange is limited in size
Describe the bug
When using nd-indexing (with ti.ndrange), not all indices are computed. For instance, ti.ndrange(M, M) with M = 100000 only yields i indices between 1 and 14100.
To Reproduce
import torch
import taichi as ti
import taichi.math as tm

ti.init(arch=ti.gpu)

@ti.kernel
def sum_exp(supp_x: ti.types.ndarray(), supp_y: ti.types.ndarray(), out: ti.types.ndarray()):  # A kernel
    for i, j in ti.ndrange(supp_x.shape[0], supp_y.shape[0]):
        out[i] += tm.exp(-ti.pow(supp_y[j] - supp_x[i], 2) / 2.)

def sum_exp_torch(x, y):
    out = torch.zeros(x.shape[0]).to(x.device)
    sum_exp(x, y, out)
    return out

x = torch.linspace(0., 1., 100000).cuda()
y = torch.linspace(0., 1., 100000).cuda()
out = sum_exp_torch(x, y)
In this example, out[14101:] is zero.
Additional comments
The example I provided is on GPU, but I experience the same problem on CPU, and across multiple machines with different hardware. Also, using two nested loops resolves the problem (see the sketch below), but it loses performance, at least on small matrices, and I guess it also forfeits some optimisations like TLS or shared memory?
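For reference, here is a minimal sketch of that nested-loop workaround (the kernel name sum_exp_nested and the local accumulator are my additions, not from the original report). Only the outermost loop is parallelized by Taichi, so no single loop range approaches M*M:

@ti.kernel
def sum_exp_nested(supp_x: ti.types.ndarray(), supp_y: ti.types.ndarray(), out: ti.types.ndarray()):
    for i in range(supp_x.shape[0]):       # outermost loop: parallelized by Taichi
        acc = 0.0
        for j in range(supp_y.shape[0]):   # inner loop: runs serially per thread
            acc += tm.exp(-ti.pow(supp_y[j] - supp_x[i], 2) / 2.)
        out[i] = acc

Accumulating into a local variable also replaces the atomic out[i] += of the ndrange version with a single store per i.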
It seems that M*M has overflowed int32. It may be difficult to resolve, since some backends do not support int64. We will investigate it later.
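A back-of-the-envelope check (assuming the flattened loop length simply wraps around the 32-bit counter) is consistent with the observed cutoff at i == 14100:

# Plain Python, no Taichi needed: M*M wraps around the int32 counter.
M = 100000
wrapped = (M * M) % 2**32    # 10_000_000_000 iterations requested,
                             # 1_410_065_408 actually run after wrap-around
print(wrapped // M)          # 14100 -> largest i index that gets executed,
                             # matching the observed out[14101:] == 0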
Hello, I think a warning should pop up when this happens on a backend that cannot support the range. There are actually many people who use Taichi for numerical calculations on CUDA or x86 CPUs, so I hope this restriction can be relaxed.