[diff] Error when using a big inner loop in an autodiff kernel
Describe the bug An error is raised when writing a big inner loop in a kernel and using the CPU or CUDA backend.
To Reproduce
import taichi as ti

ti.init(arch=ti.gpu)

@ti.kernel
def compute():
    for i in range(10):
        weight_sum = 0.0
        for j in range(500):
            weight_sum += 1.0

compute()
compute.grad()
I removed all fields from this example, since doing so does not affect the reproduction of the problem.
Log/Screenshots For the CUDA backend:
$ python3 ti_diff.py
[Taichi] version 1.6.0, llvm 15.0.4, commit f1c6fbbd, linux, python 3.8.10
[Taichi] Starting on arch=cuda
[E 05/16/23 22:28:45.039 54619] [cuda_driver.h:operator()@92] CUDA Error CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered while calling stream_synchronize (cuStreamSynchronize)
terminate called after throwing an instance of 'std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >'
Aborted (core dumped)
For the CPU backend:
$ python3 ti_diff.py
[Taichi] version 1.6.0, llvm 15.0.4, commit f1c6fbbd, linux, python 3.8.10
[Taichi] Starting on arch=x64
Segmentation fault (core dumped)
Additional comments This error only occurs when the inner loop count is large enough. On my machine (Intel i7-9700K x8 with RTX 2070s), the error occurs when n >= 147 for CUDA and n >= 31 for CPU, where n is the inner loop count in range(n).
When I use the Vulkan backend, this error never occurs. So does only the LLVM backend use local registers for the inner loop in grad?
Strangely, when I slightly modified the program, the loop-count threshold changed.
@ti.kernel
def compute():
    for i in range(10):
        weight_sum = 0.0
        for j in range(500):
            if j >= 0:  # only added this line
                weight_sum += 1.0
Now the error occurs when n >= 179 for CUDA and n >= 32 for CPU.
This example comes from my actual project, and for now I can only work around the problem by switching to the Vulkan backend. A fix for the CUDA backend would be very helpful to me, thank you very much.
@Dinngger It crashes because the mutation chain is too long, so if you do ti.init(ad_stack_size=512) it works.
Note that although Vulkan (and the other SPIR-V-based backends) didn't error out, they may produce wrong results due to the lack of mutation stack recording. So the recommended backends for autodiff are the CPU/CUDA backends.
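Putting the two suggestions together, here is a minimal sketch of the workaround applied to the original reproducer (the value 512 is taken from the comment above; arch=ti.cuda is just one of the two recommended backends, ti.cpu works the same way):

import taichi as ti

# Stay on an LLVM backend (CPU/CUDA) and enlarge the AD stack so that the
# long mutation chain in the inner loop can be recorded during grad.
ti.init(arch=ti.cuda, ad_stack_size=512)

@ti.kernel
def compute():
    for i in range(10):
        weight_sum = 0.0
        for j in range(500):
            weight_sum += 1.0

compute()
compute.grad()

With the enlarged AD stack, the reproducer above runs without the illegal-address/segfault crash.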
I still have some doubts. In the following example, the results of autodiff and my manual gradient are inconsistent. Perhaps the autodiff for reduce operations needs to be reimplemented?
import numpy as np
import taichi as ti

ti.init(arch=ti.gpu, debug=True, ad_stack_size=512)

N = 2
M = 2
x = ti.field(ti.f32, shape=(N, M), needs_grad=True)
x.from_numpy(np.random.randn(N, M).astype(np.float32))
my_x_grad = ti.field(ti.f32, shape=(N, M))
loss = ti.field(ti.f32, shape=(), needs_grad=True)

@ti.kernel
def compute():
    for i in range(N):
        x_sum = 0.0
        for j in range(M):
            x_sum += x[i, j]
        loss[None] += ti.exp(x_sum)

@ti.kernel
def compute_grad():
    for i in range(N):
        # forward again
        x_sum = 0.0
        for j in range(M):
            x_sum += x[i, j]
        # backward: d(exp(x_sum))/dx[i, j] = exp(x_sum)
        x_sum_grad = loss.grad[None] * ti.exp(x_sum)
        for j in range(M):
            my_x_grad[i, j] = x_sum_grad

loss[None] = 0
compute()
loss.grad[None] = 1
compute.grad()
compute_grad()
print(x.to_numpy())
print(x.grad.to_numpy())
print(my_x_grad.to_numpy())
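For reference, a short NumPy check of the analytic gradient, appended to the script above (my addition, not part of the original report): since loss = sum_i exp(sum_j x[i, j]), every entry in row i has gradient exp(sum_j x[i, j]).

# Analytic check (illustrative): d loss / d x[i, j] = exp(sum_j x[i, j]).
x_np = x.to_numpy()
row_exp = np.exp(x_np.sum(axis=1, keepdims=True))     # shape (N, 1)
expected_grad = np.broadcast_to(row_exp, x_np.shape)  # shape (N, M)
print(expected_grad)  # should match both x.grad and my_x_grad when correct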
@Dinngger This is indeed a bug, thanks for reporting. Taking a look...
Any updates on this? Or a workaround?
I believe I'm dealing with the same issue, though I don't have it isolated in my script quite so nicely. I can, however, reproduce the originally reported issues above in 1.6.0 and 1.7.0.