Dynamic node raises CUDA error with Torch training code
Environment:
- [Taichi] version 1.0.0, llvm 10.0.0, commit 6a15da85, linux, python 3.8.5
- [Torch] version 1.8.1+cu101
- [GPU] GTX 1080 Ti
Description: Because of issue #4937, I have to use a dynamic node, but it raises a CUDA error after some iterations:
```
[E 05/10/22 16:10:18.990 50390] [cuda_driver.h:operator()@87] CUDA Error CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered while calling stream_synchronize (cuStreamSynchronize)
```
On Windows with an RTX 2060 (8 GB), it raises a different error. (The code works when the dynamic node is replaced with a dense field, or when `loss.backward()` is removed.)

```
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution
```
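Both errors can surface far from the real fault, because CUDA reports failures asynchronously. A small debugging step (an assumption about where the fault surfaces, not a fix) is to force synchronous kernel launches before any CUDA library initializes, so the traceback points at the actual failing call:

```python
import os

# Must be set before torch/taichi initialize CUDA; with synchronous
# launches the illegal-address error is raised at the offending call
# instead of at a later cuStreamSynchronize.
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

print(os.environ['CUDA_LAUNCH_BLOCKING'])  # 1
```

With this set, the process runs slower but the first CUDA API call after the bad access should raise, narrowing down whether the Taichi kernel or a Torch op triggers it.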
Sample code:

```python
import torch
import torch.nn.functional as F
import torchvision
import taichi as ti
import logging

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(message)s')

ti.init(arch=ti.cuda, device_memory_GB=2)

grp_res = ti.field(ti.i32)
_grp_res_pixel = ti.root.dynamic(ti.i, 32 * 1024)
_grp_res_pixel.place(grp_res)
# Dense variant that does not crash:
# _grp_res_pixel = ti.root.dense(ti.i, 32 * 1024)
# _grp_res_pixel.place(grp_res)

device = 'cuda'
model = torchvision.models.resnet18().to(device)
model.train()
optimizer = torch.optim.Adam(model.parameters())

@ti.kernel
def fake_deactivation():
    for i in range(grp_res.shape[0]):
        grp_res[i] = 0

for i in range(100):
    _grp_res_pixel.deactivate_all()
    # fake_deactivation()
    x = torch.randn(32, 3, 224, 224).to(device)
    y = model(x)
    target = torch.randint(1000, (32,), dtype=torch.int64).to(device)
    loss = F.cross_entropy(y, target)
    model.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 5)
    optimizer.step()
    logging.info(f'{i}: {loss.item()}')
```
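To rule out the Torch side on its own, here is a minimal CPU-only sketch of the same training-loop structure (a tiny linear model stands in for resnet18; the shapes, sizes, and iteration count are illustrative assumptions). If this runs cleanly while the GPU version crashes, the fault lies in the CUDA interaction rather than the loop logic:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Linear(8, 4)   # stand-in for resnet18
model.train()
optimizer = torch.optim.Adam(model.parameters())

for i in range(3):
    x = torch.randn(16, 8)
    target = torch.randint(4, (16,), dtype=torch.int64)
    loss = F.cross_entropy(model(x), target)
    model.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 5)
    optimizer.step()

print(torch.isfinite(loss).item())  # True
```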
Not sure if this is related to the shared CUDA context issue; see https://github.com/taichi-dev/taichi/issues/2190.
Were you able to figure this out?