dynamic node raises CUDA error with torch training code

Open jhonsonlaid opened this issue 2 years ago • 2 comments

  • Environment:
    • [Taichi] version 1.0.0, llvm 10.0.0, commit 6a15da85, linux, python 3.8.5
    • [Torch]: 1.8.1+cu101
    • [GPU]: 1080 Ti
  • Description: because of issue #4937, I have to use a dynamic node, but it raises a CUDA error after some iterations.

[E 05/10/22 16:10:18.990 50390] [cuda_driver.h:operator()@87] CUDA Error CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered while calling stream_synchronize (cuStreamSynchronize)

On Windows with an RTX 2060 (8 GB), it raises a different error instead. (The code works when the dynamic node is replaced with a dense field, or when loss.backward() is removed; see the sketch after the sample code below.)

RuntimeError: Unable to find a valid cuDNN algorithm to run convolution
  • Sample code:
import torch
import torch.nn.functional as F
import torchvision
import taichi as ti
import logging

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(message)s')

ti.init(arch=ti.cuda, device_memory_GB=2)
grp_res = ti.field(ti.i32)
_grp_res_pixel = ti.root.dynamic(
    ti.i, 32 * 1024)
_grp_res_pixel.place(grp_res)

# Workaround: replacing the dynamic node with a dense field avoids the error:
# _grp_res_pixel = ti.root.dense(
#     ti.i, 32 * 1024)
# _grp_res_pixel.place(grp_res)

device = 'cuda'
model = torchvision.models.resnet18().to(device)
model.train()
optimizer = torch.optim.Adam(model.parameters())


# Alternative reset that zeroes all entries instead of deactivating the SNode
@ti.kernel
def fake_deactivation():
    for i in range(grp_res.shape[0]):
        grp_res[i] = 0


for i in range(100):
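    # clear the dynamic SNode before each training step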
    _grp_res_pixel.deactivate_all()
    # fake_deactivation()

    x = torch.randn(32, 3, 224, 224).to(device)
    y = model(x)
    target = torch.randint(1000, (32,), dtype=torch.int64).to(device)
    loss = F.cross_entropy(y, target)

    model.zero_grad()
    loss.backward()

    torch.nn.utils.clip_grad_norm_(
        model.parameters(), 5)

    optimizer.step()
    logging.info(f'{i}: {loss.item()}')
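
For reference, a minimal sketch of the dense-field workaround mentioned above, assuming the same 32 * 1024 layout (this is only how I read the workaround, not a confirmed fix):

import taichi as ti

ti.init(arch=ti.cuda, device_memory_GB=2)

# Dense layout in place of the dynamic node
grp_res = ti.field(ti.i32)
ti.root.dense(ti.i, 32 * 1024).place(grp_res)

for i in range(100):
    grp_res.fill(0)  # reset the field instead of deactivate_all()
    # ... same forward/backward/optimizer steps as in the sample above ...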

jhonsonlaid avatar May 10 '22 08:05 jhonsonlaid

Not sure if this is related to the shared CUDA context issue, see https://github.com/taichi-dev/taichi/issues/2190
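
If it is, one thing that might be worth trying (purely an assumption on my part, not a verified fix) is to let PyTorch create its CUDA context before Taichi initializes, and to keep Taichi's pre-allocated device memory small:

import torch
import taichi as ti

# Assumption: forcing PyTorch to set up its CUDA context before ti.init gives
# the two runtimes a fixed initialization order.
torch.cuda.init()

# Keep Taichi's pre-allocated device memory small so both runtimes fit on the GPU.
ti.init(arch=ti.cuda, device_memory_GB=2)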

k-ye avatar May 13 '22 07:05 k-ye

Were you able to figure this out?

keunhong avatar May 03 '24 06:05 keunhong