
Torch 1.12.0 and Taichi cannot use CUDA at the same time.

Open xuan-li opened this issue 2 years ago • 4 comments

Describe the bug

If Taichi is initialized on the GPU, PyTorch cannot run the backward pass.

PyTorch version: 1.12.0

To Reproduce

import taichi as ti
import torch

device = torch.device("cuda:0")
ti.init(arch=ti.gpu)

x = torch.tensor([1.], requires_grad=True, device=device)
loss = x ** 2
loss.backward()

Log/Screenshots

[Taichi] version 1.0.4, llvm 10.0.0, commit 2827db2c, linux, python 3.9.12
[Taichi] Starting on arch=cuda
Traceback (most recent call last):
  File "******", line 10, in <module>
    loss.backward()
  File "******/lib/python3.9/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "******/lib/python3.9/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Event device type CUDA does not match blocking stream's device type CPU.

Additional comments

Output of ti diagnose:

[Taichi] version 1.0.4, llvm 10.0.0, commit 2827db2c, linux, python 3.9.12

*******************************************
**      Taichi Programming Language      **
*******************************************

Docs:   https://docs.taichi-lang.org/
GitHub: https://github.com/taichi-dev/taichi/
Forum:  https://forum.taichi.graphics/

Taichi system diagnose:

python: 3.9.12 (main, Jun  1 2022, 11:38:51) 
[GCC 7.5.0]
system: linux
executable: /home/xuan/miniconda3/envs/dl/bin/python
platform: Linux-5.13.0-51-generic-x86_64-with-glibc2.31
architecture: 64bit ELF
uname: uname_result(system='Linux', node='Wanzi', release='5.13.0-51-generic', version='#58~20.04.1-Ubuntu SMP Tue Jun 14 11:29:12 UTC 2022', machine='x86_64')
locale: en_US.UTF-8
PATH: /home/xuan/miniconda3/envs/dl/bin:/home/xuan/miniconda3/condabin:/snap/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
PYTHONPATH: ['/home/xuan/miniconda3/envs/dl/bin', '/home/xuan/miniconda3/envs/dl/lib/python39.zip', '/home/xuan/miniconda3/envs/dl/lib/python3.9', '/home/xuan/miniconda3/envs/dl/lib/python3.9/lib-dynload', '/home/xuan/miniconda3/envs/dl/lib/python3.9/site-packages']

No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.4 LTS
Release:	20.04
Codename:	focal



import: <module 'taichi' from '/home/xuan/miniconda3/envs/dl/lib/python3.9/site-packages/taichi/__init__.py'>

cc: False
cpu: True
metal: False
opengl: True
cuda: True
MESA-INTEL: warning: Performance support disabled, consider sysctl dev.i915.perf_stream_paranoid=0

vulkan: True

`glewinfo` not available: [Errno 2] No such file or directory: 'glewinfo'

Sat Jul 23 20:38:44 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 20%   37C    P0    N/A /  75W |    241MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1115      G   /usr/lib/xorg/Xorg                 35MiB |
|    0   N/A  N/A      1711      G   /usr/lib/xorg/Xorg                120MiB |
|    0   N/A  N/A      1847      G   /usr/bin/gnome-shell                9MiB |
|    0   N/A  N/A      2284      G   ...AAAAAAAAA= --shared-files        6MiB |
+-----------------------------------------------------------------------------+

[Taichi] version 1.0.4, llvm 10.0.0, commit 2827db2c, linux, python 3.9.12

[Taichi] version 1.0.4, llvm 10.0.0, commit 2827db2c, linux, python 3.9.12
[Taichi] Starting on arch=x64

[Taichi] version 1.0.4, llvm 10.0.0, commit 2827db2c, linux, python 3.9.12
[Taichi] Starting on arch=opengl

[Taichi] version 1.0.4, llvm 10.0.0, commit 2827db2c, linux, python 3.9.12
[Taichi] Starting on arch=cuda

[Taichi] version 1.0.4, llvm 10.0.0, commit 2827db2c, linux, python 3.9.12

*******************************************
**      Taichi Programming Language      **
*******************************************

Docs:   https://docs.taichi-lang.org/
GitHub: https://github.com/taichi-dev/taichi/
Forum:  https://forum.taichi.graphics/

                                 TAICHI EXAMPLES                                  
 ──────────────────────────────────────────────────────────────────────────────── 
  0: ad_gravity               21: keyboard                42: odop_solar          
  1: comet                    22: laplace                 43: patterns            
  2: cornell_box              23: mandelbrot_zoom         44: pbf2d               
  3: diff_sph                 24: marching_squares        45: physarum            
  4: euler                    25: mass_spring_3d_ggui     46: print_offset        
  5: explicit_activation      26: mass_spring_game        47: rasterizer          
  6: export_mesh              27: mass_spring_game_ggui   48: regression          
  7: export_ply               28: mciso_advanced          49: sdf_renderer        
  8: export_videos            29: mgpcg                   50: simple_derivative   
  9: fem128                   30: mgpcg_advanced          51: simple_texture      
  10: fem128_ggui             31: minimal                 52: simple_uv           
  11: fem99                   32: minimization            53: stable_fluid        
  12: fractal                 33: mpm128                  54: stable_fluid_ggui   
  13: fractal3d_ggui          34: mpm128_ggui             55: stable_fluid_graph  
  14: fullscreen              35: mpm3d                   56: taichi_bitmasked    
  15: game_of_life            36: mpm3d_ggui              57: taichi_dynamic      
  16: gui_image_io            37: mpm88                   58: taichi_logo         
  17: gui_widgets             38: mpm88_graph             59: taichi_sparse       
  18: implicit_fem            39: mpm99                   60: tutorial            
  19: implicit_mass_spring    40: mpm_lagrangian_forces   61: vortex_rings        
  20: initial_value_problem   41: nbody                   62: waterwave           
 ──────────────────────────────────────────────────────────────────────────────── 
Running example minimal ...
[Taichi] Starting on arch=x64
42.0
>>> Running time: 0.16s
42

Consider attaching this log when maintainers ask about system information.
>>> Running time: 9.21s

xuan-li avatar Jul 24 '22 03:07 xuan-li

Taichi can work with PyTorch 1.10.0.

xuan-li avatar Jul 24 '22 03:07 xuan-li

I reproduced the error too. @erizmr Can you look into this error?

lin-hitonami avatar Jul 25 '22 08:07 lin-hitonami

I am looking into it.

erizmr avatar Jul 25 '22 08:07 erizmr

FYI: https://github.com/taichi-dev/taichi/issues/2190 and https://github.com/taichi-dev/taichi/issues/4944

k-ye avatar Jul 25 '22 08:07 k-ye

Hi, I have also met the same error. I've tried different pytorch versions - it seems 1.11 and 1.12 have this issue, while 1.10 does not.

pableeto avatar Aug 21 '22 03:08 pableeto

I did some inspection, inspired by this PyTorch issue. First, pip install cuda-python.

Full code with cuda driver helpers:

import torch
import taichi as ti
from cuda import cuda, cudart

def ASSERT_DRV(err):
    """
    This is a helper function to turn CUDA messages into errors when
    appropriate, since by default the CUDA package doesn't raise
    Python errors, it returns error messages
    """
    if isinstance(err, cuda.CUresult):
        if err != cuda.CUresult.CUDA_SUCCESS:
            raise RuntimeError("Cuda Error: {}".format(err))
    elif isinstance(err, cudart.cudaError_t):
        if err != cudart.cudaError_t.cudaSuccess:
            raise RuntimeError("Cudart Error: {}".format(err))
    else:
        raise RuntimeError("Unknown error type: {}".format(err))


def print_existing_contexts():
    valid_contexts = []
    while True:
        err, cuda_context = cuda.cuCtxPopCurrent()
        try:
            ASSERT_DRV(err)
        except RuntimeError:
            break
        else:
            valid_contexts.append(cuda_context)

    print("Existing, valid contexts: ", valid_contexts)

    for curr_ctx in reversed(valid_contexts):
        err, = cuda.cuCtxPushCurrent(curr_ctx)
        ASSERT_DRV(err)
        
device = torch.device("cuda:0")
print("===AFTER TORCH DEVICE INIT===")
print_existing_contexts()
print(torch._C._cuda_hasPrimaryContext(0))
x = torch.tensor([1.], requires_grad=True, device=device)
print("===AFTER TORCH TENSOR INIT===")
print_existing_contexts()
ti.init(arch=ti.gpu, log_level=ti.TRACE)
print("===AFTER TI INIT===")
print_existing_contexts()
print("Torch has primary context", torch._C._cuda_hasPrimaryContext(0))
loss = x**2
loss.backward()
print(torch._C._cuda_hasPrimaryContext(0))

This version, which creates the PyTorch tensor before calling ti.init, works. What we see from the log: (image)

Taichi ignored the PyTorch CUDA context and created its own. If we change the initialization order:

ti.init(arch=ti.gpu, log_level=ti.TRACE)
print("===AFTER TI INIT===")
print_existing_contexts()
device = torch.device("cuda:0")
print("===AFTER TORCH DEVICE INIT===")
print_existing_contexts()
print(torch._C._cuda_hasPrimaryContext(0))
x = torch.tensor([1.], requires_grad=True, device=device)
print("===AFTER TORCH TENSOR INIT===")
print_existing_contexts()
# print("TAICHI", ti._lib.core.get_primary_ctx_state())
print("Torch has primary context", torch._C._cuda_hasPrimaryContext(0))
loss = x**2
loss.backward()
print(torch._C._cuda_hasPrimaryContext(0))

We encounter the error. (image)

Torch just fetches the CUcontext created by Taichi, and that CUDA context is not synced.

That said, to work with PyTorch, we should pop Taichi's CUDA context at the end of ti.init. PyTorch would then create its own primary context. When Taichi needs the CUDA context in subsequent execution, it always sets the context to its own ctx pointer, so the pop is fine for Taichi.
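
To make the reasoning above concrete, here is a small pure-Python toy model of one thread's CUDA context stack. It does not call the CUDA driver, Taichi, or PyTorch; ThreadContextStack, taichi_first, and the context names "taichi" and "primary" are hypothetical and exist only for illustration. It assumes the documented behavior that the CUDA runtime reuses whatever context is current on the thread and falls back to the primary context only when none is current.

```python
class ThreadContextStack:
    """Toy model of one thread's CUDA context stack. Not the real driver API."""

    def __init__(self):
        self._stack = []

    def create(self, name):
        # Models cuCtxCreate: the new context becomes current on this thread.
        self._stack.append(name)

    def pop(self):
        # Models cuCtxPopCurrent: detach the current context from this thread.
        return self._stack.pop()

    def runtime_context(self):
        # Models the runtime API: reuse the current context if there is one,
        # otherwise make the device's primary context current.
        if not self._stack:
            self._stack.append("primary")
        return self._stack[-1]


def taichi_first(pop_after_init):
    thread = ThreadContextStack()
    thread.create("taichi")          # stands in for ti.init(arch=ti.gpu)
    if pop_after_init:
        thread.pop()                 # the proposed change at the end of ti.init
    return thread.runtime_context()  # stands in for PyTorch's first CUDA call


print(taichi_first(pop_after_init=False))  # taichi
print(taichi_first(pop_after_init=True))   # primary
```

In this model, without the pop the later runtime call ends up on the context Taichi created, which matches the failing order. With the pop it falls back to the primary context, which is the behavior the proposal is aiming for.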

turbo0628 avatar Aug 26 '22 07:08 turbo0628