Torch 1.12.0 and Taichi cannot use CUDA at the same time.
Describe the bug
If Taichi is initialized with the GPU backend, Torch cannot execute backward.
PyTorch version: 1.12.0
To Reproduce
import taichi as ti
import torch
device = torch.device("cuda:0")
ti.init(arch=ti.gpu)
x = torch.tensor([1.], requires_grad=True, device=device)
loss = x ** 2
loss.backward()
Log/Screenshots
[Taichi] version 1.0.4, llvm 10.0.0, commit 2827db2c, linux, python 3.9.12
[Taichi] Starting on arch=cuda
Traceback (most recent call last):
  File "******", line 10, in <module>
    loss.backward()
  File "******/lib/python3.9/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "******/lib/python3.9/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Event device type CUDA does not match blocking stream's device type CPU.
Additional comments
Output of ti diagnose:
[Taichi] version 1.0.4, llvm 10.0.0, commit 2827db2c, linux, python 3.9.12
*******************************************
** Taichi Programming Language **
*******************************************
Docs: https://docs.taichi-lang.org/
GitHub: https://github.com/taichi-dev/taichi/
Forum: https://forum.taichi.graphics/
Taichi system diagnose:
python: 3.9.12 (main, Jun 1 2022, 11:38:51)
[GCC 7.5.0]
system: linux
executable: /home/xuan/miniconda3/envs/dl/bin/python
platform: Linux-5.13.0-51-generic-x86_64-with-glibc2.31
architecture: 64bit ELF
uname: uname_result(system='Linux', node='Wanzi', release='5.13.0-51-generic', version='#58~20.04.1-Ubuntu SMP Tue Jun 14 11:29:12 UTC 2022', machine='x86_64')
locale: en_US.UTF-8
PATH: /home/xuan/miniconda3/envs/dl/bin:/home/xuan/miniconda3/condabin:/snap/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
PYTHONPATH: ['/home/xuan/miniconda3/envs/dl/bin', '/home/xuan/miniconda3/envs/dl/lib/python39.zip', '/home/xuan/miniconda3/envs/dl/lib/python3.9', '/home/xuan/miniconda3/envs/dl/lib/python3.9/lib-dynload', '/home/xuan/miniconda3/envs/dl/lib/python3.9/site-packages']
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.4 LTS
Release: 20.04
Codename: focal
import: <module 'taichi' from '/home/xuan/miniconda3/envs/dl/lib/python3.9/site-packages/taichi/__init__.py'>
cc: False
cpu: True
metal: False
opengl: True
cuda: True
MESA-INTEL: warning: Performance support disabled, consider sysctl dev.i915.perf_stream_paranoid=0
vulkan: True
`glewinfo` not available: [Errno 2] No such file or directory: 'glewinfo'
Sat Jul 23 20:38:44 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07 Driver Version: 515.48.07 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 On | N/A |
| 20% 37C P0 N/A / 75W | 241MiB / 4096MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1115 G /usr/lib/xorg/Xorg 35MiB |
| 0 N/A N/A 1711 G /usr/lib/xorg/Xorg 120MiB |
| 0 N/A N/A 1847 G /usr/bin/gnome-shell 9MiB |
| 0 N/A N/A 2284 G ...AAAAAAAAA= --shared-files 6MiB |
+-----------------------------------------------------------------------------+
[Taichi] version 1.0.4, llvm 10.0.0, commit 2827db2c, linux, python 3.9.12
[Taichi] version 1.0.4, llvm 10.0.0, commit 2827db2c, linux, python 3.9.12
[Taichi] Starting on arch=x64
[Taichi] version 1.0.4, llvm 10.0.0, commit 2827db2c, linux, python 3.9.12
[Taichi] Starting on arch=opengl
[Taichi] version 1.0.4, llvm 10.0.0, commit 2827db2c, linux, python 3.9.12
[Taichi] Starting on arch=cuda
[Taichi] version 1.0.4, llvm 10.0.0, commit 2827db2c, linux, python 3.9.12
*******************************************
** Taichi Programming Language **
*******************************************
Docs: https://docs.taichi-lang.org/
GitHub: https://github.com/taichi-dev/taichi/
Forum: https://forum.taichi.graphics/
TAICHI EXAMPLES
────────────────────────────────────────────────────────────────────────────────
0: ad_gravity 21: keyboard 42: odop_solar
1: comet 22: laplace 43: patterns
2: cornell_box 23: mandelbrot_zoom 44: pbf2d
3: diff_sph 24: marching_squares 45: physarum
4: euler 25: mass_spring_3d_ggui 46: print_offset
5: explicit_activation 26: mass_spring_game 47: rasterizer
6: export_mesh 27: mass_spring_game_ggui 48: regression
7: export_ply 28: mciso_advanced 49: sdf_renderer
8: export_videos 29: mgpcg 50: simple_derivative
9: fem128 30: mgpcg_advanced 51: simple_texture
10: fem128_ggui 31: minimal 52: simple_uv
11: fem99 32: minimization 53: stable_fluid
12: fractal 33: mpm128 54: stable_fluid_ggui
13: fractal3d_ggui 34: mpm128_ggui 55: stable_fluid_graph
14: fullscreen 35: mpm3d 56: taichi_bitmasked
15: game_of_life 36: mpm3d_ggui 57: taichi_dynamic
16: gui_image_io 37: mpm88 58: taichi_logo
17: gui_widgets 38: mpm88_graph 59: taichi_sparse
18: implicit_fem 39: mpm99 60: tutorial
19: implicit_mass_spring 40: mpm_lagrangian_forces 61: vortex_rings
20: initial_value_problem 41: nbody 62: waterwave
────────────────────────────────────────────────────────────────────────────────
Running example minimal ...
[Taichi] Starting on arch=x64
42.0
>>> Running time: 0.16s
42
Consider attaching this log when maintainers ask about system information.
>>> Running time: 9.21s
Taichi can work with PyTorch 1.10.0.
I reproduced the error too. @erizmr Can you look into this error?
I am looking into it.
FYI: https://github.com/taichi-dev/taichi/issues/2190 and https://github.com/taichi-dev/taichi/issues/4944
Hi, I have run into the same error. I've tried different PyTorch versions: 1.11 and 1.12 have this issue, while 1.10 does not.
I did some investigation, inspired by this PyTorch issue.
First, pip install cuda-python.
Full code with CUDA driver helpers:
import torch
import taichi as ti
from cuda import cuda, cudart


def ASSERT_DRV(err):
    """
    This is a helper function to turn CUDA messages into errors when
    appropriate, since by default the CUDA package doesn't raise
    Python errors, it returns error codes.
    """
    if isinstance(err, cuda.CUresult):
        if err != cuda.CUresult.CUDA_SUCCESS:
            raise RuntimeError("Cuda Error: {}".format(err))
    elif isinstance(err, cudart.cudaError_t):
        if err != cudart.cudaError_t.cudaSuccess:
            raise RuntimeError("Cudart Error: {}".format(err))
    else:
        raise RuntimeError("Unknown error type: {}".format(err))


def print_existing_contexts():
    valid_contexts = []
    while True:
        err, cuda_context = cuda.cuCtxPopCurrent()
        try:
            ASSERT_DRV(err)
        except RuntimeError:
            break
        else:
            valid_contexts.append(cuda_context)
    print("Existing, valid contexts: ", valid_contexts)
    # Push the popped contexts back so the context stack is left unchanged.
    for curr_ctx in reversed(valid_contexts):
        err, = cuda.cuCtxPushCurrent(curr_ctx)
        ASSERT_DRV(err)


device = torch.device("cuda:0")
print("===AFTER TORCH DEVICE INIT===")
print_existing_contexts()
print(torch._C._cuda_hasPrimaryContext(0))

x = torch.tensor([1.], requires_grad=True, device=device)
print("===AFTER TORCH TENSOR INIT===")
print_existing_contexts()

ti.init(arch=ti.gpu, log_level=ti.TRACE)
print("===AFTER TI INIT===")
print_existing_contexts()
print("Torch has primary context", torch._C._cuda_hasPrimaryContext(0))

loss = x**2
loss.backward()
print(torch._C._cuda_hasPrimaryContext(0))
This is the ordering that works. What we see from the log: Taichi ignored the PyTorch CUDA context and created its own.
If we change the initialization order:
ti.init(arch=ti.gpu, log_level=ti.TRACE)
print("===AFTER TI INIT===")
print_existing_contexts()
device = torch.device("cuda:0")
print("===AFTER TORCH DEVICE INIT===")
print_existing_contexts()
print(torch._C._cuda_hasPrimaryContext(0))
x = torch.tensor([1.], requires_grad=True, device=device)
print("===AFTER TORCH TENSOR INIT===")
print_existing_contexts()
# print("TAICHI "ti._lib.core.get_primary_ctx_state())
print("Torch has primary context", torch._C._cuda_hasPrimaryContext(0))
loss = x**2
loss.backward()
print(torch._C._cuda_hasPrimaryContext(0))
We encounter the error.
Torch just fetches the CUcontext created by Taichi, and that CUDA context is not synced.
That said, to work with PyTorch, we should pop Taichi's CUDA context at the end of ti.init. PyTorch would then create its own primary context. When Taichi needs the CUDA context in subsequent execution, it always sets the context to its own ctx pointer, so popping it out is fine for Taichi.
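Until such a change lands in Taichi itself, a possible user-side workaround along the same lines is to pop Taichi's context manually right after ti.init, using the cuda-python bindings installed above. The sketch below is based on this analysis and is not an official fix; it assumes ti.init leaves Taichi's context on top of the CUDA context stack.

import torch
import taichi as ti
from cuda import cuda

ti.init(arch=ti.gpu)

# Workaround sketch (assumption: ti.init leaves Taichi's context current).
# Pop it so that PyTorch creates and uses its own primary context; Taichi
# re-sets its own context pointer before each launch, so popping here
# should not affect Taichi itself.
err, _taichi_ctx = cuda.cuCtxPopCurrent()
assert err == cuda.CUresult.CUDA_SUCCESS

device = torch.device("cuda:0")
x = torch.tensor([1.], requires_grad=True, device=device)
loss = x ** 2
loss.backward()  # runs against PyTorch's own primary context
print(x.grad)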