
[Bug]: CUDA_ERROR_ILLEGAL_ADDRESS when using RTX 5090 GPU

Open pjw971022 opened this issue 11 months ago • 7 comments

Bug Description

When I ran Genesis/examples/locomotion/go2_train.py, I encountered the following error. The script runs fine on an RTX 4060 Ti, but this error occurs on an RTX 5090. Has anyone experienced a similar issue, and if so, how did you solve it? No combination of package versions seems to work.

Additional findings

  1. Running scene.add_entity(gs.morphs.URDF(file="urdf/plane/plane.urdf", fixed=True)) triggers the error, but replacing that line with self.scene.add_entity(gs.morphs.Plane()) avoids it.
  2. With 1,024 environments, training progresses normally; with 2,048 environments, either objects fall through the plane and drop to the floor, or memory issues arise and the process terminates.
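A minimal sketch of the workaround from finding 1. The surrounding context (the self.scene attribute in go2_env.py) is assumed, not copied from the repository:

```python
# Sketch of the workaround from finding 1 (context is assumed, not
# copied verbatim from go2_env.py).

# Triggers CUDA_ERROR_ILLEGAL_ADDRESS on the RTX 5090:
# self.scene.add_entity(gs.morphs.URDF(file="urdf/plane/plane.urdf", fixed=True))

# Avoids the crash: use the analytic plane primitive instead of the URDF asset.
self.scene.add_entity(gs.morphs.Plane())
```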

Steps to Reproduce

python Genesis/examples/locomotion/go2_train.py

Expected Behavior

The script (Genesis/examples/locomotion/go2_train.py) should run without CUDA errors on the RTX 5090, just as it does on the RTX 4060 Ti. Training should proceed without any CUDA_ERROR_ILLEGAL_ADDRESS errors.

Screenshots/Videos

No response

Relevant log output


 [E 05/19/25 10:42:14.988 189835] [cuda_driver.h:operator()@92] CUDA Error CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered while calling stream_synchronize (cuStreamSynchronize)


Traceback (most recent call last):
  File "/home/-/workspace/Genesis/examples/locomotion/go2_train.py", line 180, in <module>
    main()
  File "/home/-/workspace/Genesis/examples/locomotion/go2_train.py", line 176, in main
    runner.learn(num_learning_iterations=args.max_iterations, init_at_random_ep_len=True)
  File "/home/-/.local/lib/python3.10/site-packages/rsl_rl/runners/on_policy_runner.py", line 151, in learn
    obs, rewards, dones, infos = self.env.step(actions.to(self.env.device))
  File "/home/-/workspace/Genesis/examples/locomotion/go2_env.py", line 129, in step
    self.base_pos[:] = self.robot.get_pos()
  File "/home/-/workspace/Genesis/genesis/utils/misc.py", line 72, in wrapper
    return method(self, *args, **kwargs)
  File "/home/-/workspace/Genesis/genesis/engine/entities/rigid_entity/rigid_entity.py", line 1672, in get_pos
    return self._solver.get_links_pos(self._base_links_idx, envs_idx, unsafe=unsafe).squeeze(-2)
  File "/home/-/workspace/Genesis/genesis/engine/solvers/rigid/rigid_solver_decomp.py", line 4483, in get_links_pos
    tensor = ti_field_to_torch(self.links_state.pos, envs_idx, links_idx, transpose=True, unsafe=unsafe)
  File "/home/-/workspace/Genesis/genesis/utils/misc.py", line 450, in ti_field_to_torch
    ti.sync()
  File "/home/-/.local/lib/python3.10/site-packages/taichi/lang/runtime_ops.py", line 8, in sync
    impl.get_runtime().sync()
  File "/home/-/.local/lib/python3.10/site-packages/taichi/lang/impl.py", line 499, in sync
    self.prog.synchronize()
RuntimeError: [cuda_driver.h:operator()@92] CUDA Error CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered while calling stream_synchronize (cuStreamSynchronize)

[Genesis] [10:42:14] [ERROR] RuntimeError: [cuda_driver.h:operator()@92] CUDA Error CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered while calling stream_synchronize (cuStreamSynchronize)
[E 05/19/25 10:42:15.164 189835] [cuda_driver.h:operator()@92] CUDA Error CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered while calling stream_synchronize (cuStreamSynchronize)


Exception ignored in atexit callback: <function destroy at 0x724b395d4550>
Traceback (most recent call last):
  File "/home/-/workspace/Genesis/genesis/__init__.py", line 271, in destroy
    ti.reset()
  File "/home/-/.local/lib/python3.10/site-packages/taichi/lang/misc.py", line 220, in reset
    impl.reset()
  File "/home/-/.local/lib/python3.10/site-packages/taichi/lang/impl.py", line 512, in reset
    pytaichi.clear()
  File "/home/-/.local/lib/python3.10/site-packages/taichi/lang/impl.py", line 492, in clear
    self.prog.finalize()
RuntimeError: [cuda_driver.h:operator()@92] CUDA Error CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered while calling stream_synchronize (cuStreamSynchronize)
[E 05/19/25 10:42:15.569 189835] [cuda_driver.h:operator()@92] CUDA Error CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered while calling stream_synchronize (cuStreamSynchronize)


[E 05/19/25 10:42:15.569 189835] [cuda_driver.h:operator()@92] CUDA Error CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered while calling mem_free (cuMemFree_v2)


terminate called after throwing an instance of 'std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >'

Environment

  • OS: Ubuntu 24.04, 22.04
  • GPU/CPU: RTX 5090 / Intel(R) Core(TM) Ultra 7 265K
  • GPU-driver version: 570.144
  • CUDA / CUDA-toolkit version: 12.8
  • torch ver.: 2.7.0+cu128

Release version or Commit ID

v0.2.1-312-g37c1ce6

Additional Context

No response

pjw971022 avatar May 19 '25 04:05 pjw971022

I am having the same issue on my 5080 GPU

moribots avatar May 20 '25 05:05 moribots

I believe that if you install via pip rather than from GitHub, Genesis will work on 50-series hardware.

bb7332193 avatar May 20 '25 23:05 bb7332193

I believe the pip version has a different issue: https://github.com/Genesis-Embodied-AI/Genesis/issues/1156

moribots avatar May 22 '25 03:05 moribots

I tried the alternative install from GitHub (pip install git+https://github.com/Genesis-Embodied-AI/Genesis.git) and got

RuntimeError: [cuda_driver.h:operator()@92] CUDA Error CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered while calling stream_synchronize (cuStreamSynchronize)
[E 05/21/25 23:22:19.256 11897] [cuda_driver.h:operator()@92] CUDA Error CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered while calling stream_synchronize (cuStreamSynchronize)

by running https://github.com/Genesis-Embodied-AI/Genesis/blob/main/examples/speed_benchmark/franka.py


# https://github.com/Genesis-Embodied-AI/Genesis/blob/main/examples/speed_benchmark/franka.py

import torch
import genesis as gs

########################## init ##########################
gs.init(backend=gs.gpu)

########################## create a scene ##########################
scene = gs.Scene(
    show_viewer=False,
    viewer_options=gs.options.ViewerOptions(
        camera_pos=(3.5, -1.0, 2.5),
        camera_lookat=(0.0, 0.0, 0.5),
        camera_fov=40,
        res=(1920, 1080),
    ),
    rigid_options=gs.options.RigidOptions(
        dt=0.01,
    ),
)

########################## entities ##########################
plane = scene.add_entity(
    gs.morphs.Plane(),
)

franka = scene.add_entity(
    gs.morphs.MJCF(file="xml/franka_emika_panda/panda.xml"),
)

########################## build ##########################

# create 4096 parallel environments
B = 4096
scene.build(n_envs=B, env_spacing=(1.0, 1.0))

# control all the robots
# with the following control: 43M FPS
# without the following control (arm in collision with the floor): 32M FPS
franka.control_dofs_position(
    torch.tile(torch.tensor([0, 0, 0, -1.0, 0, 0, 0,
               0.02, 0.02], device=gs.device), (B, 1)),
)

for i in range(1000):
    scene.step()

moribots avatar May 22 '25 03:05 moribots

Interestingly, if I reduce the number of envs to 2048, it no longer produces this error. However, this leaves plenty of available GPU memory on the table, so it's not ideal.

moribots avatar May 22 '25 03:05 moribots

I can confirm that with a smaller batch size (1024 in my case) the error does not appear.

Kashu7100 avatar May 22 '25 18:05 Kashu7100
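Since the failure threshold apparently sits somewhere between 2,048 and 4,096 environments, one way to pin it down without hand-guessing powers of two is a bisection over n_envs. This is a hedged sketch, not a Genesis API: trial is a hypothetical callable that returns True when a given environment count builds and steps cleanly (for example, by launching the script in a subprocess and checking its exit code).

```python
# Sketch: bisect the largest environment count that runs without the
# illegal-address crash. `trial(n)` is a hypothetical stand-in for
# "build the scene with n_envs=n, step it, and report success".

def max_stable_envs(trial, lo=1, hi=4096):
    """Binary search for the largest n in [lo, hi] where trial(n) succeeds."""
    best = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if trial(mid):
            best = mid
            lo = mid + 1   # mid works; try larger
        else:
            hi = mid - 1   # mid crashes; try smaller
    return best

# Example with a fake trial mirroring the reports above
# (2048 envs fine, 4096 envs crashing):
print(max_stable_envs(lambda n: n <= 2048))  # -> 2048
```

Each probe costs one full build-and-step attempt, so the search finishes in about a dozen runs instead of trying every candidate count.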

It's a Taichi bug. We have made a minimal example and filed an issue with Taichi. We are also working on it ourselves in parallel.

https://github.com/taichi-dev/taichi/issues/8730

YilingQiao avatar May 27 '25 03:05 YilingQiao

I also experienced this on an NVIDIA L40S. Clearing caches with gs clean helped in my case.

Milotrince avatar Jun 14 '25 00:06 Milotrince

@Milotrince For the 5090, @hughperkins found that the bug is on NVIDIA's side and reported it to them. It can be reproduced with pure CUDA code. Here are the instructions for reproduction:

https://github.com/hughperkins/taichi-play/tree/main/run_ir/nvidia_bug_report_for_8730

Could you also try it on your L40S?

YilingQiao avatar Jun 14 '25 00:06 YilingQiao

Sorry, the L40S machines are on a cluster; I've tried but haven't been able to reproduce the issue for some reason.

Milotrince avatar Jun 17 '25 20:06 Milotrince

NVIDIA states that they have fixed the bug in their codebase.

  • The fix will be released in CUDA 13.1.
  • Since we are using JIT, it might be fixed for our purposes in the next driver release (~4 weeks?).

hughperkins avatar Jun 24 '25 19:06 hughperkins
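Since the fix is expected to arrive via a driver update, a small version check can gate any workaround (e.g. capping n_envs) until the fixed driver is installed. A sketch, with assumptions flagged: the installed version can be read via nvidia-smi, and MIN_FIXED_DRIVER is a placeholder, since the exact fixed driver version was not known at the time of this thread.

```python
# Sketch: compare the installed NVIDIA driver version against a minimum.
# The installed version string can be obtained with, e.g.:
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader
# MIN_FIXED_DRIVER is a placeholder, NOT a confirmed fix version.

def parse_version(v: str) -> tuple:
    """Turn '570.144' or '580.65.06' into a comparable tuple of ints."""
    return tuple(int(part) for part in v.split("."))

def driver_at_least(installed: str, required: str) -> bool:
    """True if the installed driver is at or above the required version."""
    return parse_version(installed) >= parse_version(required)

MIN_FIXED_DRIVER = "999.0"  # placeholder until NVIDIA publishes the fixed version

# The reporter's driver (570.144) predates any fix:
print(driver_at_least("570.144", MIN_FIXED_DRIVER))  # -> False
```

Tuple comparison handles two- and three-component version strings uniformly, so "580.65.06" compares correctly against "575.0".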

Note: possibly fixed in the 180.x.x driver release? (I ran some initial smoke tests and it worked for me, but I haven't rigorously tested the specific failures from this GitHub issue.)

hughperkins avatar Sep 04 '25 12:09 hughperkins