Issue on Training Crash

Open jie-zh opened this issue 8 months ago • 2 comments

system configuration

  • ubuntu: 22.04
  • cuda: 12.4
  • nvidia driver: 550.120
  • isaaclab: 2.02
  • isaacsim: 4.5.0

The following is the crash report. Has anyone run into the same problem? At first, the crash happened after about 10k iterations or fewer. Based on the report, I tried several fixes: (1) switching the logging tool from TensorBoard to Neptune, (2) disabling the newest instruction set, and (3) changing the Ubuntu version to 20.04/24.04. None of them worked. Now the crash happens almost at the beginning of training, after fewer than about 1k iterations.

The only workaround that has worked so far is using Windows instead of Ubuntu. Training runs normally on Windows, which suggests that the hardware itself is fine.
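
For anyone who wants to rule out the instruction-set angle on the Linux side, here is a minimal sketch that lists which SIMD extensions the kernel reports for the CPU. An "Illegal instruction" signal can occur when a binary uses an instruction the CPU does not support, though that is only one possible cause and may not be what is happening here. The sketch assumes an x86 Linux host where /proc/cpuinfo has a "flags" line; the helper names and the list of flags to check are illustrative choices, not part of Isaac Lab or PyTorch.

```python
from pathlib import Path


def parse_cpu_flags(cpuinfo_text: str) -> set[str]:
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()  # no 'flags' line found (e.g. non-x86 host or empty input)


def report_simd_support(flags, wanted=("sse4_2", "avx", "avx2", "fma", "avx512f")):
    """Map each flag of interest (an illustrative selection) to True/False."""
    return {name: name in flags for name in wanted}


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")
    text = path.read_text() if path.exists() else ""
    for name, present in report_simd_support(parse_cpu_flags(text)).items():
        print(f"{name}: {'yes' if present else 'no'}")
```

If a flag that the installed binaries rely on shows up as "no", that would be worth mentioning in the report; if everything is present, the cause probably lies elsewhere.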

Fatal Python error: Illegal instruction

Thread 0x00007714d97fe6c0 (most recent call first):
  <no Python frame>

Thread 0x00007714d9fff6c0 (most recent call first):
  File "/home/agile/miniforge3/envs/env_isaaclab/lib/python3.10/threading.py", line 324 in wait
  File "/home/agile/miniforge3/envs/env_isaaclab/lib/python3.10/queue.py", line 180 in get
  File "/home/agile/miniforge3/envs/env_isaaclab/lib/python3.10/site-packages/tensorboard/summary/writer/event_file_writer.py", line 269 in _run
  File "/home/agile/miniforge3/envs/env_isaaclab/lib/python3.10/site-packages/tensorboard/summary/writer/event_file_writer.py", line 244 in run
  File "/home/agile/miniforge3/envs/env_isaaclab/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/home/agile/miniforge3/envs/env_isaaclab/lib/python3.10/threading.py", line 973 in _bootstrap

Current thread 0x0000771a4f0dd600 (most recent call first):
  File "/home/agile/miniforge3/envs/env_isaaclab/lib/python3.10/site-packages/torch/optim/adam.py", line 613 in _multi_tensor_adam
  File "/home/agile/miniforge3/envs/env_isaaclab/lib/python3.10/site-packages/torch/optim/adam.py", line 784 in adam
  File "/home/agile/miniforge3/envs/env_isaaclab/lib/python3.10/site-packages/torch/optim/optimizer.py", line 154 in maybe_fallback
  File "/home/agile/miniforge3/envs/env_isaaclab/lib/python3.10/site-packages/torch/optim/adam.py", line 223 in step
  File "/home/agile/miniforge3/envs/env_isaaclab/lib/python3.10/site-packages/torch/optim/optimizer.py", line 91 in _use_grad
  File "/home/agile/miniforge3/envs/env_isaaclab/lib/python3.10/site-packages/torch/optim/optimizer.py", line 487 in wrapper
  File "/home/agile/miniforge3/envs/env_isaaclab/lib/python3.10/site-packages/rsl_rl/algorithms/ppo.py", line 386 in update
  File "/home/agile/miniforge3/envs/env_isaaclab/lib/python3.10/site-packages/rsl_rl/runners/on_policy_runner.py", line 260 in learn
  File "/home/agile/robot_lab/scripts/rsl_rl/base/train.py", line 162 in main
  File "/home/agile/IsaacLab/source/isaaclab_tasks/isaaclab_tasks/utils/hydra.py", line 101 in hydra_main
  File "/home/agile/miniforge3/envs/env_isaaclab/lib/python3.10/site-packages/hydra/core/utils.py", line 186 in run_job
  File "/home/agile/miniforge3/envs/env_isaaclab/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 119 in run
  File "/home/agile/miniforge3/envs/env_isaaclab/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458 in <lambda>
  File "/home/agile/miniforge3/envs/env_isaaclab/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220 in run_and_report
  File "/home/agile/miniforge3/envs/env_isaaclab/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457 in _run_app
  File "/home/agile/miniforge3/envs/env_isaaclab/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394 in _run_hydra
  File "/home/agile/miniforge3/envs/env_isaaclab/lib/python3.10/site-packages/hydra/main.py", line 94 in decorated_main
  File "/home/agile/IsaacLab/source/isaaclab_tasks/isaaclab_tasks/utils/hydra.py", line 104 in wrapper
  File "/home/agile/robot_lab/scripts/rsl_rl/base/train.py", line 170 in <module>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, scipy._lib._ccallback_c, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg._matfuncs_expm, scipy.linalg._linalg_pythran, scipy.linalg.cython_blas, scipy.linalg._decomp_update, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.linalg._propack._spropack, scipy.sparse.linalg._propack._dpropack, scipy.sparse.linalg._propack._cpropack, scipy.sparse.linalg._propack._zpropack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, scipy.spatial._ckdtree, scipy._lib.messagestream, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.spatial.transform._rotation, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, hid, PIL._imaging, scipy.optimize._group_columns, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, 
scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize._cython_nnls, scipy._lib._uarray._uarray, scipy.linalg._decomp_interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.optimize._direct, psutil._psutil_linux, psutil._psutil_posix, lxml._elementpath, lxml.etree, scipy.ndimage._nd_image, scipy.ndimage._rank_filter_1d, _ni_label, scipy.ndimage._ni_label, yaml._yaml, scipy.interpolate._fitpack, scipy.interpolate._dfitpack, scipy.interpolate._dierckx, scipy.interpolate._ppoly, scipy.interpolate._interpnd, scipy.interpolate._rbfinterp_pythran, scipy.interpolate._rgi_cython, scipy.interpolate._bspl, h5py._errors, h5py.defs, h5py._objects, h5py.h5, h5py.utils, h5py.h5t, h5py.h5s, h5py.h5ac, h5py.h5p, h5py.h5r, h5py._proxy, h5py._conv, h5py.h5z, h5py.h5a, h5py.h5d, h5py.h5ds, h5py.h5g, h5py.h5i, h5py.h5o, h5py.h5f, h5py.h5fd, h5py.h5pl, h5py.h5l, h5py._selector, kiwisolver._cext, PIL._imagingft, google.protobuf.pyext._message (total: 126)
Illegal instruction (core dumped)
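
When comparing several reports like this one, it can help to pull out the innermost Python frame of the crashing thread automatically. The following is a minimal sketch that does so for a dump in the format shown above (a "Current thread" header followed by `File "...", line N in func` lines); the function name and the shortened sample text are assumptions made for illustration.

```python
import re

# Matches frame lines such as:   File "/path/to/file.py", line 613 in func_name
FRAME_RE = re.compile(r'^\s*File "(?P<path>.+)", line (?P<line>\d+) in (?P<func>.+)$')


def innermost_frame_of_current_thread(dump: str):
    """Return (path, line, func) for the first frame under 'Current thread', or None."""
    in_current = False
    for raw in dump.splitlines():
        if raw.startswith("Current thread"):
            in_current = True
            continue
        if in_current:
            match = FRAME_RE.match(raw)
            if match:
                return match["path"], int(match["line"]), match["func"]
            if not raw.strip():
                break  # blank line ends the current thread's frame list
    return None


# Shortened excerpt of the report above, used only as sample input.
SAMPLE = """Fatal Python error: Illegal instruction

Thread 0x00007714d97fe6c0 (most recent call first):
  <no Python frame>

Current thread 0x0000771a4f0dd600 (most recent call first):
  File "/home/agile/miniforge3/envs/env_isaaclab/lib/python3.10/site-packages/torch/optim/adam.py", line 613 in _multi_tensor_adam
  File "/home/agile/miniforge3/envs/env_isaaclab/lib/python3.10/site-packages/torch/optim/adam.py", line 784 in adam
"""

frame = innermost_frame_of_current_thread(SAMPLE)
print(frame)
```

For this report the innermost Python frame is `_multi_tensor_adam` in `torch/optim/adam.py`, i.e. the crash occurs during the optimizer step rather than in the TensorBoard writer thread.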

jie-zh, Apr 23 '25 08:04

Thank you for posting this. Could you please try the recommended driver? Also, could you share what you are running from the terminal?

RandomOakForest, Apr 25 '25 21:04

Thanks for the response. I have also tried drivers 550.144 and 560, since the documentation recommends the latest driver. The command I run from the terminal is the usual one: "./isaaclab.sh -p scripts/reinforcement_learning/rsl_rl/train.py --task Isaac-Velocity-Flat-G1-v0 --headless". I made some changes to the rewards in the environment.

jie-zh, Apr 27 '25 01:04

Hi, I have a similar issue. Did you solve this problem?

orangelee89, May 19 '25 22:05

Following up, if you still face this issue with the 535.129.03 driver, please open a new issue as a bug report. I will close this issue for now. Thank you for your interest in Isaac Lab.

RandomOakForest, Jun 03 '25 20:06