Issue on Training Crash
system configuration
- ubuntu: 22.04
- cuda: 12.4
- nvidia driver: 550.120
- isaaclab: 2.02
- isaacsim: 4.5.0
The crash report is below. Has anyone run into the same problem? At first, the crash happened after about 10k iterations or fewer. Based on the report, I tried several fixes: (1) switching the logging tool from TensorBoard to Neptune, (2) disabling the newest instruction set, and (3) changing the Ubuntu version to 20.04/24.04. None of them worked. Now the crash happens almost at the start of training, after fewer than about 1k iterations.
The only workaround that has worked so far is using Windows instead of Ubuntu. Training runs normally on Windows, which indicates that the hardware itself works.
Fatal Python error: Illegal instruction
Thread 0x00007714d97fe6c0 (most recent call first):
<no Python frame>
Thread 0x00007714d9fff6c0 (most recent call first):
File "/home/agile/miniforge3/envs/env_isaaclab/lib/python3.10/threading.py", line 324 in wait
File "/home/agile/miniforge3/envs/env_isaaclab/lib/python3.10/queue.py", line 180 in get
File "/home/agile/miniforge3/envs/env_isaaclab/lib/python3.10/site-packages/tensorboard/summary/writer/event_file_writer.py", line 269 in _run
File "/home/agile/miniforge3/envs/env_isaaclab/lib/python3.10/site-packages/tensorboard/summary/writer/event_file_writer.py", line 244 in run
File "/home/agile/miniforge3/envs/env_isaaclab/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/home/agile/miniforge3/envs/env_isaaclab/lib/python3.10/threading.py", line 973 in _bootstrap
Current thread 0x0000771a4f0dd600 (most recent call first):
File "/home/agile/miniforge3/envs/env_isaaclab/lib/python3.10/site-packages/torch/optim/adam.py", line 613 in _multi_tensor_adam
File "/home/agile/miniforge3/envs/env_isaaclab/lib/python3.10/site-packages/torch/optim/adam.py", line 784 in adam
File "/home/agile/miniforge3/envs/env_isaaclab/lib/python3.10/site-packages/torch/optim/optimizer.py", line 154 in maybe_fallback
File "/home/agile/miniforge3/envs/env_isaaclab/lib/python3.10/site-packages/torch/optim/adam.py", line 223 in step
File "/home/agile/miniforge3/envs/env_isaaclab/lib/python3.10/site-packages/torch/optim/optimizer.py", line 91 in _use_grad
File "/home/agile/miniforge3/envs/env_isaaclab/lib/python3.10/site-packages/torch/optim/optimizer.py", line 487 in wrapper
File "/home/agile/miniforge3/envs/env_isaaclab/lib/python3.10/site-packages/rsl_rl/algorithms/ppo.py", line 386 in update
File "/home/agile/miniforge3/envs/env_isaaclab/lib/python3.10/site-packages/rsl_rl/runners/on_policy_runner.py", line 260 in learn
File "/home/agile/robot_lab/scripts/rsl_rl/base/train.py", line 162 in main
File "/home/agile/IsaacLab/source/isaaclab_tasks/isaaclab_tasks/utils/hydra.py", line 101 in hydra_main
File "/home/agile/miniforge3/envs/env_isaaclab/lib/python3.10/site-packages/hydra/core/utils.py", line 186 in run_job
File "/home/agile/miniforge3/envs/env_isaaclab/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 119 in run
File "/home/agile/miniforge3/envs/env_isaaclab/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458 in <lambda>
File "/home/agile/miniforge3/envs/env_isaaclab/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220 in run_and_report
File "/home/agile/miniforge3/envs/env_isaaclab/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457 in _run_app
File "/home/agile/miniforge3/envs/env_isaaclab/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394 in _run_hydra
File "/home/agile/miniforge3/envs/env_isaaclab/lib/python3.10/site-packages/hydra/main.py", line 94 in decorated_main
File "/home/agile/IsaacLab/source/isaaclab_tasks/isaaclab_tasks/utils/hydra.py", line 104 in wrapper
File "/home/agile/robot_lab/scripts/rsl_rl/base/train.py", line 170 in <module>
Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, scipy._lib._ccallback_c, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg._matfuncs_expm, scipy.linalg._linalg_pythran, scipy.linalg.cython_blas, scipy.linalg._decomp_update, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.linalg._propack._spropack, scipy.sparse.linalg._propack._dpropack, scipy.sparse.linalg._propack._cpropack, scipy.sparse.linalg._propack._zpropack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, scipy.spatial._ckdtree, scipy._lib.messagestream, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.spatial.transform._rotation, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, hid, PIL._imaging, scipy.optimize._group_columns, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, 
scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize._cython_nnls, scipy._lib._uarray._uarray, scipy.linalg._decomp_interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.optimize._direct, psutil._psutil_linux, psutil._psutil_posix, lxml._elementpath, lxml.etree, scipy.ndimage._nd_image, scipy.ndimage._rank_filter_1d, _ni_label, scipy.ndimage._ni_label, yaml._yaml, scipy.interpolate._fitpack, scipy.interpolate._dfitpack, scipy.interpolate._dierckx, scipy.interpolate._ppoly, scipy.interpolate._interpnd, scipy.interpolate._rbfinterp_pythran, scipy.interpolate._rgi_cython, scipy.interpolate._bspl, h5py._errors, h5py.defs, h5py._objects, h5py.h5, h5py.utils, h5py.h5t, h5py.h5s, h5py.h5ac, h5py.h5p, h5py.h5r, h5py._proxy, h5py._conv, h5py.h5z, h5py.h5a, h5py.h5d, h5py.h5ds, h5py.h5g, h5py.h5i, h5py.h5o, h5py.h5f, h5py.h5fd, h5py.h5pl, h5py.h5l, h5py._selector, kiwisolver._cext, PIL._imagingft, google.protobuf.pyext._message (total: 126)
Illegal instruction (core dumped)
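For anyone comparing machines with the same error: "Illegal instruction" is what Python's fault handler prints when the process receives SIGILL, meaning the CPU was asked to execute an opcode it does not support or that is disabled. Below is a minimal sketch, not a fix, for listing which instruction-set flags the Linux kernel reports for the CPU. The helper names and the list of flags are assumptions chosen for illustration; nothing in the report above says which instruction is involved.

```python
from pathlib import Path

# Illustrative selection only (an assumption): the crash report does not
# identify which instruction-set extension, if any, is responsible.
FLAGS_OF_INTEREST = ("avx", "avx2", "fma", "avx512f", "avx512_bf16", "amx_tile")


def parse_cpu_flags(cpuinfo_text: str) -> set[str]:
    """Return the flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()  # no 'flags' line found (e.g. empty input)


def report(flags: set[str], wanted=FLAGS_OF_INTEREST) -> dict[str, bool]:
    """Map each flag of interest to whether the kernel reports it."""
    return {name: name in flags for name in wanted}


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        for name, present in report(parse_cpu_flags(cpuinfo.read_text())).items():
            print(f"{name:12s} {'yes' if present else 'no'}")
    else:
        print("/proc/cpuinfo not found (this sketch is Linux-only)")
```

Posting this output next to the crash report makes it easier to compare CPUs across the affected machines.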
Thank you for posting this. Could you please try the recommended driver? Also, could you share the command you are running from the terminal?
Thanks for the response. I have also tried drivers 550.144 and 560, since the documentation recommends the latest driver. The command I run from the terminal is the usual one: "./isaaclab.sh -p scripts/reinforcement_learning/rsl_rl/train.py --task Isaac-Velocity-Flat-G1-v0 --headless". I have made some changes to the rewards in the environment.
Hi, I have a similar issue. Did you solve this problem?
Following up, if you still face this issue with the 535.129.03 driver, please open a new issue as a bug report. I will close this issue for now. Thank you for your interest in Isaac Lab.