
【bug】Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1720538435607/work/c10/cuda/CUDAException.cpp:43 (most recent call first):

Open yiyepiaoling0715 opened this issue 10 months ago • 8 comments

How to fix this bug?

(WorkerDict pid=53890) (WorkerDict pid=53890) Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1720538435607/work/c10/cuda/CUDAException.cpp:43 (most recent call first): (WorkerDict pid=53890) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fae14cabf86 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so) (WorkerDict pid=53890) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fae14c5ad10 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so) (WorkerDict pid=53890) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fae14d87f08 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10_cuda.so) (WorkerDict pid=53890) frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fadb91a8bc6 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=53890) frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7fadb91adde0 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=53890) frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7fadb91b4a9a in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=53890) frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fadb91b6edc in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=53890) frame #7: <unknown function> + 0xdbbf4 (0x7fc6d77c9bf4 in /opt/conda/bin/../lib/libstdc++.so.6) (WorkerDict pid=53890) frame #8: <unknown function> + 0x94ac3 (0x7fc6d9bf0ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) (WorkerDict pid=53890) frame #9: <unknown function> + 0x126850 (0x7fc6d9c82850 in /usr/lib/x86_64-linux-gnu/libc.so.6) (WorkerDict pid=53890) (WorkerDict pid=53890) [2025-02-19 09:20:53,418 E 53890 59226] logging.cc:108: Unhandled exception: N3c1016DistBackendErrorE. what(): [PG 9 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered (WorkerDict pid=53890) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. (WorkerDict pid=53890) For debugging consider passing CUDA_LAUNCH_BLOCKING=1 (WorkerDict pid=53890) Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
(WorkerDict pid=53890) (WorkerDict pid=53890) Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1720538435607/work/c10/cuda/CUDAException.cpp:43 (most recent call first): (WorkerDict pid=53890) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fae14cabf86 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so) (WorkerDict pid=53890) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fae14c5ad10 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so) (WorkerDict pid=53890) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fae14d87f08 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10_cuda.so) (WorkerDict pid=53890) frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fadb91a8bc6 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=53890) frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7fadb91adde0 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=53890) frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7fadb91b4a9a in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=53890) frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fadb91b6edc in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=53890) frame #7: <unknown function> + 0xdbbf4 (0x7fc6d77c9bf4 in /opt/conda/bin/../lib/libstdc++.so.6) (WorkerDict pid=53890) frame #8: <unknown function> + 0x94ac3 (0x7fc6d9bf0ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) (WorkerDict pid=53890) frame #9: <unknown function> + 0x126850 (0x7fc6d9c82850 in /usr/lib/x86_64-linux-gnu/libc.so.6) (WorkerDict pid=53890) (WorkerDict pid=53890) Exception raised from ncclCommWatchdog at /opt/conda/conda-bld/pytorch_1720538435607/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first): (WorkerDict pid=53890) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fae14cabf86 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so) (WorkerDict pid=53890) frame #1: <unknown function> + 0xe3ec34 (0x7fadb8e36c34 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=53890) frame #2: <unknown function> + 0xdbbf4 (0x7fc6d77c9bf4 in /opt/conda/bin/../lib/libstdc++.so.6) (WorkerDict pid=53890) frame #3: <unknown function> + 0x94ac3 (0x7fc6d9bf0ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) (WorkerDict pid=53890) frame #4: <unknown function> + 0x126850 (0x7fc6d9c82850 in /usr/lib/x86_64-linux-gnu/libc.so.6) (WorkerDict pid=53890) (WorkerDict pid=53890) [2025-02-19 09:20:53,441 E 53890 59226] logging.cc:115: Stack trace: (WorkerDict pid=53890) /opt/conda/lib/python3.11/site-packages/ray/_raylet.so(+0x135bf3a) [0x7fc6d8c6cf3a] ray::operator<<() (WorkerDict pid=53890) /opt/conda/lib/python3.11/site-packages/ray/_raylet.so(+0x135f4c2) [0x7fc6d8c704c2] ray::TerminateHandler() (WorkerDict pid=53890) /opt/conda/bin/../lib/libstdc++.so.6(+0xb135a) [0x7fc6d779f35a] __cxxabiv1::__terminate() (WorkerDict pid=53890) /opt/conda/bin/../lib/libstdc++.so.6(+0xb13c5) [0x7fc6d779f3c5] (WorkerDict pid=53890) /opt/conda/bin/../lib/libstdc++.so.6(+0xb134f) [0x7fc6d779f34f] (WorkerDict pid=53890) /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so(+0xe3ece5) [0x7fadb8e36ce5] 
c10d::ProcessGroupNCCL::ncclCommWatchdog() (WorkerDict pid=53890) /opt/conda/bin/../lib/libstdc++.so.6(+0xdbbf4) [0x7fc6d77c9bf4] execute_native_thread_routine (WorkerDict pid=53890) /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fc6d9bf0ac3] (WorkerDict pid=53890) /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7fc6d9c82850] (WorkerDict pid=53890) (WorkerDict pid=53890) *** SIGABRT received at time=1739956853 on cpu 100 *** (WorkerDict pid=53890) PC: @ 0x7fc6d9bf29fc (unknown) pthread_kill (WorkerDict pid=53890) @ 0x7fc6d9b9e520 (unknown) (unknown) (WorkerDict pid=53890) [2025-02-19 09:20:53,441 E 53890 59226] logging.cc:460: *** SIGABRT received at time=1739956853 on cpu 100 *** (WorkerDict pid=53890) [2025-02-19 09:20:53,441 E 53890 59226] logging.cc:460: PC: @ 0x7fc6d9bf29fc (unknown) pthread_kill (WorkerDict pid=53890) [2025-02-19 09:20:53,441 E 53890 59226] logging.cc:460: @ 0x7fc6d9b9e520 (unknown) (unknown) (WorkerDict pid=53890) Fatal Python error: Aborted (WorkerDict pid=53890) (WorkerDict pid=53890) (WorkerDict pid=53890) Extension modules: msgpack._cmsgpack, google._upb._message, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, _brotli, zstandard.backend_c, charset_normalizer.md, uvloop.loop, ray._raylet, numpy._core._multiarray_umath, numpy.linalg._umath_linalg, pyarrow.lib, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, gmpy2.gmpy2, markupsafe._speedups, PIL._imaging, sklearn.__check_build._check_build, scipy._lib._ccallback_c, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg._matfuncs_expm, scipy.linalg._linalg_pythran, scipy.linalg.cython_blas, scipy.linalg._decomp_update, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.linalg._propack._spropack, scipy.sparse.linalg._propack._dpropack, scipy.sparse.linalg._propack._cpropack, scipy.sparse.linalg._propack._zpropack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, 
scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.spatial._ckdtree, scipy._lib.messagestream, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.spatial.transform._rotation, scipy.optimize._group_columns, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize._cython_nnls, scipy._lib._uarray._uarray, scipy.linalg._decomp_interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.optimize._direct, scipy.integrate._odepack, scipy.integrate._quadpack, scipy.integrate._vode, scipy.integrate._dop, scipy.integrate._lsoda, scipy.interpolate._fitpack, scipy.interpolate._dfitpack, scipy.interpolate._dierckx, scipy.interpolate._ppoly, scipy.interpolate._interpnd, scipy.interpolate._rbfinterp_pythran, scipy.interpolate._rgi_cython, scipy.interpolate._bspl, scipy.special.cython_special, scipy.stats._stats, scipy.stats._sobol, scipy.stats._qmc_cy, scipy.stats._biasedurn, scipy.stats._stats_pythran, scipy.stats._levy_stable.levyst, scipy.stats._ansari_swilk_statistics, scipy.stats._mvn, scipy.stats._rcont.rcont, scipy.ndimage._nd_image, scipy.ndimage._rank_filter_1d, _ni_label, scipy.ndimage._ni_label, sklearn.utils._isfinite, sklearn.utils.sparsefuncs_fast, sklearn.utils.murmurhash, sklearn.utils._openmp_helpers, sklearn.metrics.cluster._expected_mutual_info_fast, sklearn.preprocessing._csr_polynomial_expansion, sklearn.preprocessing._target_encoder_fast, sklearn.metrics._dist_metrics, sklearn.metrics._pairwise_distances_reduction._datasets_pair, sklearn.utils._cython_blas, sklearn.metrics._pairwise_distances_reduction._base, sklearn.metrics._pairwise_distances_reduction._middle_term_computer, sklearn.utils._heap, sklearn.utils._sorting, sklearn.metrics._pairwise_distances_reduction._argkmin, sklearn.metrics._pairwise_distances_reduction._argkmin_classmode, sklearn.utils._vector_sentinel, sklearn.metrics._pairwise_distances_reduction._radius_neighbors, sklearn.metrics._pairwise_distances_reduction._radius_neighbors_classmode, sklearn.metrics._pairwise_fast, PIL._imagingft, regex._regex, msgspec._core, sentencepiece._sentencepiece, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, pyarrow._json, zmq.backend.cython._zmq, cuda_utils, __triton_launcher (total: 194) (WorkerDict pid=53889) /opt/conda/lib/python3.11/site-packages/torch/utils/checkpoint.py:1399: FutureWarning:torch.cpu.amp.autocast(args...)is deprecated. Please usetorch.amp.autocast('cpu', args...)` instead. [repeated 3x across cluster] (WorkerDict pid=53889) with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context: # type: ignore[attr-defined] [repeated 3x across cluster] (WorkerDict pid=53890) /opt/conda/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown (WorkerDict pid=53890) warnings.warn('resource_tracker: There appear to be %d ' (raylet) A worker died or was killed while executing a task by an unexpected system error. 
To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff2cc53a05de78c1613ae93ae601000000 Worker ID: 37813cdeb20edfd8cff6f053c4c774a3ae014b39802e760a8eaf7f5d Node ID: dca84f002f26381d2bd5646e02e8e3486dd7204c1f06289f35c04937 Worker IP address: 10.49.104.60 Worker port: 44813 Worker PID: 53890 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors. Error executing job with overrides: ['algorithm.adv_estimator=grpo', 'data.train_files=/lpai/volumes/jfs-sc-ep-lf/corpus/public_data/rl_data/verl_gsm8k/train.parquet', 'data.val_files=/lpai/volumes/jfs-sc-ep-lf/corpus/public_data/rl_data/verl_gsm8k/test.parquet', 'data.train_batch_size=1024', 'data.val_batch_size=1312', 'data.max_prompt_length=512', 'data.max_response_length=1024', 'actor_rollout_ref.model.path=/lpai/volumes/zxd-code-complete-lf/data/models/qwen/qwen__qwen2_5-0_5b/24-09-25-1232', 'actor_rollout_ref.actor.optim.lr=1e-6', 'actor_rollout_ref.model.use_remove_padding=True', 'actor_rollout_ref.actor.ppo_mini_batch_size=64', 'actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2', 'actor_rollout_ref.actor.use_kl_loss=True', 'actor_rollout_ref.actor.kl_loss_coef=0.001', 'actor_rollout_ref.actor.kl_loss_type=low_var_kl', 'actor_rollout_ref.model.enable_gradient_checkpointing=True', 'actor_rollout_ref.actor.fsdp_config.param_offload=True', 'actor_rollout_ref.actor.fsdp_config.grad_offload=True', 'actor_rollout_ref.actor.fsdp_config.optimizer_offload=True', 'actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2', 'actor_rollout_ref.rollout.tensor_model_parallel_size=2', 'actor_rollout_ref.rollout.name=vllm', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.6', 'actor_rollout_ref.rollout.n=5', 'actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=2', 'actor_rollout_ref.ref.fsdp_config.param_offload=True', 'algorithm.kl_ctrl.kl_coef=0.001', 'trainer.default_local_dir=/lpai/volumes/jfs-sc-ep-lf/tmp/verl', 'trainer.critic_warmup=0', 'trainer.logger=[console]', 'trainer.project_name=verl_grpo_example_gsm8k', 'trainer.experiment_name=deepseek_llm_7b_function_rm', 'trainer.n_gpus_per_node=4', 'trainer.nnodes=1', 'trainer.save_freq=-1', 'trainer.test_freq=5', 'trainer.total_epochs=15'] Traceback (most recent call last): File "", line 198, in _run_module_as_main File "", line 88, in _run_code File "/lpai/code/verl/verl/trainer/main_ppo.py", line 129, in main() File "/opt/conda/lib/python3.11/site-packages/hydra/main.py", line 94, in decorated_main _run_hydra( File "/opt/conda/lib/python3.11/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra _run_app( File "/opt/conda/lib/python3.11/site-packages/hydra/_internal/utils.py", line 457, in _run_app run_and_report( File "/opt/conda/lib/python3.11/site-packages/hydra/_internal/utils.py", line 223, in run_and_report raise ex File "/opt/conda/lib/python3.11/site-packages/hydra/_internal/utils.py", line 220, in run_and_report return func() ^^^^^^ File "/opt/conda/lib/python3.11/site-packages/hydra/_internal/utils.py", line 458, in lambda: hydra.run( ^^^^^^^^^^ File "/opt/conda/lib/python3.11/site-packages/hydra/_internal/hydra.py", line 132, in run _ = ret.return_value ^^^^^^^^^^^^^^^^ File "/opt/conda/lib/python3.11/site-packages/hydra/core/utils.py", 
line 260, in return_value raise self._return_value File "/opt/conda/lib/python3.11/site-packages/hydra/core/utils.py", line 186, in run_job ret.return_value = task_function(task_cfg) ^^^^^^^^^^^^^^^^^^^^^^^ File "/lpai/code/verl/verl/trainer/main_ppo.py", line 23, in main run_ppo(config) File "/lpai/code/verl/verl/trainer/main_ppo.py", line 29, in run_ppo ray.get(main_task.remote(config, compute_score)) File "/opt/conda/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper return fn(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^ File "/opt/conda/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper return func(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/opt/conda/lib/python3.11/site-packages/ray/_private/worker.py", line 2772, in get values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/conda/lib/python3.11/site-packages/ray/_private/worker.py", line 919, in get_objects raise value.as_instanceof_cause() ray.exceptions.RayTaskError(ActorDiedError): ray::main_task() (pid=51520, ip=10.49.104.60) File "/lpai/code/verl/verl/trainer/main_ppo.py", line 125, in main_task trainer.fit() File "/lpai/code/verl/verl/trainer/ppo/ray_trainer.py", line 856, in fit gen_batch_output = self.actor_rollout_wg.generate_sequences(gen_batch) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/lpai/code/verl/verl/single_controller/ray/base.py", line 42, in func output = ray.get(output) ^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task. class_name: create_colocated_worker_cls..WorkerDict actor_id: 2cc53a05de78c1613ae93ae601000000 pid: 53890 name: vniu4QWorkerDict_0:2 namespace: 0d53a38f-4cf3-4893-a46a-34107d4a42cc ip: 10.49.104.60 The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.

  • =4 trainer.nnodes=1 trainer.save_freq=-1 trainer.test_freq=5 trainer.total_epochs=15 run_deepseek14b_llm_debug.sh: line 57: =4: command not found
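
(Side note: this last line is a shell error from the launch script rather than from verl. A line continuation in run_deepseek14b_llm_debug.sh appears to be broken, so bash tries to execute the fragment starting at "=4" as its own command. A minimal sketch of the intended form, assuming the script wraps the `python -m verl.trainer.main_ppo` invocation shown in the traceback above:)

```bash
# Sketch only: each continued line must end with a backslash and nothing after
# it; otherwise bash runs the next fragment ("=4 trainer.nnodes=1 ...") as a
# command and prints "=4: command not found".
python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    trainer.n_gpus_per_node=4 \
    trainer.nnodes=1 \
    trainer.save_freq=-1 \
    trainer.test_freq=5 \
    trainer.total_epochs=15
    # ...remaining overrides as in the full command above
```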

it happens after 5 step (main_task pid=51520) step:5 - global_seqlen/min:992132.000 - global_seqlen/max:1002660.000 - global_seqlen/minmax_diff:10528.000 - global_seqlen/balanced_min:996861.000 - global_seqlen/balanced_max:996862.000 - global_seqlen/mean:996861.250 - actor/kl_loss:0.071 - actor/kl_coef:0.001 - actor/entropy_loss:5.683 - actor/pg_loss:0.000 - actor/pg_clipfrac:0.000 - actor/ppo_kl:0.002 - actor/grad_norm:0.180 - mfu/actor:0.083 - actor/lr:0.000 - val/test_score/openai/gsm8k:0.171 - critic/score/mean:0.008 - critic/score/max:1.000 - critic/score/min:0.000 - critic/rewards/mean:0.008 - critic/rewards/max:1.000 - critic/rewards/min:0.000 - critic/advantages/mean:-0.002 - critic/advantages/max:1.789 - critic/advantages/min:-0.730 - critic/returns/mean:-0.002 - critic/returns/max:1.789 - critic/returns/min:-0.730 - response_length/mean:684.393 - response_length/max:1024.000 - response_length/min:3.000 - response_length/clip_ratio:0.521 - prompt_length/mean:94.405 - prompt_length/max:201.000 - prompt_length/min:52.000 - prompt_length/clip_ratio:0.000 - timing_s/gen:262.359 - timing_s/old_log_prob:38.159 - timing_s/ref:39.101 - timing_s/adv:1.931 - timing_s/update_actor:157.615 - timing_s/testing:78.133 - timing_s/step:577.437 - timing_per_token_ms/adv:0.000 - timing_per_token_ms/gen:0.075 - timing_per_token_ms/ref:0.010 - timing_per_token_ms/update_actor:0.040 (main_task pid=51520) step:6 - global_seqlen/min:1011457.000 - global_seqlen/max:1041570.000 - global_seqlen/minmax_diff:30113.000 - global_seqlen/balanced_min:1025392.000 - global_seqlen/balanced_max:1025392.000 - global_seqlen/mean:1025392.000 - actor/kl_loss:0.089 - actor/kl_coef:0.001 - actor/entropy_loss:6.497 - actor/pg_loss:-0.005 - actor/pg_clipfrac:0.001 - actor/ppo_kl:0.002 - actor/grad_norm:0.230 - mfu/actor:0.087 - actor/lr:0.000 - critic/score/mean:0.015 - critic/score/max:1.000 - critic/score/min:0.000 - critic/rewards/mean:0.015 - critic/rewards/max:1.000 - critic/rewards/min:0.000 - critic/advantages/mean:-0.001 - critic/advantages/max:1.789 - critic/advantages/min:-0.730 - critic/returns/mean:-0.001 - critic/returns/max:1.789 - critic/returns/min:-0.730 - response_length/mean:708.116 - response_length/max:1024.000 - response_length/min:2.000 - response_length/clip_ratio:0.570 - prompt_length/mean:92.972 - prompt_length/max:207.000 - prompt_length/min:54.000 - prompt_length/clip_ratio:0.000 - timing_s/gen:260.200 - timing_s/old_log_prob:37.754 - timing_s/ref:42.260 - timing_s/adv:2.019 - timing_s/update_actor:154.307 - timing_s/step:496.874 - timing_per_token_ms/adv:0.000 - timing_per_token_ms/gen:0.072 - timing_per_token_ms/ref:0.010 - timing_per_token_ms/update_actor:0.038 (main_task pid=51520) step:7 - global_seqlen/min:1075864.000 - global_seqlen/max:1104672.000 - global_seqlen/minmax_diff:28808.000 - global_seqlen/balanced_min:1090699.000 - global_seqlen/balanced_max:1090700.000 - global_seqlen/mean:1090699.250 - actor/kl_loss:0.097 - actor/kl_coef:0.001 - actor/entropy_loss:7.181 - actor/pg_loss:0.006 - actor/pg_clipfrac:0.001 - actor/ppo_kl:0.001 - actor/grad_norm:0.265 - mfu/actor:0.092 - actor/lr:0.000 - critic/score/mean:0.029 - critic/score/max:1.000 - critic/score/min:0.000 - critic/rewards/mean:0.029 - critic/rewards/max:1.000 - critic/rewards/min:0.000 - critic/advantages/mean:-0.005 - critic/advantages/max:1.789 - critic/advantages/min:-1.095 - critic/returns/mean:-0.005 - critic/returns/max:1.789 - critic/returns/min:-1.095 - response_length/mean:758.767 - response_length/max:1024.000 - 
response_length/min:2.000 - response_length/clip_ratio:0.641 - prompt_length/mean:93.342 - prompt_length/max:202.000 - prompt_length/min:55.000 - prompt_length/clip_ratio:0.000 - timing_s/gen:278.585 - timing_s/old_log_prob:37.637 - timing_s/ref:39.553 - timing_s/adv:2.095 - timing_s/update_actor:155.385 - timing_s/step:513.398 - timing_per_token_ms/adv:0.000 - timing_per_token_ms/gen:0.072 - timing_per_token_ms/ref:0.009 - timing_per_token_ms/update_actor:0.036 (main_task pid=51520) step:8 - global_seqlen/min:1089924.000 - global_seqlen/max:1105377.000 - global_seqlen/minmax_diff:15453.000 - global_seqlen/balanced_min:1095501.000 - global_seqlen/balanced_max:1095502.000 - global_seqlen/mean:1095501.750 - actor/kl_loss:0.131 - actor/kl_coef:0.001 - actor/entropy_loss:7.242 - actor/pg_loss:0.016 - actor/pg_clipfrac:0.001 - actor/ppo_kl:0.001 - actor/grad_norm:0.384 - mfu/actor:0.089 - actor/lr:0.000 - critic/score/mean:0.054 - critic/score/max:1.000 - critic/score/min:0.000 - critic/rewards/mean:0.054 - critic/rewards/max:1.000 - critic/rewards/min:0.000 - critic/advantages/mean:-0.004 - critic/advantages/max:1.789 - critic/advantages/min:-1.789 - critic/returns/mean:-0.004 - critic/returns/max:1.789 - critic/returns/min:-1.789 - response_length/mean:762.490 - response_length/max:1024.000 - response_length/min:2.000 - response_length/clip_ratio:0.651 - prompt_length/mean:93.371 - prompt_length/max:196.000 - prompt_length/min:55.000 - prompt_length/clip_ratio:0.000 - timing_s/gen:282.827 - timing_s/old_log_prob:38.105 - timing_s/ref:41.068 - timing_s/adv:2.358 - timing_s/update_actor:161.660 - timing_s/step:526.157 - timing_per_token_ms/adv:0.001 - timing_per_token_ms/gen:0.072 - timing_per_token_ms/ref:0.009 - timing_per_token_ms/update_actor:0.037 (main_task pid=51520) step:9 - global_seqlen/min:1108751.000 - global_seqlen/max:1155135.000 - global_seqlen/minmax_diff:46384.000 - global_seqlen/balanced_min:1128299.000 - global_seqlen/balanced_max:1128300.000 - global_seqlen/mean:1128299.250 - actor/kl_loss:0.120 - actor/kl_coef:0.001 - actor/entropy_loss:7.038 - actor/pg_loss:-0.003 - actor/pg_clipfrac:0.001 - actor/ppo_kl:0.001 - actor/grad_norm:0.410 - mfu/actor:0.093 - actor/lr:0.000 - critic/score/mean:0.095 - critic/score/max:1.000 - critic/score/min:0.000 - critic/rewards/mean:0.095 - critic/rewards/max:1.000 - critic/rewards/min:0.000 - critic/advantages/mean:-0.007 - critic/advantages/max:1.789 - critic/advantages/min:-1.789 - critic/returns/mean:-0.007 - critic/returns/max:1.789 - critic/returns/min:-1.789 - response_length/mean:787.399 - response_length/max:1024.000 - response_length/min:2.000 - response_length/clip_ratio:0.676 - prompt_length/mean:94.085 - prompt_length/max:246.000 - prompt_length/min:46.000 - prompt_length/clip_ratio:0.000 - timing_s/gen:297.897 - timing_s/old_log_prob:42.084 - timing_s/ref:41.627 - timing_s/adv:2.493 - timing_s/update_actor:159.831 - timing_s/step:544.060 - timing_per_token_ms/adv:0.001 - timing_per_token_ms/gen:0.074 - timing_per_token_ms/ref:0.009 - timing_per_token_ms/update_actor:0.035 (WorkerDict pid=53890) INFO 02-19 09:20:53 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250219-092053.pkl... 
(WorkerDict pid=53890) WARNING 02-19 09:20:53 model_runner_base.py:143] Failed to pickle inputs of failed execution: CUDA error: an illegal memory access was encountered (WorkerDict pid=53890) WARNING 02-19 09:20:53 model_runner_base.py:143] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. (WorkerDict pid=53890) WARNING 02-19 09:20:53 model_runner_base.py:143] For debugging consider passing CUDA_LAUNCH_BLOCKING=1 (WorkerDict pid=53890) WARNING 02-19 09:20:53 model_runner_base.py:143] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions. (WorkerDict pid=53890) WARNING 02-19 09:20:53 model_runner_base.py:143] (WorkerDict pid=53890) [rank2]:[E219 09:20:53.692926346 ProcessGroupNCCL.cpp:1515] [PG 9 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered

yiyepiaoling0715 avatar Feb 19 '25 09:02 yiyepiaoling0715
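
As the PyTorch message in the log already hints, the illegal access is reported asynchronously, so the watchdog stack does not necessarily point at the faulting kernel. A minimal debugging sketch for the next run (environment variables only; the script name is the one from the log above):

```bash
# Make CUDA kernel launches synchronous so the stack trace points at the kernel
# that actually faults (much slower; for debugging runs only).
export CUDA_LAUNCH_BLOCKING=1
# Log NCCL communicator setup/teardown to correlate the watchdog abort with a
# specific process group.
export NCCL_DEBUG=INFO

bash run_deepseek14b_llm_debug.sh
```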

Could you try `export VLLM_ATTENTION_BACKEND=XFORMERS` before launching the Ray job?

eric-haibin-lin avatar Feb 21 '25 03:02 eric-haibin-lin
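
For anyone trying this suggestion, the variable needs to be exported before the job is launched so the Ray worker processes see it; a minimal sketch, using the launch script named in the logs above:

```bash
# Switch vLLM's attention backend to xformers for the rollout workers.
export VLLM_ATTENTION_BACKEND=XFORMERS

bash run_deepseek14b_llm_debug.sh
```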

still has error (main_task pid=80995) step:4 - global_seqlen/min:1165716.000 - global_seqlen/max:1178432.000 - global_seqlen/minmax_diff:12716.000 - global_seqlen/balanced_min:1171998.000 - global_seqlen/balanced_max:1171998.000 - global_seqlen/mean:1171998.000 - actor/kl_loss:0.347 - actor/kl_coef:0.001 - actor/entropy_loss:9.025 - actor/pg_loss:0.002 - actor/pg_clipfrac:0.001 - actor/ppo_kl:0.027 - actor/grad_norm:0.069 - mfu/actor:0.064 - actor/lr:0.000 - critic/score/mean:0.001 - critic/score/max:1.000 - critic/score/min:0.000 - critic/rewards/mean:0.001 - critic/rewards/max:1.000 - critic/rewards/min:0.000 - critic/advantages/mean:-0.000 - critic/advantages/max:1.789 - critic/advantages/min:-0.447 - critic/returns/mean:-0.000 - critic/returns/max:1.789 - critic/returns/min:-0.447 - response_length/mean:821.931 - response_length/max:1024.000 - response_length/min:2.000 - response_length/clip_ratio:0.715 - prompt_length/mean:93.692 - prompt_length/max:228.000 - prompt_length/min:46.000 - prompt_length/clip_ratio:0.000 - timing_s/gen:371.701 - timing_s/old_log_prob:56.238 - timing_s/ref:60.404 - timing_s/adv:2.177 - timing_s/update_actor:237.502 - timing_s/step:728.324 - timing_per_token_ms/update_actor:0.051 - timing_per_token_ms/adv:0.000 - timing_per_token_ms/ref:0.013 - timing_per_token_ms/gen:0.088 (main_task pid=80995) gen_batch.shape: torch.Size([1024, 512]) (main_task pid=80995) before union gen_batch_output.shape: torch.Size([5120, 1536]) (main_task pid=80995) batch.shape: torch.Size([5120, 1536]),gen_batch_output.shape: torch.Size([5120, 1536]) (WorkerDict pid=93658) [rank2]:[E221 04:32:47.671648643 ProcessGroupNCCL.cpp:1515] [PG 9 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered (WorkerDict pid=93658) Compile with TORCH_USE_CUDA_DSAto enable device-side assertions. 
(WorkerDict pid=93658) (WorkerDict pid=93658) Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1720538435607/work/c10/cuda/CUDAException.cpp:43 (most recent call first): (WorkerDict pid=93658) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fbbc04d2f86 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so) (WorkerDict pid=93658) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fbbc0481d10 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so) (WorkerDict pid=93658) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fbbc05aef08 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10_cuda.so) (WorkerDict pid=93658) frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fbb66498bc6 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=93658) frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7fbb6649dde0 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=93658) frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7fbb664a4a9a in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=93658) frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fbb664a6edc in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=93658) frame #7: <unknown function> + 0xdbbf4 (0x7fcaf15c6bf4 in /opt/conda/bin/../lib/libstdc++.so.6) (WorkerDict pid=93658) frame #8: <unknown function> + 0x94ac3 (0x7fcaf39edac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) (WorkerDict pid=93658) frame #9: <unknown function> + 0x126850 (0x7fcaf3a7f850 in /usr/lib/x86_64-linux-gnu/libc.so.6) (WorkerDict pid=93658) (WorkerDict pid=93658) [2025-02-21 04:32:47,363 E 93658 108927] logging.cc:108: Unhandled exception: N3c1016DistBackendErrorE. what(): [PG 9 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered (WorkerDict pid=93658) Compile withTORCH_USE_CUDA_DSAto enable device-side assertions. 
(WorkerDict pid=93658) (WorkerDict pid=93658) Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1720538435607/work/c10/cuda/CUDAException.cpp:43 (most recent call first): (WorkerDict pid=93658) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fbbc04d2f86 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so) (WorkerDict pid=93658) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fbbc0481d10 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so) (WorkerDict pid=93658) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fbbc05aef08 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10_cuda.so) (WorkerDict pid=93658) frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fbb66498bc6 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=93658) frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7fbb6649dde0 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=93658) frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7fbb664a4a9a in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=93658) frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fbb664a6edc in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=93658) frame #7: <unknown function> + 0xdbbf4 (0x7fcaf15c6bf4 in /opt/conda/bin/../lib/libstdc++.so.6) (WorkerDict pid=93658) frame #8: <unknown function> + 0x94ac3 (0x7fcaf39edac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) (WorkerDict pid=93658) frame #9: <unknown function> + 0x126850 (0x7fcaf3a7f850 in /usr/lib/x86_64-linux-gnu/libc.so.6) (WorkerDict pid=93658) (WorkerDict pid=93658) Exception raised from ncclCommWatchdog at /opt/conda/conda-bld/pytorch_1720538435607/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first): (WorkerDict pid=93658) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fbbc04d2f86 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so) (WorkerDict pid=93658) frame #1: <unknown function> + 0xe3ec34 (0x7fbb66126c34 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=93658) frame #2: <unknown function> + 0xdbbf4 (0x7fcaf15c6bf4 in /opt/conda/bin/../lib/libstdc++.so.6) (WorkerDict pid=93658) frame #3: <unknown function> + 0x94ac3 (0x7fcaf39edac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) (WorkerDict pid=93658) frame #4: <unknown function> + 0x126850 (0x7fcaf3a7f850 in /usr/lib/x86_64-linux-gnu/libc.so.6) (WorkerDict pid=93658) (WorkerDict pid=93663) /opt/conda/lib/python3.11/site-packages/torch/utils/checkpoint.py:1399: FutureWarning:torch.cpu.amp.autocast(args...)is deprecated. Please usetorch.amp.autocast('cpu', args...)instead. [repeated 3x across cluster] (WorkerDict pid=93663) with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context: # type: ignore[attr-defined] [repeated 3x across cluster] (WorkerDict pid=93658) INFO 02-21 04:32:47 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250221-043247.pkl... 
(WorkerDict pid=93658) WARNING 02-21 04:32:47 model_runner_base.py:143] Failed to pickle inputs of failed execution: CUDA error: an illegal memory access was encountered (WorkerDict pid=93658) WARNING 02-21 04:32:47 model_runner_base.py:143] Compile withTORCH_USE_CUDA_DSAto enable device-side assertions. (WorkerDict pid=93658) WARNING 02-21 04:32:47 model_runner_base.py:143] (WorkerDict pid=93663) (WorkerDict pid=93663) (WorkerDict pid=93663) (WorkerDict pid=93663) (WorkerDict pid=93663) (WorkerDict pid=93658) [2025-02-21 04:32:47,391 E 93658 108927] logging.cc:115: Stack trace: (WorkerDict pid=93658) /opt/conda/lib/python3.11/site-packages/ray/_raylet.so(+0x135bf3a) [0x7fcaf2a69f3a] ray::operator<<() (WorkerDict pid=93658) /opt/conda/lib/python3.11/site-packages/ray/_raylet.so(+0x135f4c2) [0x7fcaf2a6d4c2] ray::TerminateHandler() (WorkerDict pid=93658) /opt/conda/bin/../lib/libstdc++.so.6(+0xb135a) [0x7fcaf159c35a] __cxxabiv1::__terminate() (WorkerDict pid=93658) /opt/conda/bin/../lib/libstdc++.so.6(+0xb13c5) [0x7fcaf159c3c5] (WorkerDict pid=93658) /opt/conda/bin/../lib/libstdc++.so.6(+0xb134f) [0x7fcaf159c34f] (WorkerDict pid=93658) /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so(+0xe3ece5) [0x7fbb66126ce5] c10d::ProcessGroupNCCL::ncclCommWatchdog() (WorkerDict pid=93658) /opt/conda/bin/../lib/libstdc++.so.6(+0xdbbf4) [0x7fcaf15c6bf4] execute_native_thread_routine (WorkerDict pid=93658) /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fcaf39edac3] (WorkerDict pid=93658) /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7fcaf3a7f850] (WorkerDict pid=93658) (WorkerDict pid=93658) *** SIGABRT received at time=1740112367 on cpu 71 *** (WorkerDict pid=93658) PC: @ 0x7fcaf39ef9fc (unknown) pthread_kill (WorkerDict pid=93658) @ 0x7fcaf399b520 (unknown) (unknown) (WorkerDict pid=93658) [2025-02-21 04:32:47,391 E 93658 108927] logging.cc:460: *** SIGABRT received at time=1740112367 on cpu 71 *** (WorkerDict pid=93658) [2025-02-21 04:32:47,391 E 93658 108927] logging.cc:460: PC: @ 0x7fcaf39ef9fc (unknown) pthread_kill (WorkerDict pid=93658) [2025-02-21 04:32:47,391 E 93658 108927] logging.cc:460: @ 0x7fcaf399b520 (unknown) (unknown) (WorkerDict pid=93658) Fatal Python error: Aborted (WorkerDict pid=93658)

yiyepiaoling0715 avatar Feb 21 '25 06:02 yiyepiaoling0715

WorkerDict pid=93658) Extension modules: msgpack._cmsgpack, google._upb._message, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, _brotli, zstandard.backend_c, charset_normalizer.md, uvloop.loop, ray._raylet, numpy._core._multiarray_umath, numpy.linalg._umath_linalg, pyarrow.lib, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, gmpy2.gmpy2, markupsafe._speedups, PIL._imaging, sklearn.__check_build._check_build, scipy._lib._ccallback_c, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg._matfuncs_expm, scipy.linalg._linalg_pythran, scipy.linalg.cython_blas, scipy.linalg._decomp_update, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.linalg._propack._spropack, scipy.sparse.linalg._propack._dpropack, scipy.sparse.linalg._propack._cpropack, scipy.sparse.linalg._propack._zpropack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.spatial._ckdtree, scipy._lib.messagestream, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.spatial.transform._rotation, scipy.optimize._group_columns, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize._cython_nnls, scipy._lib._uarray._uarray, scipy.linalg._decomp_interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.optimize._direct, scipy.integrate._odepack, scipy.integrate._quadpack, scipy.integrate._vode, scipy.integrate._dop, scipy.integrate._lsoda, scipy.interpolate._fitpack, 
scipy.interpolate._dfitpack, scipy.interpolate._dierckx, scipy.interpolate._ppoly, scipy.interpolate._interpnd, scipy.interpolate._rbfinterp_pythran, scipy.interpolate._rgi_cython, scipy.interpolate._bspl, scipy.special.cython_special, scipy.stats._stats, scipy.stats._sobol, scipy.stats._qmc_cy, scipy.stats._biasedurn, scipy.stats._stats_pythran, scipy.stats._levy_stable.levyst, scipy.stats._ansari_swilk_statistics, scipy.stats._mvn, scipy.stats._rcont.rcont, scipy.ndimage._nd_image, scipy.ndimage._rank_filter_1d, _ni_label, scipy.ndimage._ni_label, sklearn.utils._isfinite, sklearn.utils.sparsefuncs_fast, sklearn.utils.murmurhash, sklearn.utils._openmp_helpers, sklearn.metrics.cluster._expected_mutual_info_fast, sklearn.preprocessing._csr_polynomial_expansion, sklearn.preprocessing._target_encoder_fast, sklearn.metrics._dist_metrics, sklearn.metrics._pairwise_distances_reduction._datasets_pair, sklearn.utils._cython_blas, sklearn.metrics._pairwise_distances_reduction._base, sklearn.metrics._pairwise_distances_reduction._middle_term_computer, sklearn.utils._heap, sklearn.utils._sorting, sklearn.metrics._pairwise_distances_reduction._argkmin, sklearn.metrics._pairwise_distances_reduction._argkmin_classmode, sklearn.utils._vector_sentinel, sklearn.metrics._pairwise_distances_reduction._radius_neighbors, sklearn.metrics._pairwise_distances_reduction._radius_neighbors_classmode, sklearn.metrics._pairwise_fast, PIL._imagingft, regex._regex, msgspec._core, sentencepiece._sentencepiece, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, pyarrow._json, zmq.backend.cython._zmq, cuda_utils, __triton_launcher (total: 194) (WorkerDict pid=93663) /opt/conda/bin/../lib/libstdc++.so.6(+0xb135a) [0x7fd223bac35a] __cxxabiv1::__terminate() (WorkerDict pid=93663) /opt/conda/bin/../lib/libstdc++.so.6(+0xb13c5) [0x7fd223bac3c5] (WorkerDict pid=93663) /opt/conda/bin/../lib/libstdc++.so.6(+0xb134f) [0x7fd223bac34f] (WorkerDict pid=93663) /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so(+0xe3ece5) [0x7fc298678ce5] c10d::ProcessGroupNCCL::ncclCommWatchdog() (WorkerDict pid=93663) /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fd225ffdac3] (WorkerDict pid=93663) /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7fd22608f850] (WorkerDict pid=93663) (WorkerDict pid=93663) (WorkerDict pid=93663) (WorkerDict pid=93658) /opt/conda/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown (WorkerDict pid=93658) warnings.warn('resource_tracker: There appear to be %d ' (WorkerDict pid=93663) [rank3]:[E221 04:32:47.671650967 ProcessGroupNCCL.cpp:1515] [PG 9 Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered (WorkerDict pid=93663) Compile with TORCH_USE_CUDA_DSA to enable device-side assertions. 
[repeated 2x across cluster] (WorkerDict pid=93663) Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1720538435607/work/c10/cuda/CUDAException.cpp:43 (most recent call first): [repeated 2x across cluster] (WorkerDict pid=93663) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fc2f409cf86 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so) [repeated 3x across cluster] (WorkerDict pid=93663) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fc2f404bd10 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so) [repeated 2x across cluster] (WorkerDict pid=93663) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fc2f4178f08 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10_cuda.so) [repeated 2x across cluster] (WorkerDict pid=93663) frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fc2989eabc6 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) [repeated 2x across cluster] (WorkerDict pid=93663) frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fc2989f8edc in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) [repeated 6x across cluster] (WorkerDict pid=93663) frame #4: <unknown function> + 0x126850 (0x7fd22608f850 in /usr/lib/x86_64-linux-gnu/libc.so.6) [repeated 10x across cluster] (WorkerDict pid=93663) [2025-02-21 04:32:47,363 E 93663 108930] logging.cc:108: Unhandled exception: N3c1016DistBackendErrorE. what(): [PG 9 Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered (WorkerDict pid=93663) Exception raised from ncclCommWatchdog at /opt/conda/conda-bld/pytorch_1720538435607/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first): (WorkerDict pid=93663) [2025-02-21 04:32:47,391 E 93663 108930] logging.cc:115: Stack trace: (WorkerDict pid=93663) /opt/conda/lib/python3.11/site-packages/ray/_raylet.so(+0x135bf3a) [0x7fd225079f3a] ray::operator<<() (WorkerDict pid=93663) /opt/conda/lib/python3.11/site-packages/ray/_raylet.so(+0x135f4c2) [0x7fd22507d4c2] ray::TerminateHandler() (WorkerDict pid=93663) /opt/conda/bin/../lib/libstdc++.so.6(+0xdbbf4) [0x7fd223bd6bf4] execute_native_thread_routine (WorkerDict pid=93663) *** SIGABRT received at time=1740112367 on cpu 70 *** (WorkerDict pid=93663) PC: @ 0x7fd225fff9fc (unknown) pthread_kill (WorkerDict pid=93663) @ 0x7fd225fab520 (unknown) (unknown) (WorkerDict pid=93663) [2025-02-21 04:32:47,391 E 93663 108930] logging.cc:460: *** SIGABRT received at time=1740112367 on cpu 70 *** (WorkerDict pid=93663) [2025-02-21 04:32:47,391 E 93663 108930] logging.cc:460: PC: @ 0x7fd225fff9fc (unknown) pthread_kill (WorkerDict pid=93663) [2025-02-21 04:32:47,391 E 93663 108930] logging.cc:460: @ 0x7fd225fab520 (unknown) (unknown) (WorkerDict pid=93663) Fatal Python error: Aborted

yiyepiaoling0715 avatar Feb 21 '25 06:02 yiyepiaoling0715

(WorkerDict pid=93663) Extension modules: msgpack._cmsgpack, google._upb._message, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, _brotli, zstandard.backend_c, charset_normalizer.md, uvloop.loop, ray._raylet, numpy._core._multiarray_umath, numpy.linalg._umath_linalg, pyarrow.lib, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, gmpy2.gmpy2, markupsafe._speedups, PIL._imaging, sklearn.__check_build._check_build, scipy._lib._ccallback_c, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg._matfuncs_expm, scipy.linalg._linalg_pythran, scipy.linalg.cython_blas, scipy.linalg._decomp_update, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.linalg._propack._spropack, scipy.sparse.linalg._propack._dpropack, scipy.sparse.linalg._propack._cpropack, scipy.sparse.linalg._propack._zpropack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.spatial._ckdtree, scipy._lib.messagestream, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.spatial.transform._rotation, scipy.optimize._group_columns, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize._cython_nnls, scipy._lib._uarray._uarray, scipy.linalg._decomp_interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.optimize._direct, scipy.integrate._odepack, scipy.integrate._quadpack, scipy.integrate._vode, scipy.integrate._dop, scipy.integrate._lsoda, scipy.interpolate._fitpack, 
scipy.interpolate._dfitpack, scipy.interpolate._dierckx, scipy.interpolate._ppoly, scipy.interpolate._interpnd, scipy.interpolate._rbfinterp_pythran, scipy.interpolate._rgi_cython, scipy.interpolate._bspl, scipy.special.cython_special, scipy.stats._stats, scipy.stats._sobol, scipy.stats._qmc_cy, scipy.stats._biasedurn, scipy.stats._stats_pythran, scipy.stats._levy_stable.levyst, scipy.stats._ansari_swilk_statistics, scipy.stats._mvn, scipy.stats._rcont.rcont, scipy.ndimage._nd_image, scipy.ndimage._rank_filter_1d, _ni_label, scipy.ndimage._ni_label, sklearn.utils._isfinite, sklearn.utils.sparsefuncs_fast, sklearn.utils.murmurhash, sklearn.utils._openmp_helpers, sklearn.metrics.cluster._expected_mutual_info_fast, sklearn.preprocessing._csr_polynomial_expansion, sklearn.preprocessing._target_encoder_fast, sklearn.metrics._dist_metrics, sklearn.metrics._pairwise_distances_reduction._datasets_pair, sklearn.utils._cython_blas, sklearn.metrics._pairwise_distances_reduction._base, sklearn.metrics._pairwise_distances_reduction._middle_term_computer, sklearn.utils._heap, sklearn.utils._sorting, sklearn.metrics._pairwise_distances_reduction._argkmin, sklearn.metrics._pairwise_distances_reduction._argkmin_classmode, sklearn.utils._vector_sentinel, sklearn.metrics._pairwise_distances_reduction._radius_neighbors, sklearn.metrics._pairwise_distances_reduction._radius_neighbors_classmode, sklearn.metrics._pairwise_fast, PIL._imagingft, regex._regex, msgspec._core, sentencepiece._sentencepiece, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, pyarrow._json, zmq.backend.cython._zmq, cuda_utils, __triton_launcher (total: 194) (raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffa1c8eaa4c4d82b37803c0cd101000000 Worker ID: c11941743afb2844fc911f07a15f71a4b06918218e0b932276ce5255 Node ID: c841facebde7fc045123353bfbf0a44fe3eddd50189e48864242b3e7 Worker IP address: 10.49.144.10 Worker port: 32827 Worker PID: 93658 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors. (WorkerDict pid=93663) INFO 02-21 04:32:47 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250221-043247.pkl... (WorkerDict pid=93663) WARNING 02-21 04:32:47 model_runner_base.py:143] Failed to pickle inputs of failed execution: CUDA error: an illegal memory access was encountered (WorkerDict pid=93663) WARNING 02-21 04:32:47 model_runner_base.py:143] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions. 
(WorkerDict pid=93663) WARNING 02-21 04:32:47 model_runner_base.py:143] Error executing job with overrides: ['algorithm.adv_estimator=grpo', 'data.train_files=/lpai/volumes/jfs-sc-ep-lf/corpus/public_data/rl_data/verl_gsm8k/train.parquet', 'data.val_files=/lpai/volumes/jfs-sc-ep-lf/corpus/public_data/rl_data/verl_gsm8k/test.parquet', 'data.train_batch_size=1024', 'data.val_batch_size=1312', 'data.max_prompt_length=512', 'data.max_response_length=1024', 'actor_rollout_ref.model.path=/lpai/volumes/zxd-code-complete-lf/data/models/qwen/qwen__qwen2_5-0_5b/24-09-25-1232', 'actor_rollout_ref.actor.optim.lr=1e-6', 'actor_rollout_ref.model.use_remove_padding=True', 'actor_rollout_ref.actor.ppo_mini_batch_size=8', 'actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2', 'actor_rollout_ref.actor.use_kl_loss=True', 'actor_rollout_ref.actor.kl_loss_coef=0.001', 'actor_rollout_ref.actor.kl_loss_type=low_var_kl', 'actor_rollout_ref.model.enable_gradient_checkpointing=True', 'actor_rollout_ref.actor.fsdp_config.param_offload=True', 'actor_rollout_ref.actor.fsdp_config.grad_offload=True', 'actor_rollout_ref.actor.fsdp_config.optimizer_offload=True', 'actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2', 'actor_rollout_ref.rollout.tensor_model_parallel_size=2', 'actor_rollout_ref.rollout.name=vllm', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.6', 'actor_rollout_ref.rollout.n=5', 'actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=2', 'actor_rollout_ref.ref.fsdp_config.param_offload=True', 'algorithm.kl_ctrl.kl_coef=0.001', 'trainer.default_local_dir=/lpai/volumes/jfs-sc-ep-lf/tmp/verl', 'trainer.critic_warmup=0', 'trainer.logger=[console,swanlab]', 'trainer.project_name=verl_grpo_example_gsm8k', 'trainer.experiment_name=deepseek_llm_7b_function_rm', 'trainer.n_gpus_per_node=4', 'trainer.nnodes=1', 'trainer.save_freq=-1', 'trainer.test_freq=5', 'trainer.total_epochs=15']

yiyepiaoling0715 avatar Feb 21 '25 06:02 yiyepiaoling0715

Traceback (most recent call last): File "<frozen runpy>", line 198, in _run_module_as_main File "<frozen runpy>", line 88, in _run_code File "/lpai/code/verl/verl/trainer/main_ppo.py", line 129, in <module> main() File "/opt/conda/lib/python3.11/site-packages/hydra/main.py", line 94, in decorated_main _run_hydra( File "/opt/conda/lib/python3.11/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra _run_app( File "/opt/conda/lib/python3.11/site-packages/hydra/_internal/utils.py", line 457, in _run_app run_and_report( File "/opt/conda/lib/python3.11/site-packages/hydra/_internal/utils.py", line 223, in run_and_report raise ex File "/opt/conda/lib/python3.11/site-packages/hydra/_internal/utils.py", line 220, in run_and_report return func() ^^^^^^ File "/opt/conda/lib/python3.11/site-packages/hydra/_internal/utils.py", line 458, in <lambda> lambda: hydra.run( ^^^^^^^^^^ File "/opt/conda/lib/python3.11/site-packages/hydra/_internal/hydra.py", line 132, in run _ = ret.return_value ^^^^^^^^^^^^^^^^ File "/opt/conda/lib/python3.11/site-packages/hydra/core/utils.py", line 260, in return_value raise self._return_value File "/opt/conda/lib/python3.11/site-packages/hydra/core/utils.py", line 186, in run_job ret.return_value = task_function(task_cfg) ^^^^^^^^^^^^^^^^^^^^^^^ File "/lpai/code/verl/verl/trainer/main_ppo.py", line 23, in main run_ppo(config) File "/lpai/code/verl/verl/trainer/main_ppo.py", line 29, in run_ppo ray.get(main_task.remote(config, compute_score)) File "/opt/conda/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper return fn(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^ File "/opt/conda/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper return func(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/opt/conda/lib/python3.11/site-packages/ray/_private/worker.py", line 2772, in get values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/conda/lib/python3.11/site-packages/ray/_private/worker.py", line 919, in get_objects raise value.as_instanceof_cause() ray.exceptions.RayTaskError(ActorDiedError): ray::main_task() (pid=80995, ip=10.49.144.10) File "/lpai/code/verl/verl/trainer/main_ppo.py", line 125, in main_task trainer.fit() File "/lpai/code/verl/verl/trainer/ppo/ray_trainer.py", line 956, in fit val_metrics: dict = self._validate() ^^^^^^^^^^^^^^^^ File "/lpai/code/verl/verl/trainer/ppo/ray_trainer.py", line 603, in _validate test_output_gen_batch_padded = self.actor_rollout_wg.generate_sequences(test_gen_batch_padded) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/lpai/code/verl/verl/single_controller/ray/base.py", line 42, in func output = ray.get(output) ^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task. class_name: create_colocated_worker_cls.<locals>.WorkerDict actor_id: a1c8eaa4c4d82b37803c0cd101000000 pid: 93658 name: CFjVWSWorkerDict_0:2 namespace: f59a5a76-e3b6-44c6-a851-cd9cd27790f6 ip: 10.49.144.10 The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. 
(3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.

yiyepiaoling0715 avatar Feb 21 '25 06:02 yiyepiaoling0715

branch=v0.2.0.post1

yiyepiaoling0715 avatar Feb 21 '25 06:02 yiyepiaoling0715

same here

asirgogogo avatar Feb 27 '25 11:02 asirgogogo

Same here, running multi-node (4 × 4 A100-80G).

fyqqyf avatar Mar 06 '25 01:03 fyqqyf

Try: `export VLLM_USE_V1=0`

Andrewzh112 avatar Mar 08 '25 23:03 Andrewzh112
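
A sketch of combining this with the earlier suggestion, in case it helps someone reproduce:

```bash
# Fall back to vLLM's V0 engine instead of the V1 engine.
export VLLM_USE_V1=0
# Optionally also switch the attention backend, as suggested above.
export VLLM_ATTENTION_BACKEND=XFORMERS

bash run_deepseek14b_llm_debug.sh
```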

Please avoid using vLLM 0.7.x; v0.6.3 and v0.8.2 are stable. For new cases, please report your command and the diagnosis result from https://github.com/volcengine/verl/blob/main/scripts/diagnose.py

eric-haibin-lin avatar Apr 06 '25 19:04 eric-haibin-lin
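
For reference, pinning vLLM and collecting the diagnosis might look like this (a sketch; check the linked script for its actual options):

```bash
# Pin vLLM to one of the versions reported as stable with verl.
pip install "vllm==0.8.2"   # or: pip install "vllm==0.6.3"

# From a verl checkout, collect environment info to attach to a new report
# (invocation is an assumption; see the linked diagnose.py for details).
python3 scripts/diagnose.py
```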

Same here, and I have tried both `export VLLM_ATTENTION_BACKEND=XFORMERS` and `export VLLM_USE_V1=0`; neither worked. However, the error does not appear if I use only one GPU.

javyduck avatar Apr 09 '25 10:04 javyduck
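
Since the crash only shows up with more than one GPU, one way to narrow it down (a diagnostic sketch, not a fix) is to keep the multi-GPU run but drop the rollout tensor parallelism that the commands in this issue set to 2:

```bash
# Same launch as in the issue, with only the rollout TP override changed, to
# check whether the illegal access is tied to the TP=2 vLLM rollout path.
python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    trainer.n_gpus_per_node=4 \
    trainer.nnodes=1
    # ...plus the remaining overrides from the original command
```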

same problem here

yuleiqin avatar May 10 '25 00:05 yuleiqin

Any fixes for this yet?

Datta0 avatar Jul 05 '25 16:07 Datta0

Same problem here. Any new fixes or advice?

jiangzizi avatar Jul 21 '25 10:07 jiangzizi

Is there a solution yet? Any guidance would be appreciated.

SimonHeye avatar Sep 24 '25 17:09 SimonHeye