I think compute_response_mask should not be called in trainer.py; otherwise response_mask may not be contiguous. Should it be moved into deamon.py instead?
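For what it's worth, a minimal sketch of the contiguity concern, with hypothetical code (the helper body and the 1024-token response length are illustrative, not necessarily verl's actual implementation): slicing the response columns out of the full attention mask yields a non-contiguous view, so it should be made contiguous before it goes into the batch.

```python
import torch

# Hypothetical sketch: slicing the last columns of the attention mask returns a
# non-contiguous view, so make it contiguous before storing it in the batch and
# moving it across devices.
def compute_response_mask(attention_mask: torch.Tensor, response_length: int) -> torch.Tensor:
    response_mask = attention_mask[:, -response_length:]  # non-contiguous view
    return response_mask.contiguous()

attention_mask = torch.ones(56, 16384, dtype=torch.int64)
response_mask = compute_response_mask(attention_mask, response_length=1024)
assert response_mask.is_contiguous()
```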
This is the batch data. I have crashed many times in batch.to(device); maybe because of FSDP?

(TaskRunner pid=73703)
DataProto(batch=TensorDict(
    fields={
        attention_mask: Tensor(shape=torch.Size([56, 16384]), device=cpu, dtype=torch.int64, is_shared=False),
        input_ids: Tensor(shape=torch.Size([56, 16384]), device=cpu, dtype=torch.int64, is_shared=False),
        is_drop_mask: Tensor(shape=torch.Size([56]), device=cpu, dtype=torch.bool, is_shared=False),
        position_ids: Tensor(shape=torch.Size([56, 16384]), device=cpu, dtype=torch.int64, is_shared=False),
        prompts: Tensor(shape=torch.Size([56, 15360]), device=cpu, dtype=torch.int64, is_shared=False),
        response_mask: Tensor(shape=torch.Size([56, 1024]), device=cpu, dtype=torch.int64, is_shared=False),
        responses: Tensor(shape=torch.Size([56, 1024]), device=cpu, dtype=torch.int64, is_shared=False),
        token_level_scores: Tensor(shape=torch.Size([56, 1024]), device=cpu, dtype=torch.bfloat16, is_shared=False)},
    batch_size=torch.Size([56]),
    device=None,

Error MSG (WorkerDict pid=74817): ...

Exception in thread Thread-3 (_loop_forever):
Traceback (most recent call last):
  File "/opt/conda/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.12/threading.py", line 1012, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.12/site-packages/verl/workers/rollout/vllm_rollout/vllm_rollout_spmd.py", line 453, in _loop_forever
    result = self.execute_method(method, *args, **kwargs)
  File "/opt/conda/lib/python3.12/site-packages/verl/workers/rollout/vllm_rollout/vllm_rollout_spmd.py", line 501, in execute_method
    return self.inference_engine.execute_method(method, *args, **kwargs)
  File "/opt/conda/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 628, in execute_method
    raise e
  File "/opt/conda/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 619, in execute_method
    return run_method(self, method, args, kwargs)
  File "/opt/conda/lib/python3.12/site-packages/vllm/utils/__init__.py", line 3060, in run_method
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 436, in execute_model
    output = self.model_runner.execute_model(scheduler_output,
  File "/opt/conda/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2129, in execute_model
    ) = self._bookkeeping_sync(scheduler_output, sampler_output,
  File "/opt/conda/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1929, in _bookkeeping_sync
    valid_sampled_token_ids = self._to_list(sampled_token_ids)
  File "/opt/conda/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3742, in _to_list
    self.transfer_event.synchronize()
  File "/opt/conda/lib/python3.12/site-packages/torch/cuda/streams.py", line 231, in synchronize
    super().synchronize()
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
Still can't find the real reason.
The only clue I found with the Ray debugger is this. I don't think batch.to(device) itself causes a CUDA OOM; the computation has not even started at that point.
Not this reason
I confirmed the problem is non_blocking=True. I changed this line in the tensordict package's base.py:
storage_cast = storage.to(device, non_blocking=False)
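For reference, a minimal sketch of the same blocking-copy idea applied at the call site instead of editing site-packages (the TensorDict contents are illustrative, with shapes taken from the dump above):

```python
import torch
from tensordict import TensorDict

# Illustrative TensorDict mirroring two fields from the dump above.
td = TensorDict(
    {
        "input_ids": torch.randint(0, 100, (56, 16384)),
        "response_mask": torch.ones(56, 1024, dtype=torch.int64),
    },
    batch_size=[56],
)

device = torch.device("cuda:0")
# Force a synchronous host-to-device copy (the same effect as the base.py edit),
# then synchronize so any transfer error surfaces here rather than at a later kernel.
td = td.to(device, non_blocking=False)
torch.cuda.synchronize(device)
```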
Still not the reason; maybe I need to try the sglang backend.
What's your environment, and in what step does it go wrong? (training/validation/first step/after a few steps)
Do you have multiple GPUs and multiple nodes?
One node, 8×H20.
- torch: 2.8.0+cu128
- verl: v0.5.0
- vllm: v0.10.2
- flash-attn: v2.8.3

Usually it randomly fails at a training step when DataProto moves the tensordict from CPU to GPU, which in verl is the line self.batch.to(device). I have also seen it fail in load_fsdp_model_to_gpu(self.actor_module_fsdp), but I think the real cause is these lines in vLLM:
File "/opt/conda/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1929, in _bookkeeping_sync
valid_sampled_token_ids = self._to_list(sampled_token_ids)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3742, in _to_list
self.transfer_event.synchronize()
File "/opt/conda/lib/python3.12/site-packages/torch/cuda/streams.py", line 231, in synchronize
super().synchronize()
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
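Since the message above recommends CUDA_LAUNCH_BLOCKING=1, one way to apply it to every Ray worker before CUDA initializes is through the Ray runtime_env; this is only a sketch, not verl-specific configuration:

```python
import ray

# With synchronous kernel launches the illegal-memory-access error is reported at
# the offending call instead of at a later synchronization point (much slower; debug only).
ray.init(
    runtime_env={
        "env_vars": {
            "CUDA_LAUNCH_BLOCKING": "1",
        }
    }
)
```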
Could these be occasional failures caused by unsafe CUDA memory operations? I am not sure.
In the near future I plan to try to implement an agentlightning version with sglang as the backend to see if this problem still occurs.
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
This looks like a GPU OOM error to me.
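To check whether memory pressure is actually involved, a small helper can log allocator stats around the rollout/training boundary; a sketch using standard torch APIs, with placement left up to the trainer:

```python
import torch

def log_gpu_memory(tag: str) -> None:
    # Numbers close to the device capacity point to memory pressure; an illegal
    # memory access with plenty of headroom points elsewhere (bad kernel, race, stale pointer).
    allocated = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    peak = torch.cuda.max_memory_allocated() / 2**30
    print(f"[{tag}] allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB, peak={peak:.2f} GiB")
```

Calling it right before self.batch.to(device) and right after rollout would show whether the H20s are close to full when the crash happens.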
In the near future I plan to try to implement an agentlightning version with sglang as the backend to see if this problem still occurs.
Good luck with that! You're welcome to contribute back if you make some progress. 👍
verl dropped the chat_completion design in its latest version, which is very inconvenient. I think it's time to switch to another RL backend like AReaL.
verl dropped the chat_completion design in its latest version, which is very inconvenient
Surprised to hear that. It seems we need to figure out a plan: either use verl in a different way, or switch to a different framework. Will look into AReaL.
After switching to sglang, training became faster and more stable. If I have time, I will try to open a PR with an sglang version soon.