I think compute_response_mask should not be called in trainer.py; otherwise response_mask may not be contiguous. Should it be moved into deamon.py instead?
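For what it's worth, a minimal sketch of the contiguity concern, with hypothetical code (the helper body and the 1024-token response length are illustrative, not necessarily verl's actual implementation): slicing the response columns out of the full attention mask yields a non-contiguous view, so it should be made contiguous before it goes into the batch.

```python
import torch

# Hypothetical sketch: slicing the last columns of the attention mask returns a
# non-contiguous view, so make it contiguous before storing it in the batch and
# moving it across devices.
def compute_response_mask(attention_mask: torch.Tensor, response_length: int) -> torch.Tensor:
    response_mask = attention_mask[:, -response_length:]  # non-contiguous view
    return response_mask.contiguous()

attention_mask = torch.ones(56, 16384, dtype=torch.int64)
response_mask = compute_response_mask(attention_mask, response_length=1024)
assert response_mask.is_contiguous()
```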
This is the batch data. I have crashed many times in batch.to(device); maybe because of FSDP?

(TaskRunner pid=73703)
DataProto(batch=TensorDict(
    fields={
        attention_mask: Tensor(shape=torch.Size([56, 16384]), device=cpu, dtype=torch.int64, is_shared=False),
        input_ids: Tensor(shape=torch.Size([56, 16384]), device=cpu, dtype=torch.int64, is_shared=False),
        is_drop_mask: Tensor(shape=torch.Size([56]), device=cpu, dtype=torch.bool, is_shared=False),
        position_ids: Tensor(shape=torch.Size([56, 16384]), device=cpu, dtype=torch.int64, is_shared=False),
        prompts: Tensor(shape=torch.Size([56, 15360]), device=cpu, dtype=torch.int64, is_shared=False),
        response_mask: Tensor(shape=torch.Size([56, 1024]), device=cpu, dtype=torch.int64, is_shared=False),
        responses: Tensor(shape=torch.Size([56, 1024]), device=cpu, dtype=torch.int64, is_shared=False),
        token_level_scores: Tensor(shape=torch.Size([56, 1024]), device=cpu, dtype=torch.bfloat16, is_shared=False)},
    batch_size=torch.Size([56]),
    device=None,

Error MSG (WorkerDict pid=74817): ...

Exception in thread Thread-3 (_loop_forever):
Traceback (most recent call last):
  File "/opt/conda/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.12/threading.py", line 1012, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.12/site-packages/verl/workers/rollout/vllm_rollout/vllm_rollout_spmd.py", line 453, in _loop_forever
    result = self.execute_method(method, *args, **kwargs)
  File "/opt/conda/lib/python3.12/site-packages/verl/workers/rollout/vllm_rollout/vllm_rollout_spmd.py", line 501, in execute_method
    return self.inference_engine.execute_method(method, *args, **kwargs)
  File "/opt/conda/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 628, in execute_method
    raise e
  File "/opt/conda/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 619, in execute_method
    return run_method(self, method, args, kwargs)
  File "/opt/conda/lib/python3.12/site-packages/vllm/utils/__init__.py", line 3060, in run_method
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 436, in execute_model
    output = self.model_runner.execute_model(scheduler_output,
  File "/opt/conda/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2129, in execute_model
    ) = self._bookkeeping_sync(scheduler_output, sampler_output,
  File "/opt/conda/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1929, in _bookkeeping_sync
    valid_sampled_token_ids = self._to_list(sampled_token_ids)
  File "/opt/conda/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3742, in _to_list
    self.transfer_event.synchronize()
  File "/opt/conda/lib/python3.12/site-packages/torch/cuda/streams.py", line 231, in synchronize
    super().synchronize()
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
Still can't find the real reason.
The only clue I found with the Ray debugger is this. I don't think batch.to(device) itself causes a CUDA OOM; the computation has not even started at that point.
Not this reason
I confirmed the problem is non_blocking=True. I changed this line in the tensordict package's base.py:
storage_cast = storage.to(device, non_blocking=False)
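For reference, a minimal sketch of the same blocking-copy idea applied at the call site instead of editing site-packages (the TensorDict contents are illustrative, with shapes taken from the dump above):

```python
import torch
from tensordict import TensorDict

# Illustrative TensorDict mirroring two fields from the dump above.
td = TensorDict(
    {
        "input_ids": torch.randint(0, 100, (56, 16384)),
        "response_mask": torch.ones(56, 1024, dtype=torch.int64),
    },
    batch_size=[56],
)

device = torch.device("cuda:0")
# Force a synchronous host-to-device copy (the same effect as the base.py edit),
# then synchronize so any transfer error surfaces here rather than at a later kernel.
td = td.to(device, non_blocking=False)
torch.cuda.synchronize(device)
```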
Still not the reason; maybe I need to try the sglang backend.
What's your environment, and in what step does it go wrong? (training/validation/first step/after a few steps)
Do you have multiple GPUs and multiple nodes?
One node, 8×H20.
- torch: 2.8.0+cu128
- verl: v0.5.0
- vllm: v0.10.2
- flash-attn: v2.8.3

Usually it randomly fails at a training step when DataProto moves the tensordict from CPU to GPU, which in verl is the line self.batch.to(device). I have also seen it fail in load_fsdp_model_to_gpu(self.actor_module_fsdp), but I think the real cause is these lines in vLLM:
File "/opt/conda/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1929, in _bookkeeping_sync
valid_sampled_token_ids = self._to_list(sampled_token_ids)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3742, in _to_list
self.transfer_event.synchronize()
File "/opt/conda/lib/python3.12/site-packages/torch/cuda/streams.py", line 231, in synchronize
super().synchronize()
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
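Since the message above recommends CUDA_LAUNCH_BLOCKING=1, one way to apply it to every Ray worker before CUDA initializes is through the Ray runtime_env; this is only a sketch, not verl-specific configuration:

```python
import ray

# With synchronous kernel launches the illegal-memory-access error is reported at
# the offending call instead of at a later synchronization point (much slower; debug only).
ray.init(
    runtime_env={
        "env_vars": {
            "CUDA_LAUNCH_BLOCKING": "1",
        }
    }
)
```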
Could these be occasional failures caused by unsafe CUDA memory operations? I am not sure.
In the near future I plan to try to implement an agentlightning version with sglang as the backend to see if this problem still occurs.
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
This looks like a GPU OOM error to me.
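To check whether memory pressure is actually involved, a small helper can log allocator stats around the rollout/training boundary; a sketch using standard torch APIs, with placement left up to the trainer:

```python
import torch

def log_gpu_memory(tag: str) -> None:
    # Numbers close to the device capacity point to memory pressure; an illegal
    # memory access with plenty of headroom points elsewhere (bad kernel, race, stale pointer).
    allocated = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    peak = torch.cuda.max_memory_allocated() / 2**30
    print(f"[{tag}] allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB, peak={peak:.2f} GiB")
```

Calling it right before self.batch.to(device) and right after rollout would show whether the H20s are close to full when the crash happens.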
In the near future I plan to try to implement an agentlightning version with sglang as the backend to see if this problem still occurs.
Good luck with that! You're welcome to contribute back if you make some progress. 👍
verl dropped the chat_completion design in its latest version, which is very inconvenient. I think it's time to switch to another RL backend like AReaL.
verl dropped the chat_completion design in its latest version, which is very inconvenient
Surprised to hear that. It seems we need to figure out a plan: either use verl in a different way, or switch to a different framework. Will look into AReaL.
After switching to sglang, training became faster and more stable. If I have time, I will try to open a PR with an sglang version soon.