CUDA error: an illegal memory access was encountered
Environment: agentlightning 0.1.2, flash_attn 2.8.3, torch 2.7.0, torchaudio 2.7.0, torchdata 0.11.0. I also tested the newest version and it still does not work. The memory error usually happens after collecting the last rollout of a training step. I have researched this problem for quite a long time. In comparison, when I use fewer MCPs with a much simpler task, the program works fine. I think the problem may be related to the long duration of the vLLM rollout and the vLLM shutdown process, or to a memory leak in the rollout management. I also fixed a bug of "KeyError: rollout-xxx"; my workaround is simply to skip that rollout ID.
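For reference, the workaround is essentially a guarded lookup; a minimal sketch (the store and function names here are hypothetical, not the actual agentlightning internals):

```python
def get_rollout_or_skip(rollout_store: dict, rollout_id: str):
    """Hypothetical illustration of the workaround: skip IDs that were never stored.

    `rollout_store` stands in for wherever agentlightning keeps completed
    rollouts; the real code path differs.
    """
    try:
        return rollout_store[rollout_id]
    except KeyError:
        # Instead of crashing the step with "KeyError: rollout-xxx",
        # log the missing ID and let the caller drop it from the batch.
        print(f"Rollout {rollout_id} not found; skipping it.")
        return None
```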
I would appreciate your help, since you have a better understanding of the framework and more CUDA experience.
Below is the log of the CUDA memory error. Subsequent errors are omitted.
```
(TaskRunner pid=2349265) 2025-12-03 02:37:11,023 [INFO] (Process-2349265 agentlightning.server) Rollout received and stored: rollout-f168ae22-0c8c-4e96-8b33-d95450cf5cea
(TaskRunner pid=2349265) INFO:2025-12-03 02:37:18,359:127.0.0.1 - - [03/Dec/2025 02:37:18] "POST /v1/chat/completions HTTP/1.1" 200 -
(TaskRunner pid=2349265) INFO:2025-12-03 02:37:22,851:127.0.0.1 - - [03/Dec/2025 02:37:22] "POST /v1/chat/completions HTTP/1.1" 200 -
(TaskRunner pid=2349265) INFO:2025-12-03 02:37:22,854:127.0.0.1 - - [03/Dec/2025 02:37:22] "POST /v1/chat/completions HTTP/1.1" 200 -
(WorkerDict pid=2350100) [rank0]:[E1203 02:37:25.983063040 ProcessGroupNCCL.cpp:1896] [PG ID 6 PG GUID 15 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
(WorkerDict pid=2350100) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(WorkerDict pid=2350100) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(WorkerDict pid=2350100) Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
```
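As the log itself suggests, the faulting op can be pinned down by forcing synchronous kernel launches; a minimal sketch for the training entry point (setting the variable in the shell before launch works just as well):

```python
import os

# Must be set before the first CUDA call: kernel launches become synchronous,
# so the illegal access is reported at the offending op rather than at a later API call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # noqa: E402  imported only after the env var is in place
```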
This is not related to the config.
It's a known issue and I think there's a candidate solution proposed by community folks. I'll submit a PR and you can test whether that works.
@ultmaster This issue has troubled my team quite a lot. I searched for memory-related fixes among the pull requests but did not find the solution you mentioned. Could you link that solution to this issue?
@ultmaster Thank you for your kind reply!
Hold on a sec. I'm still looking for the solution in the chat and developing the PR.
I've taken a look, and it's not as easy as I originally thought.
First part: the related vLLM issue is https://github.com/vllm-project/vllm/issues/17103
This `sleep()` is called within verl, and I found that in v0.6.1, `reset_prefix_cache()` is already called before it.
So that is probably not the real problem, although I can't confirm for sure that verl actually goes through this colocated path.
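For context, here is my understanding of the intended call order in the colocated path, sketched against vLLM's public sleep-mode API (verl's actual wrapper objects differ; treat the names and the model as illustrative):

```python
from vllm import LLM

# Sketch only: verl wraps these calls, but the sequence should be equivalent.
llm = LLM(model="facebook/opt-125m", enable_sleep_mode=True)  # example model

# ... run the rollout / generation phase ...

llm.reset_prefix_cache()  # verl v0.6.1 already does this before sleeping
llm.sleep(level=1)        # release KV cache (and offload weights) for the training step

# ... run the training step on the freed GPU memory ...

llm.wake_up()             # restore the engine for the next rollout
```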
Second part: the community folks (from Tencent) claim that before that sleep, we need to actually abort the in-flight requests (marking them FINISHED_ABORTED). After that, we can proceed to `reset_prefix_cache` and `sleep` to release the memory. The code snippet needs to be inserted into vLLM. I think they are suggesting a change like this:
```python
from typing import cast

from vllm.v1.core.sched.scheduler import Scheduler

# Collect every queued and running request and abort it before sleeping.
scheduler = cast(Scheduler, self.scheduler)
requests_to_abort: list[str] = []
if scheduler.has_unfinished_requests():
    for request in scheduler.waiting:
        requests_to_abort.append(request.request_id)  # is this usage correct???
    for request in scheduler.running:
        requests_to_abort.append(request.request_id)
    self.abort_requests(requests_to_abort)
```
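If I understand the rationale correctly, `sleep()` releases the KV-cache blocks while requests still sitting in the waiting/running queues keep references to them, so anything that touches those blocks afterwards can trigger exactly this kind of illegal memory access; aborting the requests first (they end up FINISHED_ABORTED) should make the release safe.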
I'm not very sure about this, and I really need your feedback before we can do anything. The issue is not easily reproducible on our side.
Thank you! The proposed solution works fine. The bug comes from vLLM, which is unexpected, since we considered it a mature framework.