
CUDA error: an illegal memory access was encountered

Open chenhr18thu opened this issue 1 month ago • 6 comments

Environment:

- agentlightning 0.1.2
- flash_attn 2.8.3
- torch 2.7.0
- torchaudio 2.7.0
- torchdata 0.11.0

I also tested the newest version and it still does not work. The memory error usually happens right after the last rollout of a training step is collected. I have been investigating this for quite a while. By comparison, when I use fewer MCPs with a much simpler task, the program runs fine. I suspect the problem is related to the long duration of the vLLM rollout and the vLLM shutdown process, or to a memory leak in the rollout management. Separately, I worked around a "KeyError: rollout-xxx" bug by simply skipping the missing rollout ID.
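For reference, the KeyError workaround is essentially the following (a minimal sketch with made-up names, not agentlightning's actual internals): skip any rollout ID that is missing from the store instead of letting the training step crash.

    # Hypothetical sketch of the "KeyError: rollout-xxx" workaround.
    # rollout_store is assumed to be a dict keyed by rollout ID.
    def collect_rollouts(rollout_ids, rollout_store):
        collected = []
        for rollout_id in rollout_ids:
            try:
                collected.append(rollout_store[rollout_id])
            except KeyError:
                # e.g. "KeyError: rollout-xxx" -- drop it and keep going
                print(f"Warning: {rollout_id} missing from store, skipping")
        return collected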

Any help would be appreciated, given your deeper knowledge of the framework and of CUDA.

Below is the log around the CUDA memory error; subsequent errors are omitted.

    (TaskRunner pid=2349265) 2025-12-03 02:37:11,023 [INFO] (Process-2349265 agentlightning.server) Rollout received and stored: rollout-f168ae22-0c8c-4e96-8b33-d95450cf5cea
    (TaskRunner pid=2349265) INFO:2025-12-03 02:37:18,359:127.0.0.1 - - [03/Dec/2025 02:37:18] "POST /v1/chat/completions HTTP/1.1" 200 -
    (TaskRunner pid=2349265) INFO:2025-12-03 02:37:22,851:127.0.0.1 - - [03/Dec/2025 02:37:22] "POST /v1/chat/completions HTTP/1.1" 200 -
    (TaskRunner pid=2349265) INFO:2025-12-03 02:37:22,854:127.0.0.1 - - [03/Dec/2025 02:37:22] "POST /v1/chat/completions HTTP/1.1" 200 -
    (WorkerDict pid=2350100) [rank0]:[E1203 02:37:25.983063040 ProcessGroupNCCL.cpp:1896] [PG ID 6 PG GUID 15 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
    (WorkerDict pid=2350100) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
    (WorkerDict pid=2350100) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
    (WorkerDict pid=2350100) Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

chenhr18thu avatar Dec 03 '25 03:12 chenhr18thu

This does not seem to be related to my config.

chenhr18thu avatar Dec 03 '25 03:12 chenhr18thu

It's a known issue and I think there's a candidate solution proposed by community folks. I'll submit a PR and you can test whether that works.

ultmaster avatar Dec 03 '25 07:12 ultmaster

@ultmaster This issue has troubled my team quite a lot. I searched for memory-related fixes among the pull requests but could not find the solution you mentioned. Could you link it to this issue?

chenhr18thu avatar Dec 03 '25 08:12 chenhr18thu

@ultmaster Thank you for your kind reply!

chenhr18thu avatar Dec 03 '25 08:12 chenhr18thu

Hold on a sec. I'm still digging the solution out of the chat and working on the PR.

ultmaster avatar Dec 03 '25 08:12 ultmaster

I've taken a look, and it's not as easy as I originally thought.

First part: The related issue is: https://github.com/vllm-project/vllm/issues/17103

This sleep() is called within verl, and I found that in v0.6.1, reset_prefix_cache() is already called:

[Image: verl v0.6.1 source showing the reset_prefix_cache() call]

So that's not the real problem, although I couldn't confirm for sure that verl actually goes through this colocated path.
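For context, the sequence in question looks roughly like the following when expressed against vLLM's public LLM API (a simplified sketch of the colocated offload path, not verl's actual code; the model name is just an example):

    # Simplified sketch of the colocated offload sequence (not verl's exact code).
    # sleep()/wake_up()/reset_prefix_cache() are available on vLLM's LLM object
    # when it is constructed with enable_sleep_mode=True.
    from vllm import LLM

    llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enable_sleep_mode=True)

    # ... rollout / generation happens here ...

    llm.reset_prefix_cache()  # drop cached KV blocks before releasing memory
    llm.sleep(level=1)        # offload weights and free the KV cache

    # ... training step runs on the freed GPU memory ...

    llm.wake_up()             # reload weights for the next rollout phase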

Second part: the community folks (from Tencent) claim that before that sleep, we need to actually abort the in-flight requests (marking them FINISHED_ABORTED). After that, we can proceed to reset_prefix_cache and sleep to release the memory. The code snippet would have to go into vLLM itself. I think they are suggesting a change like this:

[Image: proposed code change inside vLLM; the snippet is copied below]

(the code, copied out for convenience:)

        # Collect the IDs of all in-flight requests and abort them so their
        # KV blocks are released before the cache reset / sleep.
        from typing import cast

        from vllm.v1.core.sched.scheduler import Scheduler

        scheduler = cast(Scheduler, self.scheduler)

        requests_to_abort: list[str] = []
        if scheduler.has_unfinished_requests():
            for request in scheduler.waiting:
                requests_to_abort.append(request.request_id)  # is this usage correct???
            for request in scheduler.running:
                requests_to_abort.append(request.request_id)

            # Aborted requests end up in the FINISHED_ABORTED state.
            self.abort_requests(requests_to_abort)
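Put differently, the overall ordering they seem to be proposing is roughly the following (just a sketch of the intended sequence with a hypothetical helper against a vLLM-like engine object, not a tested patch):

    # Hypothetical helper illustrating the proposed ordering (not vLLM API):
    # abort in-flight requests first, then reset the prefix cache, then sleep.
    def drain_and_sleep(engine_core):
        scheduler = engine_core.scheduler
        if scheduler.has_unfinished_requests():
            request_ids = [r.request_id for r in scheduler.waiting]
            request_ids += [r.request_id for r in scheduler.running]
            engine_core.abort_requests(request_ids)  # requests become FINISHED_ABORTED
        engine_core.reset_prefix_cache()  # drop cached KV blocks
        engine_core.sleep(level=1)        # only now release/offload GPU memory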

I'm not very sure about this, and I really need your feedback before we can do anything. The issue is not easily reproducible on our side.

ultmaster avatar Dec 03 '25 09:12 ultmaster

Thank you! The proposed solution works fine. The bug is in vLLM, which is unexpected since we had considered it a mature framework.

chenhr18thu avatar Dec 04 '25 08:12 chenhr18thu