[Help] Weird Error Messages in vLLM Async Rollout

Open mertunsall opened this issue 7 months ago • 5 comments

Hi,

I am using the vLLM async rollout feature and I am getting error messages like this:

(TaskRunner pid=1356207) Exception ignored in: <function _ConnectionBase.__del__ at 0x7fa889565630>
(TaskRunner pid=1356207) Traceback (most recent call last):
(TaskRunner pid=1356207)   File "/home/mertunsal/miniconda3/envs/verl/lib/python3.10/multiprocessing/connection.py", line 132, in __del__
(TaskRunner pid=1356207)     self._close()
(TaskRunner pid=1356207)   File "/home/mertunsal/miniconda3/envs/verl/lib/python3.10/multiprocessing/connection.py", line 361, in _close
(TaskRunner pid=1356207)     _close(self._handle)
(TaskRunner pid=1356207)   File "/home/mertunsal/miniconda3/envs/verl/lib/python3.10/site-packages/ray/_private/worker.py", line 940, in sigterm_handler
(TaskRunner pid=1356207)     raise_sys_exit_with_custom_error_message(
(TaskRunner pid=1356207)   File "python/ray/_raylet.pyx", line 837, in ray._raylet.raise_sys_exit_with_custom_error_message
(TaskRunner pid=1356207) SystemExit: 1

This doesn't block my code, and my run seems to work as intended, but I would like to understand what it is. I suspect it comes from how I query a server to get feedback for the second turn, but I cannot pinpoint the issue because the traceback doesn't show where it originates. Does anyone have advice on how to debug this?
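
For context on how a message of this shape can arise: the traceback shows a SIGTERM handler raising `SystemExit` while a `__del__` method was running, and Python reports any exception that escapes `__del__` as "Exception ignored in" instead of propagating it. The sketch below reproduces that pattern in plain Python, without Ray or vLLM. `FakeConnection`, `sigterm_handler`, and `reproduce` are hypothetical names used only for illustration, and whether this is what actually happens in a verl run is an assumption that this thread does not confirm.

```python
import signal
import sys


def sigterm_handler(signum, frame):
    # Hypothetical stand-in for a handler that converts SIGTERM into SystemExit.
    raise SystemExit(1)


class FakeConnection:
    # Hypothetical object whose finalizer is interrupted by a signal.
    def __del__(self):
        # Deliver SIGTERM to this process while the finalizer is running.
        signal.raise_signal(signal.SIGTERM)
        # Give the interpreter a chance to run the pending handler here.
        for _ in range(1000):
            pass


def reproduce():
    """Return the exception types that Python reported as unraisable."""
    captured = []
    previous_handler = signal.signal(signal.SIGTERM, sigterm_handler)
    previous_hook = sys.unraisablehook
    # Record the exception type instead of printing the default message.
    sys.unraisablehook = lambda args: captured.append(args.exc_type)
    try:
        conn = FakeConnection()
        del conn  # Refcount drops to zero, so __del__ runs immediately.
    finally:
        sys.unraisablehook = previous_hook
        signal.signal(signal.SIGTERM, previous_handler)
    return captured


if __name__ == "__main__":
    print(reproduce())
```

With the default `sys.unraisablehook` left in place, Python would instead write an "Exception ignored in: <function FakeConnection.__del__ ...>" message with a traceback to stderr and keep running, which would be consistent with the run not being blocked.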

Thanks!!

mertunsall avatar May 24 '25 21:05 mertunsall

Is this the same issue as https://github.com/volcengine/verl/issues/1642 ?

chenhaiq avatar Jun 06 '25 07:06 chenhaiq

Same issue here. Everything is fine with a small training dataset, but this error occurs when I use more data.

yzl2343 avatar Jul 25 '25 09:07 yzl2343

Same issue here.

linengcs avatar Aug 17 '25 09:08 linengcs

Same issue here. Everything is fine with a small training dataset, but this error occurs when I use more data.

Terry9a avatar Nov 06 '25 11:11 Terry9a

Same here. It occurs when I use a larger dataset than the one I used previously.

cehao628 avatar Nov 15 '25 14:11 cehao628