[Help] Weird Error Messages in vLLM Async Rollout
Hi,
I am using the vLLM async rollout feature and I am getting error messages like this:
(TaskRunner pid=1356207) Exception ignored in: <function _ConnectionBase.__del__ at 0x7fa889565630>
(TaskRunner pid=1356207) Traceback (most recent call last):
(TaskRunner pid=1356207) File "/home/mertunsal/miniconda3/envs/verl/lib/python3.10/multiprocessing/connection.py", line 132, in __del__
(TaskRunner pid=1356207) self._close()
(TaskRunner pid=1356207) File "/home/mertunsal/miniconda3/envs/verl/lib/python3.10/multiprocessing/connection.py", line 361, in _close
(TaskRunner pid=1356207) _close(self._handle)
(TaskRunner pid=1356207) File "/home/mertunsal/miniconda3/envs/verl/lib/python3.10/site-packages/ray/_private/worker.py", line 940, in sigterm_handler
(TaskRunner pid=1356207) raise_sys_exit_with_custom_error_message(
(TaskRunner pid=1356207) File "python/ray/_raylet.pyx", line 837, in ray._raylet.raise_sys_exit_with_custom_error_message
(TaskRunner pid=1356207) SystemExit: 1
This doesn't block my code, and my run seems to work as intended, but I would like to understand what it is. I suspect it comes from how I query a server to get feedback for the second turn, but I cannot pinpoint the issue because the traceback doesn't show where the error originates. Does anyone have advice on how to debug this?
Thanks!!
Is this the same as https://github.com/volcengine/verl/issues/1642 ?
Same issue here. Everything is fine with a small training dataset, but this error occurs when I use more data.
Same issue here.
Same. It occurs when I use a larger dataset than I did before.
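For anyone trying to read the traceback in the original post: "Exception ignored in: <function ... __del__ ...>" is what Python prints when an exception is raised while a finalizer is running, and the frames show Ray's `sigterm_handler` raising `SystemExit` while `_ConnectionBase.__del__` was closing a handle. Python reports such exceptions through `sys.unraisablehook` instead of propagating them, which fits the run continuing normally. Below is a minimal sketch of that mechanism, not a diagnosis of why the process receives SIGTERM. `NoisyResource`, `log_unraisable`, and `demo` are hypothetical names, and raising `SystemExit` directly inside `__del__` is only a stand-in for a signal handler firing during a finalizer.

```python
import sys


class NoisyResource:
    """Hypothetical stand-in for an object whose finalizer gets interrupted."""

    def __del__(self):
        # Assumption: this imitates a signal handler raising SystemExit
        # while the finalizer is running.
        raise SystemExit(1)


def demo():
    captured = []

    def log_unraisable(unraisable):
        # Keep only the exception type and a repr of the reported object;
        # holding on to the live objects could keep them alive unintentionally.
        captured.append((unraisable.exc_type, repr(unraisable.object)))

    previous_hook = sys.unraisablehook
    sys.unraisablehook = log_unraisable
    try:
        resource = NoisyResource()
        del resource  # in CPython the finalizer runs here
    finally:
        sys.unraisablehook = previous_hook
    return captured


if __name__ == "__main__":
    for exc_type, source in demo():
        print(f"ignored {exc_type.__name__} raised in {source}")
```

Installing a custom `sys.unraisablehook` in the affected process (for example, one that also logs the current stack with the `traceback` module) may help show what the process was doing when the signal arrived.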