
Task was killed due to the node running low on memory

Open ChristineSeven opened this issue 1 year ago • 1 comments

In my case, I deployed the vLLM service on 2 A800 GPUs, but when I send multiple requests I hit a Ray OOM error. Could you please help check this problem? My model is a fine-tuned Llama-70B, my transformers version is 4.34.1, my CUDA version is 11.8 (V11.8.89), and my vLLM version is 0.2.1.post1.
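For context, here is a minimal sketch of how such a deployment is typically constructed with vLLM 0.2.x; the model path is a placeholder and the actual code in llama_vllm_common_service.py may differ.

```python
# Sketch of an AsyncLLMEngine setup for a 70B model split across two A800s.
# The model path is a placeholder; the other values mirror vLLM 0.2.x defaults.
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

engine_args = AsyncEngineArgs(
    model="/path/to/fine-tuned-llama-70b",  # placeholder path
    tensor_parallel_size=2,                 # shard across the two A800s via Ray
    gpu_memory_utilization=0.90,            # fraction of GPU memory for weights + KV cache
    swap_space=4,                           # GiB of pinned CPU memory per GPU for swapped-out KV blocks
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```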

When I was handling multiple requests, the following error appeared.

ERROR: Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 28, in _raise_exception_on_finish
    task.result()
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 351, in run_engine_loop
    has_requests_in_progress = await self.engine_step()
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 330, in engine_step
    request_outputs = await self.engine.step_async()
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 191, in step_async
    output = await self._run_workers_async(
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 220, in _run_workers_async
    all_outputs = await asyncio.gather(*all_outputs)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 695, in _wrap_awaitable
    return (yield from awaitable.__await__())
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.

Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. Set max_restarts and max_task_retries to enable retry when the task crashes due to OOM. To adjust the kill threshold, set the environment variable RAY_memory_usage_threshold when starting Ray. To disable worker killing, set the environment variable RAY_memory_monitor_refresh_ms to zero.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 435, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.8/dist-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/fastapi/applications.py", line 284, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/usr/local/lib/python3.8/dist-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File "/usr/local/lib/python3.8/dist-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 69, in app
    await response(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/responses.py", line 277, in __call__
    await wrap(partial(self.listen_for_disconnect, receive))
  File "/usr/local/lib/python3.8/dist-packages/anyio/_backends/_asyncio.py", line 597, in __aexit__
    raise exceptions[0]
  File "/usr/local/lib/python3.8/dist-packages/starlette/responses.py", line 273, in wrap
    await func()
  File "/usr/local/lib/python3.8/dist-packages/starlette/responses.py", line 262, in stream_response
    async for chunk in self.body_iterator:
  File "llama_vllm_common_service.py", line 400, in stream_results
    async for request_output in results_generator:
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 436, in generate
    raise e
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 430, in generate
    async for request_output in stream:
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 70, in __anext__
    raise result
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 37, in _raise_exception_on_finish
    raise exc
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 32, in _raise_exception_on_finish
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
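For reference, the Ray message above points at two environment variables that control the memory monitor. Below is a minimal sketch of setting them, assuming vLLM starts its local Ray instance from the same process (so the raylet inherits the variables); the values are illustrative only, not recommendations.

```python
# Relax or disable Ray's memory monitor; must run before Ray is initialized.
import os

os.environ["RAY_memory_usage_threshold"] = "0.98"  # kill workers only above 98% node memory
os.environ["RAY_memory_monitor_refresh_ms"] = "0"  # 0 disables OOM worker killing entirely

import ray
ray.init()  # the variables take effect only if set before Ray starts
```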

ChristineSeven, Nov 24 '23 01:11

I met the same error. @WoosukKwon @simon-mo can you give us some help? Should we decrease the swap-space in vLLM, or change RAY_memory_usage_threshold or RAY_memory_monitor_refresh_ms in Ray?
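To make the first option concrete, here is a minimal sketch of lowering vLLM's swap space, assuming the service builds its engine from AsyncEngineArgs; the model path is a placeholder. swap_space is the pinned CPU memory per GPU (in GiB) reserved for preempted KV-cache blocks, so lowering it reduces host-memory pressure at the cost of less room for swapped-out sequences.

```python
# Sketch: shrink the CPU swap allocation that each GPU worker pins at startup.
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(
        model="/path/to/fine-tuned-llama-70b",  # placeholder path
        tensor_parallel_size=2,
        swap_space=1,  # GiB of pinned CPU memory per GPU; the default is 4
    )
)
```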

jessiewiswjc, Jan 03 '24 09:01

I am also getting the same error. Has anyone resolved this issue?

tacacs1101-debug, Feb 07 '24 07:02

Hey, any updates on this issue?

kalkite, Mar 04 '24 16:03