
Aborted request without reason

Open erjieyong opened this issue 1 year ago • 38 comments

Hi, I am trying to load test vLLM on a single GPU with 20 concurrent requests. Each request passes through the LLM engine twice: once to change the prompt, and once to generate the output.

However, typically only 6 - 10 requests finish successfully. The remaining requests simply show up as "Aborted request" with no error log or traceback. GPU KV cache usage can reach up to 95% for the sequences that do finish successfully.

INFO 01-17 03:11:25 async_llm_engine.py:134] Aborted request 0e89f1d2d94c4a039f868222c100cc8a.
INFO 01-17 03:11:25 async_llm_engine.py:134] Aborted request be67046b843244b5bf1ed3d2ff2f5a02.
INFO 01-17 03:11:25 async_llm_engine.py:134] Aborted request b532ed57647945869a4cae499fe54f23.
INFO 01-17 03:11:25 async_llm_engine.py:134] Aborted request 6c56897bbc9d4a808b8e056c39baf91b.
INFO 01-17 03:11:25 async_llm_engine.py:134] Aborted request 75b645c69d7449509f68ca23b34f1048.
INFO 01-17 03:11:25 async_llm_engine.py:134] Aborted request eb87d6473a9d4b3699ca0cc236900248.
INFO 01-17 03:11:25 async_llm_engine.py:134] Aborted request ca15a251849c45329825ca95a2fce96b.
INFO 01-17 03:11:25 async_llm_engine.py:134] Aborted request c42bbea2f781469e89e576f98e618243.
INFO 01-17 03:11:25 async_llm_engine.py:134] Aborted request 14b94d4ffd6646d69d4c2ad36d7dfd50.
INFO 01-17 03:11:25 async_llm_engine.py:134] Aborted request 83c7dd9cbe9d4f6481b26403f46f1731.
INFO 01-17 03:11:25 async_llm_engine.py:134] Aborted request 3e98245d88534c53be230aa25c56d99a.
INFO 01-17 03:11:26 async_llm_engine.py:111] Finished request 49b84ef96af44f069056b2fc43526cdd.

I have been trying to find out what causes these aborts, without success. Any help is appreciated. Thanks!

My code to load the engine is as follows:

from vllm import AsyncEngineArgs, AsyncLLMEngine

# LLM_BASEMODEL holds the path / Hugging Face ID of the GPTQ-quantized model
engine_args = AsyncEngineArgs(
    model=LLM_BASEMODEL,
    tokenizer=LLM_BASEMODEL,
    tokenizer_mode="auto",
    trust_remote_code=True,
    download_dir=None,
    load_format="auto",
    dtype="auto",
    seed=0,
    max_model_len=None,
    worker_use_ray=False,
    pipeline_parallel_size=1,
    tensor_parallel_size=1,
    max_parallel_loading_workers=None,
    block_size=16,
    swap_space=4,
    gpu_memory_utilization=0.9,
    max_num_batched_tokens=None,
    max_num_seqs=256,
    max_paddings=256,
    disable_log_stats=False,
    revision=None,
    tokenizer_revision=None,
    quantization="gptq",
    enforce_eager=False,
    max_context_len_to_capture=8192,
    engine_use_ray=False,
    disable_log_requests=False,
    max_log_len=None,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)

erjieyong avatar Jan 18 '24 08:01 erjieyong
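For context, a minimal sketch of the two-pass flow described above (first changing the prompt, then generating the answer) against AsyncLLMEngine might look roughly like this; the prompt wording and function names are illustrative, not taken from the original report, and `engine` is the instance built from the args shown:

import uuid

from vllm import SamplingParams

async def handle_request(user_prompt: str) -> str:
    sampling_params = SamplingParams(temperature=0.7, max_tokens=512)

    # Pass 1: ask the model to change/rewrite the prompt.
    rewrite_id = uuid.uuid4().hex
    rewritten = ""
    async for output in engine.generate(
        f"Rewrite the following request:\n{user_prompt}", sampling_params, rewrite_id
    ):
        rewritten = output.outputs[0].text  # each yield carries the full text so far

    # Pass 2: generate the final answer from the rewritten prompt.
    answer_id = uuid.uuid4().hex
    answer = ""
    async for output in engine.generate(rewritten, sampling_params, answer_id):
        answer = output.outputs[0].text
    return answer

Running 20 of these concurrently (e.g. via asyncio.gather) reproduces the load pattern the reporter describes: 40 engine requests in flight for 20 client requests.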

mark

ARayOfsunshine avatar Jan 20 '24 04:01 ARayOfsunshine

+1. And the HTTP status code is 200. It's weird.

HermitSun avatar Jan 30 '24 11:01 HermitSun

+1

CoLdIcE42 avatar Jan 30 '24 15:01 CoLdIcE42

In my case, I've determined that the issue is likely not caused by vLLM itself. My reasoning: when I call my backend API using the server IP address directly, no requests are aborted, whereas calling the same backend API through my domain reproduces the aborted-request issue.

My suspicion is that the nginx configuration is interfering, but I have yet to find a fix.

erjieyong avatar Feb 06 '24 08:02 erjieyong
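One way to confirm this kind of proxy suspicion is to fire the same concurrent batch at both the direct IP endpoint and the domain endpoint and compare failures. A sketch with httpx; the URLs, model name and payload are placeholders, not from the original comment:

import asyncio
import httpx

# Hypothetical endpoints: one direct to the server IP, one through the nginx domain.
ENDPOINTS = ["http://10.0.0.5:8000/v1/completions", "https://api.example.com/v1/completions"]

async def fire(client: httpx.AsyncClient, url: str, i: int) -> str:
    payload = {"model": "my-model", "prompt": f"Test prompt {i}", "max_tokens": 64}
    try:
        resp = await client.post(url, json=payload, timeout=120.0)
        return f"{url} #{i}: {resp.status_code}"
    except httpx.HTTPError as exc:  # surfaces ReadError / timeouts from the proxy path
        return f"{url} #{i}: {type(exc).__name__}: {exc}"

async def main() -> None:
    async with httpx.AsyncClient() as client:
        for url in ENDPOINTS:
            results = await asyncio.gather(*(fire(client, url, i) for i in range(20)))
            print("\n".join(results))

asyncio.run(main())

If only the domain endpoint drops requests, the proxy in between is the prime suspect.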

Same here with vLLM v.0.3.0:

INFO 02-25 12:13:05 llm_engine.py:877] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 62.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 5.5%, CPU KV cache usage: 0.0%
INFO 02-25 12:13:10 llm_engine.py:877] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 62.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 5.7%, CPU KV cache usage: 0.0%
INFO 02-25 12:13:11 async_llm_engine.py:133] Aborted request cmpl-4ed3ef5554c44cd98158a194a4439744.
INFO 02-25 12:13:12 async_llm_engine.py:431] Received request cmpl-5ed32b9b9ff645718b4794cfe2cb37b0: prompt

vecorro avatar Feb 26 '24 00:02 vecorro

Same here on vLLM 0.3.3, when sending 4 requests to a Mistral Instruct 7B v0.2 on a single NVIDIA A40. It occurs both with the default swap space and with swap space set to 0. Both vLLM and the client run on the same host, and requests are sent over the loopback IP 127.0.0.1.

ghpu avatar Mar 08 '24 15:03 ghpu

Confirmed - this happens for me as well, with the latest vLLM, this time serving Mixtral 8x7B with AWQ. It doesn't happen for all prompts, and it seems to happen more often when hitting completions rather than chat/completions.

prakashsanker avatar Mar 28 '24 21:03 prakashsanker

Ok so -

  1. This is clearly a problem with the implementation of the inference server rather than with the core vLLM engine. I had a prompt that was consistently failing when hitting the OpenAI-compatible server.
  2. I copied/pasted the same prompt and ran it with offline inference --> the prompt worked.

prakashsanker avatar Apr 02 '24 20:04 prakashsanker

@simon-mo just tagging you here to investigate.

prakashsanker avatar Apr 03 '24 17:04 prakashsanker

cc @njhill

simon-mo avatar Apr 03 '24 18:04 simon-mo

+1, the same here on vLLM 0.3.3, on request number 105 (the number differs from run to run) to "BASE_URL/v1/chat/completions". I also got a 200 OK response and nothing in the vLLM log except: INFO 04-16 12:11:09 async_llm_engine.py:133] Aborted request cmpl-c039d6b5cc084840bab5a439c4e2144c.

I can make the same request again and it works; it usually happens in the middle of a long-running test. The model used is casperhansen/mixtral-instruct-awq.

Any help is much appreciated.

ptsand avatar Apr 16 '24 13:04 ptsand

Is there any indication of an exception in the response?

There are a few places in async_llm_engine.py that call this method; exceptions are placed into the request stream but are not logged by the server.

br3no avatar Apr 16 '24 14:04 br3no
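One way to check this on a local deployment is to wrap the consumption of the engine stream so that any exception placed into it is logged before it propagates. This is only a debugging sketch, not part of vLLM:

import logging

logger = logging.getLogger("vllm.debug")

# Wrap engine.generate() so the server log shows *why* a request was aborted,
# not just that it was aborted.
async def generate_with_logging(engine, prompt, sampling_params, request_id):
    try:
        async for output in engine.generate(prompt, sampling_params, request_id):
            yield output
    except Exception:
        logger.exception("Request %s failed inside the engine stream", request_id)
        raise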

I did some more testing and debugging; I can recreate the error quickly by running multiple prompts asynchronously. The calling code never receives a response body:

2024-04-17 14:11:30,767 [INFO] [_client.py:1729] HTTP Request: POST https://[BASE_URL]/v1/chat/completions "HTTP/1.1 200 OK"
2024-04-17 14:11:30,768 [DEBUG] [_trace.py:85] receive_response_body.started request=<Request [b'POST']>
2024-04-17 14:11:47,340 [DEBUG] [_trace.py:85] receive_response_body.failed exception=ReadError(BrokenResourceError())
2024-04-17 14:11:47,340 [DEBUG] [_trace.py:85] response_closed.started
2024-04-17 14:11:47,341 [DEBUG] [_trace.py:85] response_closed.complete
ConnectionResetError: [Errno 54] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpcore/_exceptions.py", line 10, in map_exceptions
    yield
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpcore/_backends/anyio.py", line 34, in read
    return await self._stream.receive(max_bytes=max_bytes)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/anyio/streams/tls.py", line 196, in receive
    data = await self._call_sslobject_method(self._ssl_object.read, max_bytes)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/anyio/streams/tls.py", line 138, in _call_sslobject_method
    data = await self.transport_stream.receive()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 1212, in receive
    raise self._protocol.exception
anyio.BrokenResourceError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpx/_transports/default.py", line 66, in map_httpcore_exceptions
    yield
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpx/_transports/default.py", line 249, in __aiter__
    async for part in self._httpcore_stream:
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpcore/_async/connection_pool.py", line 361, in __aiter__
    async for part in self._stream:
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpcore/_async/http11.py", line 337, in __aiter__
    raise exc
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpcore/_async/http11.py", line 329, in __aiter__
    async for chunk in self._connection._receive_response_body(**kwargs):
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpcore/_async/http11.py", line 198, in _receive_response_body
    event = await self._receive_event(timeout=timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpcore/_async/http11.py", line 212, in _receive_event
    data = await self._network_stream.read(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpcore/_backends/anyio.py", line 31, in read
    with map_exceptions(exc_map):
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/contextlib.py", line 155, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
    raise to_exc(exc) from exc
httpcore.ReadError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/mistralai/async_client.py", line 125, in _request
    async for line in response.aiter_lines():
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpx/_models.py", line 966, in aiter_lines
    async for text in self.aiter_text():
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpx/_models.py", line 953, in aiter_text
    async for byte_content in self.aiter_bytes():
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpx/_models.py", line 932, in aiter_bytes
    async for raw_bytes in self.aiter_raw():
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpx/_models.py", line 990, in aiter_raw
    async for raw_stream_bytes in self.stream:
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpx/_client.py", line 146, in __aiter__
    async for chunk in self._stream:
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpx/_transports/default.py", line 248, in __aiter__
    with map_httpcore_exceptions():
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/contextlib.py", line 155, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpx/_transports/default.py", line 83, in map_httpcore_exceptions
    raise mapped_exc(message) from exc
httpx.ReadError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/sand/Capacit/seek/seek-api-telia/src/api/controllers/v1/conversation.py", line 147, in __acall__
    async for event in handler_method(request):
  File "/Users/sand/Capacit/seek/seek-api-telia/src/api/controllers/v1/conversation.py", line 235, in create_message
    async for token in stream:
  File "/Users/sand/Capacit/seek/seek-api-telia/src/infrastructure/resolvers/prompt_resolver.py", line 73, in astreamer
    async for message in generator:
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/langchain_core/language_models/chat_models.py", line 305, in astream
    raise e
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/langchain_core/language_models/chat_models.py", line 287, in astream
    async for chunk in self._astream(
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/langchain_mistralai/chat_models.py", line 369, in _astream
    async for chunk in await acompletion_with_retry(
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/mistralai/async_client.py", line 261, in chat_stream
    async for json_response in async_response:
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/mistralai/async_client.py", line 143, in _request
    raise MistralException(f"Unexpected exception ({e.__class__.__name__}): {e}") from e
mistralai.exceptions.MistralException: Unexpected exception (ReadError):

500: Internal Server Error during event handling.

vLLM arguments:

INFO 04-17 12:21:55 api_server.py:228] args: Namespace(host=None, port=8000, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, root_path=None, middleware=[], model='/root/mixtral-instruct-awq', tokenizer=None, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', max_model_len=24000, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.75, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization='awq', enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', engine_use_ray=False, disable_log_requests=False, max_log_len=None)

ptsand avatar Apr 17 '24 12:04 ptsand
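From the client side, a pragmatic mitigation while the root cause is unknown is to retry a streamed request that dies with a ReadError mid-stream, as in the trace above. A sketch with httpx; the URL and payload are placeholders, and this works around, rather than fixes, the underlying abort:

import asyncio
import httpx

async def chat_with_retry(payload: dict, url: str, retries: int = 3) -> str:
    async with httpx.AsyncClient(timeout=120.0) as client:
        for attempt in range(1, retries + 1):
            chunks: list[str] = []
            try:
                async with client.stream("POST", url, json=payload) as resp:
                    async for line in resp.aiter_lines():
                        chunks.append(line)
                return "\n".join(chunks)
            except httpx.ReadError:
                # Connection reset mid-stream; back off and retry.
                if attempt == retries:
                    raise
                await asyncio.sleep(2 ** attempt)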

I also got the same error. This is what I ran:

python3 -u -m vllm.entrypoints.openai.api_server \
       --host 0.0.0.0 \
       --model casperhansen/mixtral-instruct-awq \
       --tensor-parallel-size 4 \
       --enforce-eager \
       --quantization awq \
       --gpu-memory-utilization 0.96

Any solution?

cocoza4 avatar Apr 18 '24 07:04 cocoza4

Having the same issue with more than 100 concurrent requests at a time.

Running the OpenAI-compatible server and continuously hitting it with >100 concurrent requests leads to many requests simply being aborted with no reason given. I have tried setting swap space to 0, to no avail.

davidgxue avatar Apr 19 '24 19:04 davidgxue
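Until the abort reason is logged properly, one client-side workaround is to cap how many requests are in flight at once instead of firing 100+ simultaneously. A sketch with the OpenAI Python client against the vLLM server; the base URL, limit of 32 and model name are assumptions, not values from this thread:

import asyncio
from openai import AsyncOpenAI

# Placeholder endpoint for a local vLLM OpenAI-compatible server.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
semaphore = asyncio.Semaphore(32)  # at most 32 concurrent requests

async def limited_completion(prompt: str) -> str:
    async with semaphore:
        resp = await client.completions.create(
            model="casperhansen/mixtral-instruct-awq", prompt=prompt, max_tokens=128
        )
        return resp.choices[0].text

async def main() -> None:
    prompts = [f"Question {i}: ..." for i in range(200)]
    answers = await asyncio.gather(*(limited_completion(p) for p in prompts))
    print(len(answers), "completed")

asyncio.run(main())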

No explanation? Could it be a lack of GPU memory? Or is it because I increased --max-model-len to 32768?

ArnaudHureaux avatar Apr 23 '24 15:04 ArnaudHureaux

It seems to get much worse as the load increases. I don't think it's GPU memory, since I already had to reduce max-model-len to avoid out-of-memory errors running on 2 NVIDIA A10 GPUs (24 GB each).

It runs very slowly (unusably so, I would say) under my load tests, in which I simulate up to 40 humans prompting at 40 words per minute.

If anyone knows approximately when vLLM will be optimized for AWQ models, and how big a performance increase to expect, I would be very happy to know!

ptsand avatar Apr 24 '24 06:04 ptsand

Let’s recap the issue discussion so far: it happens with different models, with different GPUs, and with different quantization methods, and it happens under high load.

@prakashsanker’s hypothesis is that this is an issue with the server and not the core engine. I believe he is right.

Maybe someone here is knowledgeable about asynchronous web serving in Python and has time to investigate?

I suppose the expected behaviour would be for the server to start returning 503 once it can’t handle more requests.

br3no avatar Apr 24 '24 06:04 br3no
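As an illustration of that suggested behaviour (a sketch against plain FastAPI, not the actual vLLM server code, with an arbitrary example limit), load shedding could look roughly like this:

import asyncio

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

MAX_IN_FLIGHT = 64  # arbitrary example limit
app = FastAPI()
in_flight = asyncio.Semaphore(MAX_IN_FLIGHT)

@app.middleware("http")
async def shed_load(request: Request, call_next):
    if in_flight.locked():  # all permits taken -> reject instead of silently aborting
        return JSONResponse({"error": "server overloaded"}, status_code=503)
    async with in_flight:
        return await call_next(request)

A 503 gives clients an explicit signal to back off and retry, instead of a 200 followed by a dropped stream.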

I think the problem I was running into may be different from what you folks have. I figured mine out: it was more of a client configuration problem with HTTP calls to FastAPI. I will draft a PR and a new issue specific to the one I ran into; to be honest, it's more that the logging is poor when a request is aborted than anything else. I also noticed that when a request is aborted, the LLM still seems to try to run it anyway? I could be wrong.

davidgxue avatar Apr 25 '24 04:04 davidgxue