
[Bug]: "500 Internal Server Error" after upgrade to v0.5.4

Open tonyaw opened this issue 1 year ago • 10 comments

Your current environment

The output of `python collect_env.py`

🐛 Describe the bug

After I upgraded to v0.5.4, I got a "500 Internal Server Error". My manifest snippet to start vLLM:

      containers:
      - name: 8x7b-open
        image: vllm/vllm-openai:v0.5.4
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args: ["--model", "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4", "--host", "0.0.0.0", "--port", "8080", "--tensor-parallel-size", "2", "--seed", "42", "--trust-remote-code"]
        securityContext:
          privileged: true
        ports:
        - containerPort: 8080
        env:
        - name: OMP_NUM_THREADS
          value: "2"
        volumeMounts:
          - mountPath: "/root/.cache"
            name: ceph-volume
        resources:
          limits:
            cpu: '12'
            memory: 200Gi
            nvidia.com/gpu: '2'
          requests:
            cpu: '12'
            memory: 200Gi
            nvidia.com/gpu: '2'

Backtrace log:

INFO:     10.254.17.246:59936 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 189, in create_chat_completion
    generator = await openai_serving_chat.create_chat_completion(
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 185, in create_chat_completion
    return await self.chat_completion_full_generator(
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 436, in chat_completion_full_generator
    async for res in result_generator:
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 196, in generate
    with self.socket() as socket:
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 59, in socket
    socket = self.context.socket(zmq.constants.DEALER)
  File "/usr/local/lib/python3.10/dist-packages/zmq/sugar/context.py", line 354, in socket
    socket_class(  # set PYTHONTRACEMALLOC=2 to get the calling frame
  File "/usr/local/lib/python3.10/dist-packages/zmq/_future.py", line 218, in __init__
    super().__init__(context, socket_type, **kwargs)  # type: ignore
  File "/usr/local/lib/python3.10/dist-packages/zmq/sugar/socket.py", line 156, in __init__
    super().__init__(
  File "_zmq.py", line 690, in zmq.backend.cython._zmq.Socket.__init__
zmq.error.ZMQError: Too many open files

tonyaw avatar Aug 08 '24 03:08 tonyaw

Also, the ulimit and lsof info:

root@8x7b-open-deployment-9fb777c9d-mwq8b:/vllm-workspace# lsof | grep pt_main_t | wc -l
26295
root@8x7b-open-deployment-9fb777c9d-mwq8b:/vllm-workspace# ulimit -n
1048576
root@8x7b-open-deployment-9fb777c9d-mwq8b:/vllm-workspace#

tonyaw avatar Aug 08 '24 03:08 tonyaw
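The lsof pipeline above counts every descriptor held by the pt_main_t threads and compares it against the `ulimit -n` ceiling. The same two numbers can be read from inside the process itself; a minimal sketch (Linux-only, not part of vLLM):

```python
import os
import resource

# The soft/hard RLIMIT_NOFILE for this process (what `ulimit -n` reports).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

# How many descriptors are currently open, counted via /proc (Linux-only).
open_now = len(os.listdir("/proc/self/fd"))

print(f"{open_now} descriptors open of a soft limit of {soft}")
```

When `open_now` keeps climbing toward `soft` under load, a descriptor leak like the one in the traceback is the usual suspect.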

cc @robertgshaw2-neuralmagic

youkaichao avatar Aug 08 '24 04:08 youkaichao

@tonyaw if you want a quick solution, you can try to add --disable-frontend-multiprocessing

youkaichao avatar Aug 08 '24 04:08 youkaichao

What's the side effect of adding the "--disable-frontend-multiprocessing" parameter? This isn't caused by OMP_NUM_THREADS=2, right? I have two A100s, so OMP_NUM_THREADS should be 2, right?

Thanks in advance!

tonyaw avatar Aug 08 '24 05:08 tonyaw

--disable-frontend-multiprocessing will be slower

usually people don't need to set OMP_NUM_THREADS for vLLM

youkaichao avatar Aug 08 '24 05:08 youkaichao

Thanks, I will analyze how many Unix sockets are opened and see if there is anything we can do to reduce the count, since we currently open a new socket for each generate request

robertgshaw2-redhat avatar Aug 08 '24 12:08 robertgshaw2-redhat
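The socket-per-request pattern mentioned above can be illustrated with the stdlib. This is not vLLM's actual RPC code, just a sketch of why per-request sockets exhaust descriptors unless each one is closed promptly (fd counting via /proc is Linux-only):

```python
import os
import socket

def fd_count() -> int:
    # Linux-only: each entry under /proc/self/fd is one open descriptor.
    return len(os.listdir("/proc/self/fd"))

base = fd_count()

# Leaky pattern: a fresh socket pair per "request", never closed.
leaked = [socket.socketpair() for _ in range(100)]  # 2 fds per pair
assert fd_count() == base + 200

# Fix: close (or better, reuse) each socket once its request finishes.
for a, b in leaked:
    a.close()
    b.close()
assert fd_count() == base
```

At a few hundred descriptors per second of traffic, even a 1048576 `ulimit -n` is eventually exhausted if closes lag behind opens.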

> --disable-frontend-multiprocessing will be slower
>
> usually people don't need to set OMP_NUM_THREADS for vLLM

@youkaichao @robertgshaw2-neuralmagic I have set the --disable-frontend-multiprocessing param, but I still get the following error:

File "/data/tangjiakai/anaconda3/envs/agentscope/lib/python3.11/site-packages/openai/_base_client.py", line 1074, in _retry_request
    return self._request(
           ^^^^^^^^^^^^^^
  File "/data/tangjiakai/anaconda3/envs/agentscope/lib/python3.11/site-packages/openai/_base_client.py", line 1026, in _request
    return self._retry_request(
           ^^^^^^^^^^^^^^^^^^^^
  File "/data/tangjiakai/anaconda3/envs/agentscope/lib/python3.11/site-packages/openai/_base_client.py", line 1074, in _retry_request
    return self._request(
           ^^^^^^^^^^^^^^
  File "/data/tangjiakai/anaconda3/envs/agentscope/lib/python3.11/site-packages/openai/_base_client.py", line 1026, in _request
    return self._retry_request(
           ^^^^^^^^^^^^^^^^^^^^
  File "/data/tangjiakai/anaconda3/envs/agentscope/lib/python3.11/site-packages/openai/_base_client.py", line 1074, in _retry_request
    return self._request(
           ^^^^^^^^^^^^^^
  File "/data/tangjiakai/anaconda3/envs/agentscope/lib/python3.11/site-packages/openai/_base_client.py", line 1026, in _request
    return self._retry_request(
           ^^^^^^^^^^^^^^^^^^^^
  File "/data/tangjiakai/anaconda3/envs/agentscope/lib/python3.11/site-packages/openai/_base_client.py", line 1074, in _retry_request
    return self._request(
           ^^^^^^^^^^^^^^
  File "/data/tangjiakai/anaconda3/envs/agentscope/lib/python3.11/site-packages/openai/_base_client.py", line 1041, in _request
    raise self._make_status_error_from_response(err.response) from None
openai.InternalServerError: Error code: 500 - {'detail': ''}

My vLLM version is the latest, 0.5.5, and the command is:

python -m vllm.entrypoints.openai.api_server \
        --model /data/pretrain_dir/Meta-Llama-3-8B-Instruct \
        --trust-remote-code \
        --port $port \
        --dtype auto \
        --pipeline-parallel-size 1 \
        --enforce-eager \
        --enable-prefix-caching \
        --enable-lora \
        --disable-frontend-multiprocessing

TangJiakai avatar Sep 05 '24 03:09 TangJiakai

The interesting thing is that even when I send only one prompt at a time (to ensure the LLM isn't overloaded), generation sometimes succeeds and sometimes fails. The error when it fails is still "Error code: 500 - {'detail': ''}".

TangJiakai avatar Sep 05 '24 03:09 TangJiakai

@TangJiakai this looks like a client side error. do you have the server side error trace?

youkaichao avatar Sep 05 '24 05:09 youkaichao

> @TangJiakai this looks like a client side error. do you have the server side error trace?

Yes, you are right! It happened on the client side.

TangJiakai avatar Sep 13 '24 03:09 TangJiakai

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions[bot] avatar Dec 13 '24 02:12 github-actions[bot]

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

github-actions[bot] avatar Jan 13 '25 02:01 github-actions[bot]

I am facing the same issue, but only with Llama.

Server side:

python -m vllm.entrypoints.openai.api_server \
    --model /onyx/data/p118/huggingface_LLMs/meta-llama/Llama-3.1-8B-Instruct/ \
    --host 0.0.0.0 \
    --port 3000 \
    --gpu-memory-utilization 0.7 \
    --tensor-parallel-size 1 \
    --pipeline-parallel-size 2 \
    --device cuda \
    --enforce-eager \
    --dtype=half

Call:

curl http://172.30.1.111:3000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/onyx/data/p118/huggingface_LLMs/meta-llama/Llama-3.1-8B-Instruct/",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }'
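For reference, the same call sketched with Python's stdlib urllib (host, port, and model path taken verbatim from the command above; nothing vLLM-specific):

```python
import json
from urllib import request

def build_chat_request(base_url: str, model: str, user_msg: str) -> request.Request:
    """Assemble the same POST /v1/chat/completions call as the curl above."""
    body = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_msg},
        ],
    }
    return request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request(
    "http://172.30.1.111:3000",
    "/onyx/data/p118/huggingface_LLMs/meta-llama/Llama-3.1-8B-Instruct/",
    "Who won the world series in 2020?",
)
# request.urlopen(req) would send it; skipped here since the server above
# is only reachable inside that cluster.
```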

Result:

File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/vllm/worker/model_runner.py", line 1721, in execute_model
    hidden_or_intermediate_states = model_executable(
                                    ^^^^^^^^^^^^^^^^^
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/vllm/model_executor/models/llama.py", line 539, in forward
    model_output = self.model(input_ids, positions, kv_caches,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/vllm/compilation/decorators.py", line 170, in __call__
    return self.forward(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/vllm/model_executor/models/llama.py", line 363, in forward
    hidden_states, residual = layer(positions, hidden_states,
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/vllm/model_executor/models/llama.py", line 277, in forward
    hidden_states = self.self_attn(positions=positions,
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/vllm/model_executor/models/llama.py", line 201, in forward
    attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/vllm/attention/layer.py", line 184, in forward
    return torch.ops.vllm.unified_attention(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/torch/_ops.py", line 1116, in __call__
    return self._op(*args, **(kwargs or {}))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/vllm/attention/layer.py", line 290, in unified_attention
    return self.impl.forward(self, query, key, value, kv_cache, attn_metadata)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/vllm/attention/backends/xformers.py", line 572, in forward
    out = PagedAttention.forward_prefix(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/vllm/attention/ops/paged_attn.py", line 211, in forward_prefix
    context_attention_fwd(
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/vllm/attention/ops/prefix_prefill.py", line 825, in context_attention_fwd
    _fwd_kernel[grid](
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/triton/runtime/jit.py", line 345, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/triton/runtime/jit.py", line 607, in run
    device = driver.active.get_current_device()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/triton/runtime/driver.py", line 23, in __getattr__
    self._initialize_obj()
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/triton/runtime/driver.py", line 20, in _initialize_obj
    self._obj = self._init_fn()
                ^^^^^^^^^^^^^^^
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/triton/runtime/driver.py", line 9, in _create_driver
    return actives[0]()
           ^^^^^^^^^^^^
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/triton/backends/nvidia/driver.py", line 371, in __init__
    self.utils = CudaUtils()  # TODO: make static
                 ^^^^^^^^^^^
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/triton/backends/nvidia/driver.py", line 80, in __init__
    mod = compile_module_from_src(Path(os.path.join(dirname, "driver.c")).read_text(), "cuda_utils")
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/triton/backends/nvidia/driver.py", line 57, in compile_module_from_src
    so = _build(name, src_path, tmpdir, library_dirs(), include_dir, libraries)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/triton/runtime/build.py", line 48, in _build
    ret = subprocess.check_call(cc_cmd)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/subprocess.py", line 413, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/nvme/h/buildsets/eb_cyclone_rl/software/GCCcore/11.2.0/bin/gcc', '/tmp/tmp9hi7g5f0/main.c', '-O3', '-shared', '-fPIC', '-o', '/tmp/tmp9hi7g5f0/cuda_utils.cpython-311-x86_64-linux-gnu.so', '-lcuda', '-L/nvme/h/lb21hg1/llm-env/lib/python3.11/site-packages/triton/backends/nvidia/lib', '-L/lib64', '-L/lib', '-I/nvme/h/lb21hg1/llm-env/lib/python3.11/site-packages/triton/backends/nvidia/include', '-I/tmp/tmp9hi7g5f0', '-I/usr/include/python3.11']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/uvicorn/protocols/http/httptools_impl.py", line 409, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/starlette/applications.py", line 112, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/starlette/middleware/errors.py", line 187, in __call__
    raise exc
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/starlette/middleware/errors.py", line 165, in __call__
    await self.app(scope, receive, _send)
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/starlette/routing.py", line 714, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/starlette/routing.py", line 734, in app
    await route.handle(scope, receive, send)
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/starlette/routing.py", line 288, in handle
    await self.app(scope, receive, send)
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/starlette/routing.py", line 76, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/starlette/routing.py", line 73, in app
    response = await f(request)
               ^^^^^^^^^^^^^^^^
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/fastapi/routing.py", line 301, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/fastapi/routing.py", line 212, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/vllm/entrypoints/utils.py", line 54, in wrapper
    return handler_task.result()
           ^^^^^^^^^^^^^^^^^^^^^
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 390, in create_chat_completion
    generator = await handler.create_chat_completion(request, raw_request)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/vllm/entrypoints/openai/serving_chat.py", line 261, in create_chat_completion
    return await self.chat_completion_full_generator(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/vllm/entrypoints/openai/serving_chat.py", line 680, in chat_completion_full_generator
    async for res in result_generator:
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 1004, in generate
    async for output in await self.add_request(
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 114, in generator
    raise result
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 56, in _log_task_completion
    return_value = task.result()
                   ^^^^^^^^^^^^^
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 823, in run_engine_loop
    result = task.result()
             ^^^^^^^^^^^^^
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 746, in engine_step
    request_outputs = await self.engine.step_async(virtual_engine)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 351, in step_async
    outputs = await self.model_executor.execute_model_async(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/vllm/executor/executor_base.py", line 343, in execute_model_async
    return await self._driver_execute_model_async(execute_model_req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/vllm/executor/mp_distributed_executor.py", line 231, in _driver_execute_model_async
    results = await asyncio.gather(*tasks)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/vllm/utils.py", line 1329, in _run_task_with_lock
    return await task(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/vllm/worker/worker_base.py", line 411, in execute_model
    output = self.model_runner.execute_model(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/nvme/h/lb21hg1/llm-env/lib64/python3.11/site-packages/vllm/worker/model_runner_base.py", line 152, in _wrapper
    raise type(err)(
          ^^^^^^^^^^
TypeError: CalledProcessError.__init__() missing 1 required positional argument: 'cmd'

hiyamgh avatar Mar 10 '25 06:03 hiyamgh
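Note that the final TypeError in this trace hides the real failure (the gcc compile error further up): an error wrapper of the form `raise type(err)(message)` assumes every exception type accepts a lone message argument, which `subprocess.CalledProcessError` does not. A minimal reproduction of that masking:

```python
import subprocess

# The underlying failure: gcc (compiling Triton's cuda_utils) exited non-zero.
err = subprocess.CalledProcessError(returncode=1, cmd=["gcc", "main.c"])

# CalledProcessError requires both `returncode` and `cmd`, so re-raising it
# with a single message argument blows up inside the wrapper itself:
try:
    raise type(err)("Error in model execution: " + str(err))
except TypeError as wrapped:
    print(wrapped)  # ... missing 1 required positional argument: 'cmd'
```

So the actionable error here is the gcc invocation at the end of the first traceback, not the TypeError.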