Upgrade to support latest vLLM version (max_lora_rank)
Description
In the current version (using the LMI SageMaker image), we are running into the following error:
File "/usr/local/lib/python3.10/dist-packages/vllm/config.py", line 1288, in __post_init__
raise ValueError(
ValueError: max_lora_rank (128) must be one of (8, 16, 32, 64)
It looks like the above error was fixed in vLLM v0.5.5. See the release notes here: https://github.com/vllm-project/vllm/releases/tag/v0.5.5 and the PR here: https://github.com/vllm-project/vllm/pull/7146
References
N/A
Hi @frankfliu - would you be able to help? Thanks.
We are planning a release that will use vllm 0.6.0 (or 0.6.1.post2) soon.
In the meantime, you can try providing a requirements.txt file with vllm==0.5.5 (or later version) to get around this.
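For reference, a minimal sketch of that workaround, assuming the requirements.txt is placed alongside serving.properties in the model artifacts so the container installs it at startup:
```
# requirements.txt (placed next to serving.properties in the model artifacts)
vllm==0.5.5
```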
Thank you @siddvenk for your suggestions.
I tried rebuilding the custom image by running `pip install vllm==0.5.5` in a Dockerfile based on your latest stable image 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.29.0-lmi11.0.0-cu124
We specified the following in the serving.properties file:
```
option.model_id=unsloth/mistral-7b-instruct-v0.3
option.engine=Python
option.rolling_batch=vllm
option.tensor_parallel_degree=1
option.enable_lora=true
option.gpu_memory_utilization=0.95
option.max_model_len=16000
option.max_lora_rank=128
```
We tried setting max_tokens to a really high number, but the response is still very short.
We also get this log, and it appears the vLLM backend does not support the max_tokens param:
```
The following parameters are not supported by vllm with rolling batch: {'logprobs', 'temperature', 'seed', 'max_tokens'}. The supported parameters are set()
```
Do you have any insights?
Yes, you should use max_new_tokens.
You can find the schema for our inference API here: https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docs/lmi/user_guides/lmi_input_output_schema.md
We also support the OpenAI chat completions schema, details here: https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docs/lmi/user_guides/chat_input_output_schema.md
Thanks again for your quick response @siddvenk -
Just want to make sure, should we:
- Add `max_new_tokens` to the `serving.properties` file, e.g. `option.max_new_tokens=16000`
- Or, pass `max_new_tokens` as a parameter when invoking the endpoint, such as:
```
curl -X POST https://my.sample.endpoint.com/invocations \
  -H 'Content-Type: application/json' \
  -d '{
    "inputs": "What is Deep Learning?",
    "parameters": {
      "do_sample": true,
      "max_new_tokens": 16000,
      "details": true
    },
    "stream": true
  }'
```
BTW, forgot to mention, we are deploying this to SageMaker.
There are two different configurations.
On a per request basis, you can specify max_new_tokens to limit the number of generated tokens. This is just a limit on the output, not on the total sequence length.
You can limit the maximum length of sequences globally by setting option.max_model_len in serving.properties. This enforces a limit that applies to all requests, which includes both the input (prompt) tokens and generated output tokens.
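For example, on a SageMaker endpoint a per-request override might look like the following sketch (the endpoint name is a hypothetical placeholder; assumes AWS CLI v2 with credentials configured):
```bash
# Invoke the SageMaker endpoint with max_new_tokens set per request
# (endpoint name "my-lmi-endpoint" is hypothetical)
aws sagemaker-runtime invoke-endpoint \
  --endpoint-name my-lmi-endpoint \
  --content-type application/json \
  --cli-binary-format raw-in-base64-out \
  --body '{"inputs": "What is Deep Learning?", "parameters": {"do_sample": true, "max_new_tokens": 512, "details": true}}' \
  output.json

# Inspect the generated response
cat output.json
```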
Thanks, @siddvenk .
We did more tests, and it turns out the "short response" issue was specific to the custom image I built (mentioned above).
So we suspect we missed some key steps when building the image - can you help us review our process?
Steps:
- Create the following files:
```
|- Dockerfile
|- requirements.txt
```
- In `Dockerfile`:
```
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.29.0-lmi11.0.0-cu124

# Copy files
COPY ./requirements.txt /opt/requirements.txt

# Install third-party Python dependencies within the Docker environment
RUN pip install --upgrade pip && \
    pip install awscli --trusted-host pypi.org --trusted-host files.pythonhosted.org && \
    pip install --trusted-host pypi.org --trusted-host files.pythonhosted.org -r /opt/requirements.txt
```
- In `requirements.txt`:
```
vllm==0.5.5
```
- Build the new docker image using `docker build`
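For reference, the build plus a quick sanity check of which vLLM version actually ended up in the image might look like this sketch (the image tag, account ID, region, and ECR repository below are hypothetical placeholders):
```bash
# Build the custom image from the directory containing the Dockerfile and requirements.txt
docker build -t my-djl-lmi-vllm055:latest .

# Verify the vLLM version installed inside the image
docker run --rm --entrypoint pip my-djl-lmi-vllm055:latest show vllm

# To use it on SageMaker, push the image to an ECR repository in your account
aws ecr get-login-password --region us-east-1 \
  | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
docker tag my-djl-lmi-vllm055:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-djl-lmi-vllm055:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-djl-lmi-vllm055:latest
```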
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.