Upgrade to support latest vLLM version (max_lora_rank)
Description
In the current version (using the LMI SageMaker image), we are running into the following error:
File "/usr/local/lib/python3.10/dist-packages/vllm/config.py", line 1288, in __post_init__
raise ValueError(
ValueError: max_lora_rank (128) must be one of (8, 16, 32, 64)
It looks like the above error was fixed in vLLM v0.5.5. See the release notes here: https://github.com/vllm-project/vllm/releases/tag/v0.5.5 and the PR here: https://github.com/vllm-project/vllm/pull/7146
References
N/A
Hi @frankfliu - would you be able to help? Thanks.
We are planning a release that will use vllm 0.6.0 (or 0.6.1.post2) soon.
In the meantime, you can try providing a requirements.txt file with vllm==0.5.5 (or later version) to get around this.
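For reference, a minimal sketch of that workaround, assuming the requirements.txt is placed alongside serving.properties in the model artifacts so the container installs it at startup:
```
# requirements.txt (placed next to serving.properties in the model artifacts)
vllm==0.5.5
```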
Thank you @siddvenk for your suggestions.
I tried rebuilding the custom image by running `pip install vllm==0.5.5` in a Dockerfile based on your latest stable image 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.29.0-lmi11.0.0-cu124
We specified the following in the serving.properties file:
```
option.model_id=unsloth/mistral-7b-instruct-v0.3
option.engine=Python
option.rolling_batch=vllm
option.tensor_parallel_degree=1
option.enable_lora=true
option.gpu_memory_utilization=0.95
option.max_model_len=16000
option.max_lora_rank=128
```
We tried setting max_tokens to a really high number, but the response is still very short.
We also get this log, and it appears the vLLM backend does not support the max_tokens param:
```
The following parameters are not supported by vllm with rolling batch: {'logprobs', 'temperature', 'seed', 'max_tokens'}. The supported parameters are set()
```
Do you have any insights?
Yes, you should use max_new_tokens.
You can find the schema for our inference API here: https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docs/lmi/user_guides/lmi_input_output_schema.md
We also support the OpenAI chat completions schema, details here: https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docs/lmi/user_guides/chat_input_output_schema.md
Thanks again for your quick response @siddvenk -
Just want to make sure, should we:
- Add `max_new_tokens` to the `serving.properties` file, e.g. `option.max_new_tokens=16000`
- Or, pass `max_new_tokens` as a parameter when invoking the endpoint, such as:
```
curl -X POST https://my.sample.endpoint.com/invocations \
  -H 'Content-Type: application/json' \
  -d '{
    "inputs": "What is Deep Learning?",
    "parameters": {
      "do_sample": true,
      "max_new_tokens": 16000,
      "details": true
    },
    "stream": true
  }'
```
BTW, forgot to mention, we are deploying this to SageMaker.
There are two different configurations.
On a per request basis, you can specify max_new_tokens to limit the number of generated tokens. This is just a limit on the output, not on the total sequence length.
You can limit the maximum length of sequences globally by setting option.max_model_len in serving.properties. This enforces a limit that applies to all requests, which includes both the input (prompt) tokens and generated output tokens.
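For example, on a SageMaker endpoint a per-request override might look like the following sketch (the endpoint name is a hypothetical placeholder; assumes AWS CLI v2 with credentials configured):
```bash
# Invoke the SageMaker endpoint with max_new_tokens set per request
# (endpoint name "my-lmi-endpoint" is hypothetical)
aws sagemaker-runtime invoke-endpoint \
  --endpoint-name my-lmi-endpoint \
  --content-type application/json \
  --cli-binary-format raw-in-base64-out \
  --body '{"inputs": "What is Deep Learning?", "parameters": {"do_sample": true, "max_new_tokens": 512, "details": true}}' \
  output.json

# Inspect the generated response
cat output.json
```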
Thanks, @siddvenk .
We did more tests, and it turns out the "short response" issue was specific to the custom image I built (mentioned above).
So we suspect we missed some key steps when building the image - can you help us review our process?
Steps:
- Create the following files:
```
|- Dockerfile
|- requirements.txt
```
- In `Dockerfile`:
```
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.29.0-lmi11.0.0-cu124

# Copy files
COPY ./requirements.txt /opt/requirements.txt

# Install third-party Python dependencies within the Docker environment
RUN pip install --upgrade pip && \
    pip install awscli --trusted-host pypi.org --trusted-host files.pythonhosted.org && \
    pip install --trusted-host pypi.org --trusted-host files.pythonhosted.org -r /opt/requirements.txt
```
- In `requirements.txt`:
```
vllm==0.5.5
```
- Build the new docker image using `docker build`
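For reference, the build plus a quick sanity check of which vLLM version actually ended up in the image might look like this sketch (the image tag, account ID, region, and ECR repository below are hypothetical placeholders):
```bash
# Build the custom image from the directory containing the Dockerfile and requirements.txt
docker build -t my-djl-lmi-vllm055:latest .

# Verify the vLLM version installed inside the image
docker run --rm --entrypoint pip my-djl-lmi-vllm055:latest show vllm

# To use it on SageMaker, push the image to an ECR repository in your account
aws ecr get-login-password --region us-east-1 \
  | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
docker tag my-djl-lmi-vllm055:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-djl-lmi-vllm055:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-djl-lmi-vllm055:latest
```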
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.