text-generation-inference
Unable to deploy lmsys/vicuna-13b-v1.5-16k
System Info
```python
import json
from sagemaker.huggingface import HuggingFaceModel

# sagemaker config
instance_type = "ml.g5.48xlarge"
number_of_gpu = 2
health_check_timeout = 300

# Define Model and Endpoint configuration parameters
config = {
    'HF_MODEL_ID': "lmsys/vicuna-13b-v1.5-16k",  # model_id from hf.co/models
    'SM_NUM_GPUS': json.dumps(number_of_gpu),    # Number of GPUs used per replica
    'MAX_INPUT_LENGTH': json.dumps(10000),       # Max length of input text
    'MAX_BATCH_PREFILL_TOKENS': json.dumps(10000),
    'MAX_TOTAL_TOKENS': json.dumps(15000),       # Max length of the generation (including input text)
    'HUGGING_FACE_HUB_TOKEN': "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"
    # ,'HF_MODEL_QUANTIZE': "bitsandbytes",      # uncomment to quantize
}

# check if token is set
assert config['HUGGING_FACE_HUB_TOKEN'] != "<REPLACE WITH YOUR TOKEN>", "Please set your Hugging Face Hub token"

# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    env=config
)
```
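For context, the snippet above stops before the actual deployment call. A minimal sketch of the usual next step (assuming `role` and `llm_image` are defined earlier in the notebook, as in the standard SageMaker LLM deployment examples):

```python
# Hypothetical continuation of the snippet above: deploy the endpoint and send
# a quick test request. `llm_model`, `instance_type` and `health_check_timeout`
# come from the config block above.
llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,  # allow time for the model to load
)

# simple smoke test against the endpoint
response = llm.predict({
    "inputs": "What is Vicuna?",
    "parameters": {"max_new_tokens": 256},
})
print(response)
```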
Information
- [ ] Docker
- [ ] The CLI directly
Tasks
- [ ] An officially supported command
- [ ] My own modifications
Reproduction
```
RuntimeError: Not enough memory to handle 16000 total tokens with 10000 prefill tokens. You need to decrease `--max-batch-total-tokens` or `--max-batch-prefill-tokens`
```
Expected behavior
It should not give an error. It's just a 13B model and I am deploying it on a g5.48xlarge.
Now if I decrease `max-batch-prefill-tokens`, it says `MAX_INPUT_LENGTH` can't be more than `max-batch-prefill-tokens`.
@monuminu Yes, you need to adjust all the parameters so that the requests can fit in the extra VRAM left after the model is loaded.
Hi @Narsil, I wanted to understand the concept here a bit better. It's a 13B model, which takes around 25 GB of memory, and I have a g5.48xlarge with 192 GB of memory. How can I make sure I am using the correct parameters? Do you have some reference for others too?
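As a rough way to reason about it, here is a back-of-the-envelope sketch (my own assumptions: the standard Llama-2-13B geometry of 40 layers and hidden size 5120, fp16 KV cache; this is not TGI's exact memory accounting):

```python
# KV cache per token ≈ 2 (K and V) * num_layers * hidden_size * bytes_per_value.
# Assumed Llama-2-13B geometry: 40 layers, hidden size 5120, fp16 (2 bytes).
num_layers, hidden_size, bytes_per_value = 40, 5120, 2
kv_bytes_per_token = 2 * num_layers * hidden_size * bytes_per_value
print(f"{kv_bytes_per_token / 1024**2:.2f} MiB per token")  # ~0.78 MiB

# If (hypothetically) the sharded weights leave ~20 GiB of free VRAM, the
# KV cache can hold roughly this many tokens across a batch:
free_vram_gib = 20
print(int(free_vram_gib * 1024**3 / kv_bytes_per_token), "tokens")  # ≈26k tokens
```

The various `max-*-tokens` settings have to stay within whatever that leftover budget actually is on your shards, which is why the launcher rejects combinations that would not fit.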
Hey! I am able to deploy lmsys/vicuna-13b-v1.5-16k on 4 x NVIDIA A10Gs (g5.12xlarge) using the latest image. Here is the command I am using to run it:

```sh
docker run --gpus all --shm-size 1g -p 8080:80 -v ./models:/data \
  ghcr.io/huggingface/text-generation-inference:1.0.0 \
  --model-id lmsys/vicuna-13b-v1.5-16k \
  --num-shard=4 \
  --max-input-length=16000 \
  --max-total-tokens=16000 \
  --max-batch-total-tokens=16000 \
  --max-batch-prefill-tokens=16000 \
  --rope-scaling=linear \
  --rope-factor=4.0
```
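Once the container is up, a quick sanity check against TGI's `/generate` route looks like this (a sketch; the prompt and sampling parameters are arbitrary):

```python
# Minimal client sketch for the /generate route exposed by the container above.
import requests

resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Summarize the plot of Hamlet in three sentences.",
        "parameters": {"max_new_tokens": 200, "temperature": 0.7},
    },
    timeout=120,
)
print(resp.json()["generated_text"])
```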
What is your TGI version? I am using 0.9.3.
I am using v1.0.0 because AFAIK TGI supported RoPE scaling after releasing v1.0.0, and lmsys/vicuna-13b-v1.5-16k uses it. From their HF page:

> Vicuna v1.5 (16k) is fine-tuned from Llama 2 with supervised instruction fine-tuning and linear RoPE scaling.
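For intuition, linear RoPE scaling simply divides the position indices by the scaling factor before computing the rotary angles, which is what `--rope-scaling=linear --rope-factor=4.0` enables. An illustrative sketch (not TGI's internal code):

```python
# Linear RoPE scaling sketch: positions are compressed by the factor so a 16k
# context maps back into the 4k range the base Llama 2 model was trained on.
import numpy as np

def rotary_angles(position_ids, dim=128, base=10000.0, scaling_factor=1.0):
    # inverse frequencies for each pair of head dimensions
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    # linear scaling: divide positions by the factor (4.0 for vicuna-13b-v1.5-16k)
    scaled_positions = np.asarray(position_ids, dtype=np.float64) / scaling_factor
    return np.outer(scaled_positions, inv_freq)  # angles fed into cos/sin rotation

# position 16000 with factor 4.0 gets the same angles as position 4000 unscaled
assert np.allclose(rotary_angles([16000], scaling_factor=4.0),
                   rotary_angles([4000], scaling_factor=1.0))
```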
Thank you for sharing, but I couldn't find any RoPE-related code in the v1.0.0 tag. I found the code in this commit, so I think the correct Docker image tag should be `latest`.
I am not sure it's possible to just set all of them to 16000. This is what's possible given the following validations (v1.0.1):
```sh
--max-input-length=16000 \
--max-batch-prefill-tokens=16000 \
--max-total-tokens=16100 \
--max-batch-total-tokens=16100
# Error: ArgumentValidation("`max_total_tokens` must be <= `max_batch_total_tokens`. Given: 16100 and 16000")
# Error: ArgumentValidation("`max_input_length` must be < `max_total_tokens`")
```
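For reference, a small sketch of the relationships those errors (plus the earlier `MAX_INPUT_LENGTH` complaint) appear to enforce; this mirrors the messages quoted in this thread, not TGI's actual validation code:

```python
# Illustrative consistency check for the launcher arguments, derived from the
# error messages in this thread.
def check_args(max_input_length, max_total_tokens,
               max_batch_prefill_tokens, max_batch_total_tokens):
    assert max_input_length < max_total_tokens, \
        "`max_input_length` must be < `max_total_tokens`"
    assert max_total_tokens <= max_batch_total_tokens, \
        "`max_total_tokens` must be <= `max_batch_total_tokens`"
    assert max_input_length <= max_batch_prefill_tokens, \
        "`max_input_length` can't be more than `max_batch_prefill_tokens`"

# e.g. one combination that satisfies all three constraints:
check_args(max_input_length=16000, max_total_tokens=16100,
           max_batch_prefill_tokens=16000, max_batch_total_tokens=16100)
```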
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.