text-generation-inference

Unable to deploy lmsys/vicuna-13b-v1.5-16k

Open monuminu opened this issue 1 year ago • 8 comments

System Info

import json
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# role and llm_image were defined elsewhere in the original notebook;
# reconstructed here so the snippet is self-contained
role = sagemaker.get_execution_role()
llm_image = get_huggingface_llm_image_uri("huggingface", version="0.9.3")

# sagemaker config
instance_type = "ml.g5.48xlarge"
number_of_gpu = 2
health_check_timeout = 300

# Define Model and Endpoint configuration parameter
config = {
  'HF_MODEL_ID': "lmsys/vicuna-13b-v1.5-16k", # model_id from hf.co/models
  'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPU used per replica
  'MAX_INPUT_LENGTH': json.dumps(10000),  # Max length of input text
  'MAX_BATCH_PREFILL_TOKENS':json.dumps(10000),
  'MAX_TOTAL_TOKENS': json.dumps(15000),  # Max length of the generation (including input text)
  'HUGGING_FACE_HUB_TOKEN': "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"
  # ,'HF_MODEL_QUANTIZE': "bitsandbytes", # uncomment to enable quantization
}

# check if token is set
assert config['HUGGING_FACE_HUB_TOKEN'] != "<REPLACE WITH YOUR TOKEN>", "Please set your Hugging Face Hub token"

# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  env=config
)
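The snippet above builds the model but never calls deploy, which is where instance_type and health_check_timeout would be used. For completeness, a minimal sketch of that step, following the standard SageMaker Model.deploy API (this part was not shown in the original report):

# Deploy the model as a real-time endpoint; container_startup_health_check_timeout
# gives TGI time to download and shard the weights before health checks kick in.
llm = llm_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  container_startup_health_check_timeout=health_check_timeout,
)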

Information

  • [ ] Docker
  • [ ] The CLI directly

Tasks

  • [ ] An officially supported command
  • [ ] My own modifications

Reproduction

RuntimeError: Not enough memory to handle 16000 total tokens with 10000 prefill tokens. You need to decrease `--max-batch-total-tokens` or `--max-batch-prefill-tokens`

Expected behavior

It should not give an error. It's just a 13B model and I am deploying on a 48xlarge.

monuminu avatar Aug 08 '23 09:08 monuminu

Now if I decrease `max-batch-prefill-tokens`, it says `MAX_INPUT_LENGTH` can't be more than `max-batch-prefill-tokens`.

monuminu avatar Aug 08 '23 10:08 monuminu

@monuminu Yes, you need to adjust all the parameters so that the requests can fit in the VRAM that is left over after the model is loaded.

Narsil avatar Aug 08 '23 10:08 Narsil

Hi @Narsil, I wanted to understand the concept here better. It's a 13B model, which takes around 25 GB of memory, and I have a g5.48xlarge with 192 GB of GPU memory. How can I ensure I am using the correct parameters? Do you have some reference for others too?

monuminu avatar Aug 08 '23 10:08 monuminu
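A rough way to reason about this (my own back-of-envelope estimate, not TGI's internal accounting): in fp16 the weights cost about 2 bytes per parameter, and every token held in a batch costs KV-cache memory in every layer. Note also that the config above sets SM_NUM_GPUS=2, so only 2 of the g5.48xlarge's 8 A10Gs (2 x 24 GB = 48 GB) are actually used by TGI:

# Back-of-envelope VRAM budget for Llama-2-13B in fp16.
# Layer/hidden sizes are from the Llama-2-13B config; this is an
# estimate, not what TGI computes internally.
N_PARAMS = 13e9
N_LAYERS, HIDDEN = 40, 5120
BYTES_FP16 = 2

weights_gb = N_PARAMS * BYTES_FP16 / 1e9           # ~26 GB of weights
kv_per_token = 2 * N_LAYERS * HIDDEN * BYTES_FP16  # K and V, every layer
kv_gb = 16_000 * kv_per_token / 1e9                # ~13 GB for 16k batched tokens
usable_gb = 2 * 24                                 # SM_NUM_GPUS=2 -> 2 x A10G

print(f"weights ~{weights_gb:.0f} GB, KV cache ~{kv_gb:.0f} GB, "
      f"usable ~{usable_gb} GB before CUDA context and activations")

With roughly 26 GB of weights plus 13 GB of KV cache on 48 GB of usable VRAM, there is little headroom left for the CUDA context and activations, which is consistent with the error above; sharding across all 8 GPUs (SM_NUM_GPUS=8) would leave far more room.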

Hey! I am able to deploy lmsys/vicuna-13b-v1.5-16k on 4 x Nvidia A10Gs (g5.12xlarge) using the latest image

Here is the command I am using to run it:

docker run --gpus all --shm-size 1g -p 8080:80 -v ./models:/data \
  ghcr.io/huggingface/text-generation-inference:1.0.0 \
  --model-id lmsys/vicuna-13b-v1.5-16k \
  --num-shard=4 \
  --max-input-length=16000 \
  --max-total-tokens=16000 \
  --max-batch-total-tokens=16000 \
  --max-batch-prefill-tokens=16000 \
  --rope-scaling=linear \
  --rope-factor=4.0

rohanpooniwala avatar Aug 08 '23 10:08 rohanpooniwala

What is your TGI version? I am using 0.9.3.

monuminu avatar Aug 08 '23 10:08 monuminu

I am using v1.0.0 because, AFAIK, TGI added RoPE scaling support in v1.0.0, and lmsys/vicuna-13b-v1.5-16k uses it.

From their HF Page ->

Vicuna v1.5 (16k) is fine-tuned from Llama 2 with supervised instruction fine-tuning and linear RoPE scaling.

rohanpooniwala avatar Aug 08 '23 10:08 rohanpooniwala
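For context, linear RoPE scaling (the --rope-scaling=linear --rope-factor=4.0 flags in the command above) simply divides the position index by a factor before computing the rotary angles, so a 16k context is mapped onto the 4k position range the base model was trained on. A minimal sketch of the idea, not TGI's actual implementation; dim=128 and base=10000 are the usual Llama-2-13B head values:

def rope_angles(position, dim=128, base=10000.0, scaling_factor=4.0):
    # Linear scaling: divide the raw position by the factor, then apply
    # the standard RoPE inverse frequencies for each pair of dimensions.
    pos = position / scaling_factor
    inv_freq = [base ** (-2.0 * i / dim) for i in range(dim // 2)]
    return [pos * f for f in inv_freq]

# Position 16000 with factor 4 gets the same angles as position 4000 unscaled.
assert rope_angles(16000)[0] == rope_angles(4000, scaling_factor=1.0)[0]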

> I am using v1.0.0 because, AFAIK, TGI added RoPE scaling support in v1.0.0, and lmsys/vicuna-13b-v1.5-16k uses it.
>
> From their HF Page ->
>
> Vicuna v1.5 (16k) is fine-tuned from Llama 2 with supervised instruction fine-tuning and linear RoPE scaling.

Thank you for sharing, but I couldn't find any RoPE code in the v1.0.0 tag; I found the code in this commit, so I think the correct Docker image tag should be latest.

zTaoplus avatar Aug 10 '23 03:08 zTaoplus

I am not sure it's possible to just set all of them to 16000. Given the following validations (v1.0.1), this is what's possible:

--max-input-length=16000 \
--max-batch-prefill-tokens=16000 \
--max-total-tokens=16100 \
--max-batch-total-tokens=16100

# Error: ArgumentValidation("`max_total_tokens` must be <= `max_batch_total_tokens`. Given: 16100 and 16000")
# Error: ArgumentValidation("`max_input_length` must be < `max_total_tokens`")

maziyarpanahi avatar Aug 22 '23 17:08 maziyarpanahi
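Putting together the validations quoted in this thread, the relationships between the four flags amount to something like the following (a sketch of the constraints as reported above, not the launcher's actual Rust code; the function name is hypothetical):

def check_tgi_limits(max_input_length, max_total_tokens,
                     max_batch_prefill_tokens, max_batch_total_tokens):
    # `max_input_length` must be < `max_total_tokens` (error quoted above)
    assert max_input_length < max_total_tokens
    # `MAX_INPUT_LENGTH` can't exceed `max-batch-prefill-tokens` (earlier comment)
    assert max_input_length <= max_batch_prefill_tokens
    # `max_total_tokens` must be <= `max_batch_total_tokens` (error quoted above)
    assert max_total_tokens <= max_batch_total_tokens

# The working combination from the comment above passes all three checks:
check_tgi_limits(16000, 16100, 16000, 16100)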

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Apr 13 '24 01:04 github-actions[bot]