[Frontend] [Core] feat: Add model loading using `tensorizer`
Tensorizer Support
This PR allows models used for the OpenAI-compatible API server to be loaded using Coreweave's Tensorizer, enabling extremely fast (faster than cached Safetensors) model loads from HTTP/HTTPS, Redis, and S3 endpoints.
The key changes involved are:
- Lists `tensorizer>=2.8.1` as a runtime dependency in `requirements.txt`.
- Adds `tensorizer_loader.py` to `vllm/model_executor`, providing utility functions for tensorizer.
- Adds multiple args to vLLM's OpenAI inference service entrypoint that allow the user to specify the path to serialized-by-tensorizer model tensors, as well as arguments for tensorizer's deserializer.
- Allows deserialization of serialized model tensors in HuggingFace's model format, as well as deserialization of serialized vLLM-formatted models. The latter enables loading with `plaid_mode`, which can allow a non-locally stored Llama 2 13B to start serving requests in as little as 10 seconds. Encrypting and decrypting model tensors is also supported.
- Adds a `tensorize_vllm_model.py` script to `examples/` that allows vLLM models to be serialized and deserialized with `tensorizer`.
Credentialing for S3 is supported by passing a user's access key and secret key via the `S3_ACCESS_KEY_ID` and `S3_SECRET_ACCESS_KEY` environment variables, respectively. They can also be specified as CLI args to the API server entrypoint.
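For example, a minimal sketch of setting these credentials programmatically before launching the server (the environment variable names are those listed above; the key values are placeholders):

import os

# Environment variables used by the tensorizer loading path for s3:// URIs
# (names taken from the PR description; values are placeholders)
os.environ["S3_ACCESS_KEY_ID"] = "<your-access-key-id>"
os.environ["S3_SECRET_ACCESS_KEY"] = "<your-secret-access-key>"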
Model loading benchmarks
Tensorizer can load models like Llama 2 13B in as little as 10 seconds. To do so, a model must be serialized using TensorSerializer to a `.tensors` file located either locally or at an S3, HTTP/HTTPS, or Redis endpoint. `--tensorizer-uri` must be specified with the location of the serialized tensors when invoking the API server.
Example usage:
python -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 \
--model EleutherAI/pythia-6.9b \
--load-format tensorizer \
--tensorizer-uri s3://tensorized/EleutherAI/pythia-6.9b/fp16/model.tensors
If a vLLM model is serialized, `plaid_mode` can be used, which loads much faster. The following plot demonstrates model loading time benchmarks for vLLM's OpenAI-compatible inference server on an Nvidia A40 GPU.
Tensorizer is so fast that it loads models faster than Safetensors even locally.
@cadedaniel @rkooo567 Pinging to request an assigned reviewer from the team when possible!
@cadedaniel @rkooo567 @Yard1 @WoosukKwon @zhuohan123 @ywang96
All tests are passing. Can I get eyes on this please? Cheers!
Hello! Just pinging the review team for an assigned reviewer. This would be a hugely impactful feature for vLLM users, especially those who prioritize responsiveness during inference, tensor encryption during loading, or extremely fast model loading via HTTP! Please refer to my benchmarks here and the feature request I made for more on that.
@cadedaniel @rkooo567 @Yard1 @WoosukKwon @zhuohan123 @ywang96 @esmeetu @hmellor @liangfu @LiuXiaoxuanPKU @simon-mo
EDIT: I see this has been mentioned as part of vLLM's feature roadmap! Apologies for the pings and delighted to see it on the roadmap!
Hey @sangstar ! Thank you for the contribution - I will take a look at this PR next week and test it out!
Out of curiosity, how is support for tensor-parallelism-sharded models looking with tensorizer?
I see it's possible but we'd need some extra code to support different paths depending on the shard rank. Maybe we could pass a list of paths (assert n == world_size) and have each rank grab the corresponding path?
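Something along those lines, as a rough sketch (the list-of-URIs argument and helper function here are hypothetical, not part of this PR):

def select_shard_uri(tensorizer_uris: list[str], rank: int,
                     world_size: int) -> str:
    # One serialized .tensors file per tensor-parallel rank; each rank
    # loads only the shard at its own index.
    assert len(tensorizer_uris) == world_size, (
        "Expected one tensorizer URI per tensor-parallel rank")
    return tensorizer_uris[rank]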
Also, I tested this using the following workflow:
- Use the provided script to generate local files for `meta-llama/Llama-2-7b-chat-hf`
- Modify `examples/offline_inference.py` with the following:
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf",
tensor_parallel_size=1,
load_format="tensorizer",
tensorizer_args=TensorizerArgs(
tensorizer_uri="/PATH/TO/vllm/model.tensors",
num_readers=1,
),
trust_remote_code=True)
And I am getting gibberish output. Any idea what's wrong? EDIT: looks to be related to RotaryEmbedding
We should definitely test this code path, and ensure weights and outputs are identical between safetensors and tensorizer.
Out of curiosity, how is support for tensor-parallelism-sharded models looking with tensorizer?
Tensorizer currently doesn't support tensor-parallelism-sharded models, but tensors can be loaded onto one GPU and then transferred device-to-device if there is space.
And I am getting gibberish output. Any idea what's wrong?
We should definitely test this code path, and ensure weights are identical between safetensors and tensorizer.
On my side, my output doesn't seem completely nonsensical, and this is with a max context length altered to be lower:
Prompt: 'Hello, my name is', Generated text: ' Dustin Nelson and I’m going to be your tutor!\n'
Prompt: 'The president of the United States is', Generated text: ' a member of an exclusive club.\nDonald Trump is not the only person to'
Prompt: 'The capital of France is', Generated text: ' one of the most visited cities in Europe, and its location makes it a great'
Prompt: 'The future of AI is', Generated text: ' all about interoperability\nHere’s why developers and businesses will start'
Perhaps it has to do with your serialized model tensors? Make sure to distinguish between the working definitions of a vLLM-formatted model and an HF-formatted model here. If you're trying to deserialize a vLLM model, make sure that its `tensorizer_uri` has "vllm" in it somewhere. Otherwise, if you're trying to deserialize an HF model, it shouldn't have "vllm" in the `tensorizer_uri` (I understand that this is a bit hacky, as per your comment, so I'll change that).
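For reference, a minimal sketch of that heuristic as currently described (the exact signature of `_is_vllm_model` in this PR may differ):

def _is_vllm_model(tensorizer_uri: str) -> bool:
    # Hacky heuristic described above: a URI containing "vllm" is treated as a
    # vLLM-formatted serialized model; anything else is treated as HF-format.
    return "vllm" in tensorizer_uri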
Please note that the model also needs to be serialized beforehand. My example script shows how to serialize a vLLM-formatted model. Serializing a model from the HF model hub is also straightforward, as pulled from the README.md for Tensorizer:
import torch
from tensorizer import TensorSerializer
from transformers import AutoModelForCausalLM
model_ref = "EleutherAI/gpt-j-6B"
# For less intensive requirements, swap above with the line below:
# model_ref = "EleutherAI/gpt-neo-125M"
model_name = model_ref.split("/")[-1]
# Change this to your S3 bucket.
s3_bucket = "bucket"
s3_uri = f"s3://{s3_bucket}/{model_name}.tensors"
model = AutoModelForCausalLM.from_pretrained(
model_ref,
revision="float16",
torch_dtype=torch.float16,
low_cpu_mem_usage=True,
)
serializer = TensorSerializer(s3_uri)
serializer.write_module(model)
serializer.close()
Are you able to provide more context into how you ran into this problem? I'm also happy to write a test confirming that the tensors aren't altered, although I'm not sure how painful that will be for the testing suite, as serializing is a naturally longer operation than deserializing.
Edit:
looks to be related to RotaryEmbedding
Ah okay, I'll assume that was the issue then
@sangstar I ran
python tensorize_vllm_model.py --model meta-llama/Llama-2-7b-chat-hf serialize --serialized-directory /PATH/TO/vllm and then used the created on-disk file to load the model with
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf",
tensor_parallel_size=1,
load_format="tensorizer",
tensorizer_args=TensorizerArgs(
tensorizer_uri="/PATH/TO/vllm/model.tensors",
num_readers=1,
),
trust_remote_code=True)
This is what I am getting as output:
Prompt: 'Hello, my name is', Generated text: ''
Prompt: 'The president of the United States is', Generated text: 'ajeguskyExcelристи upload Richmondegu Microsoft Hier Excel Nilimen arr intent'
Prompt: 'The capital of France is', Generated text: ' Offiznofollow Krit KritConfigurationonnenigeristiquedefin awkbibliothek Convertweenстваording'
Prompt: 'The future of AI is', Generated text: ''
For comparison, this is what I get without tensorizer:
Prompt: 'Hello, my name is', Generated text: " Sherry and I'm a 35-year-old woman from"
Prompt: 'The president of the United States is', Generated text: ' a member of Congress, which means that he or she is subject to the same'
Prompt: 'The capital of France is', Generated text: ' Paris. This is a fact that is well known and widely accepted. However,'
Prompt: 'The future of AI is', Generated text: ' likely to be shaped by a combination of technological advancements, soci'
I hardcoded `_is_vllm_model` to always return `True`. That was the only change to the code on my side.
Also, digging a little deeper into it, I am not sure about RotaryEmbedding anymore. I thought that was the cause, since it is generated when the model class is initialized, but it looks like whatever is loaded should be identical. I compared the weight tensors after loading with tensorizer and without, and the model looks to be identical, yet the outputs are different. Very weird...
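For reference, a comparison along these lines can be done with a generic helper like the following (a sketch, not the exact code used here):

import torch

def state_dicts_match(model_a: torch.nn.Module,
                      model_b: torch.nn.Module) -> bool:
    # Compare the two models' parameters tensor-by-tensor.
    sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
    if sd_a.keys() != sd_b.keys():
        return False
    return all(torch.equal(sd_a[k].cpu(), sd_b[k].cpu()) for k in sd_a)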
I am getting gibberish with EleutherAI/gpt-j-6B as well. Let me know if there is any way I can help.
I am running unmodified code from this PR in a fresh environment on a g5.12xlarge instance.
I figured it out; it's an issue with the `linear_weights` attribute present on all of vLLM's linear layers. I will make a separate PR to fix it. The workaround for this PR is to add
for child in self.model.modules():
if hasattr(child, "linear_weights"):
for name, weight in child.linear_weights.items():
if isinstance(weight, torch.Tensor):
child.linear_weights[name] = getattr(child, name)
in `TensorizerAgent.deserialize()` after deserializing the weights.
Let's add e2e tests with a small model (like opt 125m) that ensure the output is exactly the same with and without tensorizer.
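Something like the following, as a sketch (assuming the `TensorizerArgs`/`load_format` interface used earlier in this thread; the import path, file path, and sampling settings are placeholders):

from vllm import LLM, SamplingParams
# Import path for TensorizerArgs is an assumption; it is introduced by this PR.
from vllm.model_executor.tensorizer_loader import TensorizerArgs

def test_tensorizer_outputs_match():
    prompts = ["Hello, my name is", "The capital of France is"]
    # Greedy sampling so both runs are deterministic and comparable.
    sampling = SamplingParams(temperature=0.0, max_tokens=16)

    reference = LLM(model="facebook/opt-125m")
    ref_out = [o.outputs[0].text for o in reference.generate(prompts, sampling)]

    tensorized = LLM(model="facebook/opt-125m",
                     load_format="tensorizer",
                     tensorizer_args=TensorizerArgs(
                         tensorizer_uri="/PATH/TO/opt-125m/model.tensors"))
    tz_out = [o.outputs[0].text for o in tensorized.generate(prompts, sampling)]

    assert ref_out == tz_out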
I was working on this too for a substantive reply. Thanks for that great catch! I'll add a test for this in this PR as you mentioned, as well as address your other unresolved comments.
I've opened https://github.com/vllm-project/vllm/pull/3977 and I confirmed that applying it on top of this PR fixes the issue.
@sangstar We can easily support LoRA with the following workaround in TensorizerAgent:
def _resize_lora_embeddings(self):
"""Modify LoRA embedding layers to use bigger tensors
to allow for adapter added tokens."""
for child in self.model.modules():
if (isinstance(child, VocabParallelEmbedding)
and child.weight.shape[0] <
child.num_embeddings_per_partition):
new_weight = torch.empty(child.num_embeddings_per_partition,
child.embedding_dim,
dtype=child.weight.dtype,
device=child.weight.device)
new_weight[:child.weight.shape[0]].copy_(child.weight.data)
new_weight[child.weight.shape[0]:].fill_(0)
child.weight.data = new_weight
This could theoretically lead to memory fragmentation but empirically I don't see a difference in the number of available GPU blocks.
We also need to modify TensorizerAgent to take in kwargs to pass to the model class.
Then the model loading code can become
with _set_default_torch_dtype(model_config.dtype):
# Create a model instance.
# The weights will be initialized as empty tensors.
extra_kwargs = {}
if hasattr(model_class, "supported_lora_modules"):
extra_kwargs["lora_config"] = lora_config
elif lora_config:
raise ValueError(
f"Model {model_class.__name__} does not support LoRA, "
"but LoRA is enabled. Support for this model may "
"be added in the future. If this is important to you, "
"please open an issue on github.")
elif model_class in _VISION_MODEL_CLASSES:
extra_kwargs["vision_language_config"] = vision_language_config
with torch.device(device_config.device):
if model_config.load_format == "tensorizer" and _is_vllm_model(
model_config):
model = load_with_tensorizer(model_class,
model_config,
linear_method=linear_method,
**extra_kwargs)
return model.eval()
model = model_class(config=model_config.hf_config,
linear_method=linear_method,
**extra_kwargs)
Wonderful! I've attempted to implement those changes here. Let me know if I got that wrong. It seems to work with a vLLM model using examples/multilora_inference.py!
Let me think about this!
@Yard1 I have an initial refactor for you using `TensorizerConfig` that makes things less hacky. `EngineArgs.from_cli_args` is back to normal. Eager to hear what you think.
Thanks, this is looking really good! Now that #3977 is merged can you merge master and remove the workaround?
Merged master and removed the workaround!
Added several more docs changes and tests -- including integration tests for deserializing from s3, binding LoRA adapters to a vLLM model and testing the example script serializing and deserializing successfully, and running the OpenAI api server with tensorizer, all of which are passing on my side.
Added additional tests and some fixes; checks are all passing!
@ywang96 @Yard1 @rkooo567
Thank you all very much for your reviews! I've implemented the changes from @ywang96 's comments. To summarize:
- An error is raised if the tensor parallel size exceeds 1 when attempting to use Tensorizer (test added; see the sketch after this list)
- The serialization step in `examples/tensorize_vllm_model.py` now instantiates the model to serialize using `LLMEngine`
- Meta tensors found when deserializing will raise an error
- Removed forcing `float16` from the parser for `examples/tensorize_vllm_model.py`
- Additionally added a `PerformanceWarning` when trying to load a tensorized model with quantization, as that is a bit unstable at the moment (I may look into this in another PR) (test added)
- Added the Tensorizer testing folder for the CI suite
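As an illustration of the first bullet, the guard is conceptually something like this (the config attribute names here are assumptions; the actual check in the PR may live elsewhere):

# Conceptual sketch of the tensor-parallel guard; attribute names are assumptions.
if parallel_config.tensor_parallel_size > 1 and model_config.load_format == "tensorizer":
    raise ValueError(
        "Loading a model with tensorizer is not currently supported "
        "when tensor_parallel_size > 1.")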
Some minor fixes to ensure the testing suite can run the tensorizer tests. All passing! Thanks very much again for the reviews @rkooo567 @Yard1 @ywang96; let me know if anything else is needed! :)
I was able to successfully test this in a vLLM 0.4.1 container running on OpenShift, both with models serialized with the Tensorizer library directly and for vLLM-serialized models. Once I cranked up the Pod's CPU and increased the num_readers parameter, I got about an 8x speedup in my case when loading the same model via vLLM-serialized tensorize files compared to not using tensorizer at all and just downloading the safetensors from S3 to local disk then loading with vLLM. This took my overall cold start time of this Pod from a bit over 4 minutes to 30 seconds. There may be even more performance available in my setup with additional tweaking, but this is already a great win.
INFO 05-03 11:11:38 tensorizer.py:337] Deserialized 14.5 GB in 15.21s, 953.1 MB/s
That's an awesome improvement, and thank you!
I'm thrilled to hear that! I actually have a new PR up, #4208, that uses the full 2.9.0 release, has better usage documentation, and automates inferring a vLLM-serialized model.