[Frontend] [Core] feat: Add model loading using `tensorizer`
Tensorizer Support
This PR allows models used for the OpenAI-compatible API server to be loaded using Coreweave's Tensorizer, enabling extremely fast (faster than cached Safetensors) model loads from HTTP/HTTPS, Redis, and S3 endpoints.
The key changes involved are:
- Lists `tensorizer>=2.8.1` as a runtime dependency in `requirements.txt`.
- Adds `tensorizer_loader.py` to `vllm/model_executor`, providing utility functions for tensorizer.
- Adds multiple args to vLLM's OpenAI inference service entrypoint that allow the user to specify the path to serialized-by-tensorizer model tensors, as well as arguments for tensorizer's deserializer.
- Allows deserialization of serialized model tensors in HuggingFace's model format, as well as deserialization of serialized vLLM-formatted models. The latter enables loading with `plaid_mode`, which can allow a non-locally stored Llama 2 13B to start serving requests in as little as 10 seconds. Encrypting and decrypting model tensors is also supported.
- Adds a `tensorize_vllm_model.py` script to `examples/` that allows vLLM models to be serialized and deserialized with `tensorizer`.
Credentialing for S3 is supported by passing a user's access key and secret key via the `S3_ACCESS_KEY_ID` and `S3_SECRET_ACCESS_KEY` environment variables, respectively. They can also be specified as CLI args to the API server entrypoint.
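For example, a minimal sketch of setting these credentials programmatically before launching the server (the environment variable names are those listed above; the key values are placeholders):

import os

# Environment variables used by the tensorizer loading path for s3:// URIs
# (names taken from the PR description; values are placeholders)
os.environ["S3_ACCESS_KEY_ID"] = "<your-access-key-id>"
os.environ["S3_SECRET_ACCESS_KEY"] = "<your-secret-access-key>"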
Model loading benchmarks
Tensorizer can load models like Llama 2 13B in as little as 10 seconds. To do so, a model must be serialized using TensorSerializer to a `.tensors` file located either locally or at an S3, HTTP/HTTPS, or Redis endpoint. `--tensorizer-uri` must be specified with the location of the serialized tensors when invoking the API server.
Example usage:
python -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 \
--model EleutherAI/pythia-6.9b \
--load-format tensorizer \
--tensorizer-uri s3://tensorized/EleutherAI/pythia-6.9b/fp16/model.tensors
If a vLLM model is serialized, `plaid_mode` can be used, which loads much faster. The following plot demonstrates model loading time benchmarks for vLLM's OpenAI-compatible inference server on an Nvidia A40 GPU.
Tensorizer is so fast that it loads models faster than Safetensors even locally.
@cadedaniel @rkooo567 Pinging to request an assigned reviewer from the team when possible!
@cadedaniel @rkooo567 @Yard1 @WoosukKwon @zhuohan123 @ywang96
All tests are passing. Can I get eyes on this please? Cheers!
Hello! Just pinging the review team for an assigned reviewer. This would be a hugely impactful feature for vLLM users, especially those who prioritize responsiveness during inference, tensor encryption during loading, or extremely fast model loading via HTTP! Please refer to my benchmarks here and the feature request I made for more on that.
@cadedaniel @rkooo567 @Yard1 @WoosukKwon @zhuohan123 @ywang96 @esmeetu @hmellor @liangfu @LiuXiaoxuanPKU @simon-mo
EDIT: I see this has been mentioned as part of vLLM's feature roadmap! Apologies for the pings and delighted to see it on the roadmap!
Hey @sangstar ! Thank you for the contribution - I will take a look at this PR next week and test it out!
Out of curiosity, how is support for tensor-parallelism-sharded models looking with tensorizer?
I see it's possible but we'd need some extra code to support different paths depending on the shard rank. Maybe we could pass a list of paths (assert n == world_size) and have each rank grab the corresponding path?
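Something along those lines, as a rough sketch (the list-of-URIs argument and helper function here are hypothetical, not part of this PR):

def select_shard_uri(tensorizer_uris: list[str], rank: int,
                     world_size: int) -> str:
    # One serialized .tensors file per tensor-parallel rank; each rank
    # loads only the shard at its own index.
    assert len(tensorizer_uris) == world_size, (
        "Expected one tensorizer URI per tensor-parallel rank")
    return tensorizer_uris[rank]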
Also, I tested this using the following workflow:
- Use the provided script to generate local files for `meta-llama/Llama-2-7b-chat-hf`
- Modify `examples/offline_inference.py` with the following:
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf",
tensor_parallel_size=1,
load_format="tensorizer",
tensorizer_args=TensorizerArgs(
tensorizer_uri="/PATH/TO/vllm/model.tensors",
num_readers=1,
),
trust_remote_code=True)
And I am getting gibberish output. Any idea what's wrong? EDIT: looks to be related to RotaryEmbedding
We should definitely test this code path, and ensure weights and outputs are identical between safetensors and tensorizer.
Out of curiosity, how is support for tensor-parallelism-sharded models looking with tensorizer?
Tensorizer currently doesn't support tensor-parallelism-sharded models, but tensors can be loaded onto one GPU and then transferred device-to-device if there is space.
And I am getting gibberish output. Any idea what's wrong?
We should definitely test this code path, and ensure weights are identical between safetensors and tensorizer.
On my side, my output doesn't seem completely nonsensical, and this is with a max context length altered to be lower:
Prompt: 'Hello, my name is', Generated text: ' Dustin Nelson and I’m going to be your tutor!\n'
Prompt: 'The president of the United States is', Generated text: ' a member of an exclusive club.\nDonald Trump is not the only person to'
Prompt: 'The capital of France is', Generated text: ' one of the most visited cities in Europe, and its location makes it a great'
Prompt: 'The future of AI is', Generated text: ' all about interoperability\nHere’s why developers and businesses will start'
Perhaps it has to do with your serialized model tensors? Make sure to distinguish between the working definitions of a vLLM-formatted model and an HF-formatted model here. If you're trying to deserialize a vLLM model, make sure that its `tensorizer_uri` has "vllm" in it somewhere. Otherwise, if you're trying to deserialize an HF model, it shouldn't have "vllm" in the `tensorizer_uri` (I understand that this is a bit hacky, as per your comment, so I'll change that).
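For reference, a minimal sketch of that heuristic as currently described (the exact signature of `_is_vllm_model` in this PR may differ):

def _is_vllm_model(tensorizer_uri: str) -> bool:
    # Hacky heuristic described above: a URI containing "vllm" is treated as a
    # vLLM-formatted serialized model; anything else is treated as HF-format.
    return "vllm" in tensorizer_uri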
Please note that the model also needs to be serialized beforehand. My example script shows how to serialize a vLLM-formatted model. Serializing a model from the HF model hub is also straightforward, as pulled from the README.md for Tensorizer:
import torch
from tensorizer import TensorSerializer
from transformers import AutoModelForCausalLM
model_ref = "EleutherAI/gpt-j-6B"
# For less intensive requirements, swap above with the line below:
# model_ref = "EleutherAI/gpt-neo-125M"
model_name = model_ref.split("/")[-1]
# Change this to your S3 bucket.
s3_bucket = "bucket"
s3_uri = f"s3://{s3_bucket}/{model_name}.tensors"
model = AutoModelForCausalLM.from_pretrained(
model_ref,
revision="float16",
torch_dtype=torch.float16,
low_cpu_mem_usage=True,
)
serializer = TensorSerializer(s3_uri)
serializer.write_module(model)
serializer.close()
Are you able to provide more context into how you ran into this problem? I'm also happy to write a test confirming that the tensors aren't altered, although I'm not sure how painful that will be for the testing suite, as serializing is a naturally longer operation than deserializing.
Edit:
looks to be related to RotaryEmbedding
Ah okay, I'll assume that was the issue then
@sangstar I ran
python tensorize_vllm_model.py --model meta-llama/Llama-2-7b-chat-hf serialize --serialized-directory /PATH/TO/vllm and then used the created on-disk file to load the model with
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf",
tensor_parallel_size=1,
load_format="tensorizer",
tensorizer_args=TensorizerArgs(
tensorizer_uri="/PATH/TO/vllm/model.tensors",
num_readers=1,
),
trust_remote_code=True)
This is what I am getting as output:
Prompt: 'Hello, my name is', Generated text: ''
Prompt: 'The president of the United States is', Generated text: 'ajeguskyExcelристи upload Richmondegu Microsoft Hier Excel Nilimen arr intent'
Prompt: 'The capital of France is', Generated text: ' Offiznofollow Krit KritConfigurationonnenigeristiquedefin awkbibliothek Convertweenстваording'
Prompt: 'The future of AI is', Generated text: ''
For comparison, this is what I get without tensorizer:
Prompt: 'Hello, my name is', Generated text: " Sherry and I'm a 35-year-old woman from"
Prompt: 'The president of the United States is', Generated text: ' a member of Congress, which means that he or she is subject to the same'
Prompt: 'The capital of France is', Generated text: ' Paris. This is a fact that is well known and widely accepted. However,'
Prompt: 'The future of AI is', Generated text: ' likely to be shaped by a combination of technological advancements, soci'
I hardcoded `_is_vllm_model` to always return `True`. That was the only change to the code on my side.
Also, digging a little deeper into it, I am not sure about RotaryEmbedding anymore. I thought that was the cause, since it is generated when the model class is initialized, but it looks like whatever is loaded should be identical. I compared the weight tensors after loading with tensorizer and without, and the model looks to be identical, yet the outputs are different. Very weird...
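For reference, a comparison along these lines can be done with a generic helper like the following (a sketch, not the exact code used here):

import torch

def state_dicts_match(model_a: torch.nn.Module,
                      model_b: torch.nn.Module) -> bool:
    # Compare the two models' parameters tensor-by-tensor.
    sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
    if sd_a.keys() != sd_b.keys():
        return False
    return all(torch.equal(sd_a[k].cpu(), sd_b[k].cpu()) for k in sd_a)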
I am getting gibberish with EleutherAI/gpt-j-6B as well. Let me know if there is any way I can help.
I am running unmodified code from this PR in a fresh environment on a g5.12xlarge instance.
I figured it out; it's an issue with the `linear_weights` attribute present on all of vLLM's linear layers. I will make a separate PR to fix it. The workaround for this PR is to add
for child in self.model.modules():
if hasattr(child, "linear_weights"):
for name, weight in child.linear_weights.items():
if isinstance(weight, torch.Tensor):
child.linear_weights[name] = getattr(child, name)
in `TensorizerAgent.deserialize()` after deserializing the weights.
Let's add e2e tests with a small model (like opt 125m) that ensure the output is exactly the same with and without tensorizer.
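Something like the following, as a sketch (assuming the `TensorizerArgs`/`load_format` interface used earlier in this thread; the import path, file path, and sampling settings are placeholders):

from vllm import LLM, SamplingParams
# Import path for TensorizerArgs is an assumption; it is introduced by this PR.
from vllm.model_executor.tensorizer_loader import TensorizerArgs

def test_tensorizer_outputs_match():
    prompts = ["Hello, my name is", "The capital of France is"]
    # Greedy sampling so both runs are deterministic and comparable.
    sampling = SamplingParams(temperature=0.0, max_tokens=16)

    reference = LLM(model="facebook/opt-125m")
    ref_out = [o.outputs[0].text for o in reference.generate(prompts, sampling)]

    tensorized = LLM(model="facebook/opt-125m",
                     load_format="tensorizer",
                     tensorizer_args=TensorizerArgs(
                         tensorizer_uri="/PATH/TO/opt-125m/model.tensors"))
    tz_out = [o.outputs[0].text for o in tensorized.generate(prompts, sampling)]

    assert ref_out == tz_out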
I was working on this too for a substantive reply. Thanks for that great catch! I'll add a test for this in this PR as you mentioned, as well as address your other unresolved comments.
I've opened https://github.com/vllm-project/vllm/pull/3977 and I confirmed that applying it on top of this PR fixes the issue.
@sangstar We can easily support LoRA with the following workaround in TensorizerAgent:
def _resize_lora_embeddings(self):
"""Modify LoRA embedding layers to use bigger tensors
to allow for adapter added tokens."""
for child in self.model.modules():
if (isinstance(child, VocabParallelEmbedding)
and child.weight.shape[0] <
child.num_embeddings_per_partition):
new_weight = torch.empty(child.num_embeddings_per_partition,
child.embedding_dim,
dtype=child.weight.dtype,
device=child.weight.device)
new_weight[:child.weight.shape[0]].copy_(child.weight.data)
new_weight[child.weight.shape[0]:].fill_(0)
child.weight.data = new_weight
This could theoretically lead to memory fragmentation but empirically I don't see a difference in the number of available GPU blocks.
We also need to modify TensorizerAgent to take in kwargs to pass to the model class.
Then the model loading code can become
with _set_default_torch_dtype(model_config.dtype):
# Create a model instance.
# The weights will be initialized as empty tensors.
extra_kwargs = {}
if hasattr(model_class, "supported_lora_modules"):
extra_kwargs["lora_config"] = lora_config
elif lora_config:
raise ValueError(
f"Model {model_class.__name__} does not support LoRA, "
"but LoRA is enabled. Support for this model may "
"be added in the future. If this is important to you, "
"please open an issue on github.")
elif model_class in _VISION_MODEL_CLASSES:
extra_kwargs["vision_language_config"] = vision_language_config
with torch.device(device_config.device):
if model_config.load_format == "tensorizer" and _is_vllm_model(
model_config):
model = load_with_tensorizer(model_class,
model_config,
linear_method=linear_method,
**extra_kwargs)
return model.eval()
model = model_class(config=model_config.hf_config,
linear_method=linear_method,
**extra_kwargs)
Wonderful! I've attempted to implement those changes here. Let me know if I got that wrong. It seems to work with a vLLM model using examples/multilora_inference.py!
Let me think about this!
@Yard1 I have an initial refactor for you using `TensorizerConfig` that makes things less hacky. `EngineArgs.from_cli_args` is back to normal. Eager to hear what you think.
Thanks, this is looking really good! Now that #3977 is merged can you merge master and remove the workaround?
Merged master and removed the workaround!
Added several more docs changes and tests -- including integration tests for deserializing from s3, binding LoRA adapters to a vLLM model and testing the example script serializing and deserializing successfully, and running the OpenAI api server with tensorizer, all of which are passing on my side.
Added additional tests and some fixes; checks are all passing!
@ywang96 @Yard1 @rkooo567
Thank you all very much for your reviews! I've implemented the changes from @ywang96 's comments. To summarize:
- An error is raised if the tensor parallel size exceeds 1 when attempting to use Tensorizer (test added; see the sketch after this list)
- The serialization step in `examples/tensorize_vllm_model.py` now instantiates the model to serialize using `LLMEngine`
- Meta tensors found when deserializing will raise an error
- Removed forcing `float16` from the parser for `examples/tensorize_vllm_model.py`
- Additionally added a `PerformanceWarning` when trying to load a tensorized model with quantization, as that is a bit unstable at the moment (I may look into this in another PR) (test added)
- Added the Tensorizer testing folder for the CI suite
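As an illustration of the first bullet, the guard is conceptually something like this (the config attribute names here are assumptions; the actual check in the PR may live elsewhere):

# Conceptual sketch of the tensor-parallel guard; attribute names are assumptions.
if parallel_config.tensor_parallel_size > 1 and model_config.load_format == "tensorizer":
    raise ValueError(
        "Loading a model with tensorizer is not currently supported "
        "when tensor_parallel_size > 1.")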
Some minor fixes to ensure the testing suite can run the tensorizer tests. All passing! Thanks very much again for the reviews @rkooo567 @Yard1 @ywang96; let me know if anything else is needed! :)
I was able to successfully test this in a vLLM 0.4.1 container running on OpenShift, both with models serialized with the Tensorizer library directly and for vLLM-serialized models. Once I cranked up the Pod's CPU and increased the num_readers parameter, I got about an 8x speedup in my case when loading the same model via vLLM-serialized tensorize files compared to not using tensorizer at all and just downloading the safetensors from S3 to local disk then loading with vLLM. This took my overall cold start time of this Pod from a bit over 4 minutes to 30 seconds. There may be even more performance available in my setup with additional tweaking, but this is already a great win.
INFO 05-03 11:11:38 tensorizer.py:337] Deserialized 14.5 GB in 15.21s, 953.1 MB/s
That's an awesome improvement, and thank you!
I'm thrilled to hear that! I actually have a new PR up, #4208, that uses the full 2.9.0 release, has better usage documentation, and automates inferring a vLLM-serialized model.