
Support for `compressed-tensors`

dbogunowicz opened this issue 1 year ago · 2 comments

The goal of this PR is to support loading weights from the compressed safetensors representation. This representation was introduced by Neural Magic and implemented by @Satrat.

dbogunowicz · Apr 02 '24

Really like how the .decompress() function looks. It makes the interface really clean.
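For illustration, here's a minimal sketch of the kind of round-trip interface this describes. The class name, storage layout (values plus a boolean bitmask), and method signatures are hypothetical stand-ins, not the actual compressed-tensors API:

import torch

# Hypothetical bitmask codec, illustrative only; the real
# compressed-tensors format and API will differ.
class BitmaskCompressor:
    def compress(self, dense: torch.Tensor) -> dict:
        # Keep only the nonzero values plus a boolean mask of their positions.
        mask = dense != 0
        return {"values": dense[mask], "bitmask": mask, "shape": dense.shape}

    def decompress(self, compressed: dict) -> torch.Tensor:
        # Scatter the stored values back into a dense tensor of zeros.
        dense = torch.zeros(compressed["shape"], dtype=compressed["values"].dtype)
        dense[compressed["bitmask"]] = compressed["values"]
        return dense

# Round trip: a roughly 50%-sparse weight survives compress -> decompress exactly.
codec = BitmaskCompressor()
weight = torch.randn(4, 4) * (torch.rand(4, 4) > 0.5)
assert torch.equal(codec.decompress(codec.compress(weight)), weight)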

@dbogunowicz one additional thing that needs to be done:

Right now, the user has to manually specify that the sparse kernels should be used:

from vllm import LLM

# loads as sparse
model = LLM("/path/to/sparse/model", sparsity="sparse_w16a16")

# loads as dense
model = LLM("/path/to/sparse/model")

Ideally, we should detect from the model config that the model is sparse and load it with the sparse kernels automatically:

from vllm import LLM

# loads as sparse
model = LLM("/path/to/sparse/model")

This is how things already work for quantization. I left a placeholder for this logic here when I originally integrated the sparse kernels.
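As a rough sketch of what that placeholder could become, the loader could peek at the checkpoint's config.json before choosing kernels. The sparsity_config key and its schema below are assumptions about how the compressed checkpoint might be serialized, not confirmed fields:

import json
import os
from typing import Optional

def detect_sparsity(model_path: str) -> Optional[str]:
    # Assumed layout: the checkpoint's config.json carries a
    # "sparsity_config" entry describing the compression format.
    config_file = os.path.join(model_path, "config.json")
    if not os.path.isfile(config_file):
        return None
    with open(config_file) as f:
        config = json.load(f)
    sparsity_config = config.get("sparsity_config")
    if sparsity_config is None:
        return None
    # Map the stored format name onto a kernel choice, defaulting to
    # the existing sparse_w16a16 kernels.
    return sparsity_config.get("format", "sparse_w16a16")

LLM(...) could then fall back to detect_sparsity(model_path) whenever the user passes no explicit sparsity argument, mirroring the quantization flow.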

Can you add this?

robertgshaw2-redhat · Apr 03 '24

Finally, please add some end-to-end tests that load the compressed model and run inference.

I would suggest the following format:

  • Take an existing small sparse model (neuralmagic/llama2.c-stories110M-pruned50)
  • Save a compressed version and push it up to nm-testing
  • Use the tests/models/test_model_logprobs.py format to compare the outputs of the compressed version against the existing uncompressed version (see the sketch after this list)
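A rough shape for that test follows. The nm-testing model ID is a placeholder for the compressed copy that still needs to be pushed, and in practice the two engines would be constructed via the repo's test fixtures rather than back to back in one process:

from vllm import LLM, SamplingParams

UNCOMPRESSED = "neuralmagic/llama2.c-stories110M-pruned50"
# Placeholder ID; the compressed copy still needs to be pushed to nm-testing.
COMPRESSED = "nm-testing/llama2.c-stories110M-pruned50-compressed"

def test_compressed_matches_uncompressed():
    prompts = ["Once upon a time"]
    # Greedy decoding so both models are deterministic and comparable.
    params = SamplingParams(temperature=0.0, max_tokens=32, logprobs=1)

    dense_out = LLM(UNCOMPRESSED).generate(prompts, params)[0]
    compressed_out = LLM(COMPRESSED).generate(prompts, params)[0]

    # If decompression round-trips the weights exactly, greedy outputs
    # (and their logprobs) should match token for token.
    assert dense_out.outputs[0].text == compressed_out.outputs[0].text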

robertgshaw2-redhat · Apr 03 '24