Support for `compressed-tensors`
The goal of this PR is to support loading weights from the compressed safetensors representation.
The compressed safetensors representation was introduced by Neural Magic and implemented by @Satrat.
Really like how the `.decompress()` function looks. It makes the interface really clean.
@dbogunowicz one additional thing that needs to be done:
Right now, the user has to specify manually that the sparse kernels should be used:
```python
from vllm import LLM

# loads as sparse
model = LLM("/path/to/sparse/model", sparsity="sparse_w16a16")

# loads as dense
model = LLM("/path/to/sparse/model")
```
Ideally, we should automatically detect from the config whether the model is sparse and, if so, load it with the sparse kernels:
```python
from vllm import LLM

# loads as sparse
model = LLM("/path/to/sparse/model")
```
This is how things work for quantization. I left a placeholder for this logic here when I originally integrated the sparse kernels.
Can you add this?
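For reference, something along these lines is what I have in mind, mirroring how quantization is auto-detected from the HF config. This is only a sketch: the `sparsity_config` attribute and its `format` key are placeholders for whatever compressed-tensors actually writes into `config.json`.

```python
from typing import Optional

# Formats handled by the sparse kernels; illustrative list only.
_SUPPORTED_SPARSITY = ("sparse_w16a16",)


def infer_sparsity(hf_config, user_sparsity: Optional[str]) -> Optional[str]:
    """Prefer an explicit `sparsity=...` argument, otherwise read the config.

    `sparsity_config` is a hypothetical attribute name -- the real key depends
    on how compressed-tensors serializes its metadata into config.json.
    """
    if user_sparsity is not None:
        return user_sparsity

    sparsity_config = getattr(hf_config, "sparsity_config", None)
    if sparsity_config is None:
        # Plain dense checkpoint: keep the default (dense) loading path.
        return None

    fmt = sparsity_config.get("format", "sparse_w16a16")
    if fmt not in _SUPPORTED_SPARSITY:
        raise ValueError(f"Unsupported sparsity format: {fmt}")
    return fmt
```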
Finally, please add some end-to-end tests that load the compressed model and run inference.
I would suggest the following format:
- Take an existing small sparse model (`neuralmagic/llama2.c-stories110M-pruned50`)
- Save a compressed version and push this model up to `nm-testing`
- Use the `tests/models/test_model_logprobs.py` format to compare the outputs of the existing uncompressed version to the compressed version (see the sketch below)
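A rough sketch of what that test could look like, written against the plain `LLM` API for clarity. The `nm-testing/...` model id is a placeholder until the compressed copy is actually pushed, and the real test should reuse the `tests/models/test_model_logprobs.py` fixtures and its logprob-closeness helper rather than the exact string comparison used here.

```python
import gc

from vllm import LLM, SamplingParams

DENSE_MODEL = "neuralmagic/llama2.c-stories110M-pruned50"
# Placeholder id until the compressed copy is pushed to nm-testing.
COMPRESSED_MODEL = "nm-testing/llama2.c-stories110M-pruned50-compressed"

PROMPTS = ["Once upon a time,", "The lion and the mouse"]


def _greedy_logprobs(model_id: str, **llm_kwargs):
    """Generate greedily and return (text, top-logprobs) per prompt."""
    llm = LLM(model_id, **llm_kwargs)
    params = SamplingParams(temperature=0.0, max_tokens=32, logprobs=5)
    outputs = llm.generate(PROMPTS, params)
    results = [(o.outputs[0].text, o.outputs[0].logprobs) for o in outputs]
    # Free the engine before loading the next model.
    del llm
    gc.collect()
    return results


def test_compressed_matches_uncompressed():
    dense = _greedy_logprobs(DENSE_MODEL, sparsity="sparse_w16a16")
    compressed = _greedy_logprobs(COMPRESSED_MODEL)  # sparsity auto-detected

    # Greedy generations should line up; the real test should use the
    # logprob-closeness check from tests/models/test_model_logprobs.py
    # instead of exact string equality.
    for (dense_text, _), (compressed_text, _) in zip(dense, compressed):
        assert dense_text == compressed_text
```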