
Initial `CompressedTensors` config + Activation Quantization support …

Open dsikka opened this issue 1 year ago • 0 comments

Summary

  • Initial implementation of `CompressedTensors` config support plus activation quantization for static per-tensor W8A8
  • Includes fused kernels added by @varun-sundar-rabindranath
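To illustrate what static per-tensor W8A8 means, here is a minimal sketch (not the fused kernels from this PR): both weights and activations are stored as int8, each with a single precomputed scale for the entire tensor. The scale derivation below stands in for real calibration.

```python
# Sketch of static per-tensor W8A8 quantization. In practice the activation
# scale is fixed ahead of time from a calibration set; here we derive both
# scales directly from the tensors for illustration.
import torch

def static_per_tensor_quantize(x: torch.Tensor, scale: float) -> torch.Tensor:
    # Symmetric int8 quantization with one fixed (static) scale per tensor.
    return torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)

w = torch.randn(64, 64)   # weight
a = torch.randn(8, 64)    # activation
w_scale = w.abs().max().item() / 127.0
a_scale = a.abs().max().item() / 127.0

w_q = static_per_tensor_quantize(w, w_scale)
a_q = static_per_tensor_quantize(a, a_scale)

# int8 matmul accumulated in int32, then dequantized with the scale product;
# a fused kernel would do this in one pass on the GPU.
out = (a_q.to(torch.int32) @ w_q.to(torch.int32).t()).float() * (a_scale * w_scale)
ref = a @ w.t()  # float reference; `out` should approximate this
```

Because a single scale covers the whole tensor, outliers in any channel widen the quantization step for every value, which is the usual trade-off of per-tensor versus per-channel schemes.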

Testing/Sample Script:

```python
from vllm import LLM, SamplingParams
import torch

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The US president is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.80, top_p=0.95)

# Create an LLM.
llm = LLM(
    model="nm-testing/tinyllama-one-shot-static-quant-test",
    enforce_eager=True,
    dtype=torch.float32,
    quantization="sparseml",
)

outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

Next Steps:

  • Verify the different inputs expected for `targets` and `ignore` --> use shared functions to parse the layer names, so they can be reused by both SparseML and vLLM; these would live in compressed-tensors (https://github.com/neuralmagic/compressed-tensors/blob/67005d76107d4659787f1efd53fe7e6b1d192818/src/compressed_tensors/quantization/lifecycle/apply.py#L86)

dsikka avatar Apr 30 '24 17:04 dsikka