Initial `CompressedTensors` config + Activation Quantization support …
Summary
- Initial implementation of `CompressedTensors` config support + activation quantization for static per-tensor W8A8 (a numeric sketch of the scheme follows this list)
- Includes fused kernels added by @varun-sundar-rabindranath
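
For context, a minimal numeric sketch of what static per-tensor W8A8 quantization means (illustrative only; the helper name and calibration step here are assumptions, not this PR's fused-kernel API):

```python
import torch

def static_per_tensor_quantize(x: torch.Tensor, scale: float) -> torch.Tensor:
    # Symmetric static quantization: one precomputed scale for the whole
    # tensor; values are rounded and clamped to the int8 range.
    return torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)

# "Static" means the scale is fixed ahead of time by calibration,
# e.g. from the max absolute activation value seen on sample data.
activations = torch.randn(4, 8)
scale = activations.abs().max().item() / 127.0

q = static_per_tensor_quantize(activations, scale)
dq = q.to(torch.float32) * scale  # dequantize to check round-trip error
print((activations - dq).abs().max())
```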
Testing/Sample Script:
```python
from vllm import LLM, SamplingParams
import torch

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The US president is",
    "The future of AI is",
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.80, top_p=0.95)

# Create an LLM.
llm = LLM(
    model="nm-testing/tinyllama-one-shot-static-quant-test",
    enforce_eager=True,
    dtype=torch.float32,
    quantization="sparseml",
)
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
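
For reference, a loose sketch of the kind of quantization section such a checkpoint's config might carry, written as a Python dict (key names are assumptions modeled on compressed-tensors' `config_groups`/`targets`/`ignore` layout, not a confirmed schema):

```python
# Hypothetical sketch only; field names are assumptions, not the
# finalized compressed-tensors schema.
quantization_config = {
    "quant_method": "sparseml",
    "config_groups": {
        "group_0": {
            # Static per-tensor W8A8: 8-bit weights and input activations,
            # one scale per tensor.
            "weights": {"num_bits": 8, "symmetric": True, "strategy": "tensor"},
            "input_activations": {"num_bits": 8, "symmetric": True, "strategy": "tensor"},
            "targets": ["Linear"],
        }
    },
    "ignore": ["lm_head"],
}
```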
Next Steps:
- Verification of the different inputs expected for `targets` and `ignore` --> use functions to parse the layer names that can be shared by SparseML and vLLM; these would live in compressed-tensors (https://github.com/neuralmagic/compressed-tensors/blob/67005d76107d4659787f1efd53fe7e6b1d192818/src/compressed_tensors/quantization/lifecycle/apply.py#L86). A sketch of the matching logic follows below.
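
A minimal sketch of what that shared parsing could look like, assuming `targets`/`ignore` entries are either module class names or `re:`-prefixed regexes over the fully qualified module name (the helper name and signature are hypothetical):

```python
import re
from typing import Iterable, Optional

def match_layer_name(name: str, module_type: str, targets: Iterable[str]) -> Optional[str]:
    # Hypothetical helper: resolve a module against a targets/ignore list,
    # matching either a "re:"-prefixed regex on the fully qualified module
    # name or an exact class-name entry.
    for target in targets:
        if target.startswith("re:"):
            if re.match(target[3:], name):
                return target
        elif module_type == target:
            return target
    return None

# Usage: check `ignore` first, then `targets`, for each module in the model.
assert match_layer_name("model.layers.0.self_attn.q_proj", "Linear", ["Linear"]) == "Linear"
assert match_layer_name("lm_head", "Linear", ["re:.*lm_head"]) == "re:.*lm_head"
```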