nm-vllm
Prototype FP8Linear W8A8 runtime quantization
Adds runtime FP8 quantization of both weights and activations using torch.float8_e4m3fn.
torch._scaled_mm provides a W8A8 linear kernel for FP8, but in torch==2.2.1 it is only supported on CUDA devices with compute capability >= 9.0. On older devices the call fails with:
RuntimeError: torch._scaled_mm is only supported on devices with compute capability >= 9.0)
Support has been expanded to CUDA compute capability 8.9 and ROCm MI300+ on PyTorch main, but that won't reach a stable release for a while.
This means that on CUDA devices with compute capability < 9.0 (currently everything below Hopper), the weights are dequantized to the original higher-precision dtype before the matmul, so there are no compute savings. Only the memory savings remain.
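The hardware gate described above can be sketched as a small capability check. This is only an illustration, not the code in this PR: the names `MIN_FP8_CAPABILITY`, `has_native_fp8`, and `pick_linear_path` are hypothetical, and the (9, 0) threshold is taken from the torch==2.2.1 error message quoted above.

```python
from typing import Tuple

# Assumption: minimum compute capability for torch._scaled_mm in torch==2.2.1,
# as stated in the RuntimeError quoted above.
MIN_FP8_CAPABILITY: Tuple[int, int] = (9, 0)


def has_native_fp8(
    capability: Tuple[int, int],
    minimum: Tuple[int, int] = MIN_FP8_CAPABILITY,
) -> bool:
    """Return True if a (major, minor) capability meets the FP8 minimum."""
    # Tuples compare lexicographically, so (8, 9) < (9, 0) < (9, 1).
    return tuple(capability) >= tuple(minimum)


def pick_linear_path(capability: Tuple[int, int]) -> str:
    """Hypothetical dispatch: native FP8 matmul or dequantize-then-matmul."""
    return "scaled_mm" if has_native_fp8(capability) else "dequantize"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:  # torch is optional for this sketch
        torch = None

    if torch is not None and torch.cuda.is_available():
        # get_device_capability() returns (major, minor) for the current device.
        cap = torch.cuda.get_device_capability()
        print(cap, pick_linear_path(cap))
    else:
        print("No CUDA device visible; nothing to check.")
```

Passing `minimum=(8, 9)` would model the relaxed requirement on PyTorch main.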
Original precision bfloat16:
from vllm import LLM, SamplingParams
model = LLM("teknium/OpenHermes-2.5-Mistral-7B", enforce_eager=True)
sampling_params = SamplingParams(max_tokens=100, temperature=0)
outputs = model.generate("Once upon a time, ", sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
"""
INFO 04-15 18:29:45 model_runner.py:166] Loading model weights took 13.4976 GB
10 years ago, I was a young, naive, and inexperienced 20-year-old. I had just graduated from college and was about to embark on my first job as a software engineer. I was excited, nervous, and scared all at the same time. I had no idea what to expect, but I was ready to take on the world.
"""
Quantized to FP8, specifically float8_e4m3fn:
from vllm import LLM, SamplingParams
model = LLM("teknium/OpenHermes-2.5-Mistral-7B", enforce_eager=True, quantization="fp8")
sampling_params = SamplingParams(max_tokens=100, temperature=0)
outputs = model.generate("Once upon a time, ", sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
"""
INFO 04-15 18:34:48 model_runner.py:166] Loading model weights took 6.9976 GB
WARNING 04-15 18:34:48 fp8.py:20] FP8 hardware support doesn't exist for NVIDIA SM < 9.0. Up-conversion to original dtype will be used.
10 years ago, I was a young, naive, and inexperienced 20-year-old. I had just graduated from college and was about to embark on a new journey in my life. I was about to start my first job as a software engineer.
I was excited and nervous at the same time. I had never worked in a professional environment before, and I didn’t know what to expect. I had heard stories of long hours, difficult
"""