
Prototype FP8Linear W8A8 runtime quantization

mgoin opened this issue 2 years ago · 0 comments

Adds FP8 quantization at runtime for both weights and activations using torch.float8_e4m3fn.

torch._scaled_mm provides a W8A8 linear kernel for FP8, but it is only supported on CUDA devices with compute capability >= 9.0 as of torch==2.2.1:

```
RuntimeError: torch._scaled_mm is only supported on devices with compute capability >= 9.0
```

Support has been extended to CUDA compute capability 8.9 and to ROCm MI300+ on PyTorch main, but won't land in a stable release for a while.
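The version gate described above can be expressed as a small helper. This is a sketch, not vLLM's actual code; supports_fp8_scaled_mm is a hypothetical name.

```python
# Hedged sketch of the capability check implied above: on torch==2.2.1,
# torch._scaled_mm requires CUDA compute capability >= 9.0 (Hopper);
# SM 8.9 and ROCm MI300+ support exists only on PyTorch main.
def supports_fp8_scaled_mm(capability: tuple) -> bool:
    """Return True if torch._scaled_mm is expected to work on this device."""
    return tuple(capability) >= (9, 0)

# Usage on a CUDA machine would look roughly like:
#   import torch
#   supports_fp8_scaled_mm(torch.cuda.get_device_capability())
```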

This means that on CUDA devices with compute capability < 9.0 (currently everything below Hopper), the weights are dequantized back to a higher-precision dtype at runtime, which saves memory but offers no compute savings.

Original precision bfloat16:

```python
from vllm import LLM, SamplingParams

model = LLM("teknium/OpenHermes-2.5-Mistral-7B", enforce_eager=True)
sampling_params = SamplingParams(max_tokens=100, temperature=0)
outputs = model.generate("Once upon a time, ", sampling_params=sampling_params)
print(outputs[0].outputs[0].text)

"""
INFO 04-15 18:29:45 model_runner.py:166] Loading model weights took 13.4976 GB

10 years ago, I was a young, naive, and inexperienced 20-year-old. I had just graduated from college and was about to embark on my first job as a software engineer. I was excited, nervous, and scared all at the same time. I had no idea what to expect, but I was ready to take on the world.
"""
```

Quantized to FP8, specifically float8_e4m3fn:

```python
from vllm import LLM, SamplingParams

model = LLM("teknium/OpenHermes-2.5-Mistral-7B", enforce_eager=True, quantization="fp8")
sampling_params = SamplingParams(max_tokens=100, temperature=0)
outputs = model.generate("Once upon a time, ", sampling_params=sampling_params)
print(outputs[0].outputs[0].text)

"""
INFO 04-15 18:34:48 model_runner.py:166] Loading model weights took 6.9976 GB
WARNING 04-15 18:34:48 fp8.py:20] FP8 hardware support doesn't exist for NVIDIA SM < 9.0. Up-conversion to original dtype will be used.

10 years ago, I was a young, naive, and inexperienced 20-year-old. I had just graduated from college and was about to embark on a new journey in my life. I was about to start my first job as a software engineer.

I was excited and nervous at the same time. I had never worked in a professional environment before, and I didn't know what to expect. I had heard stories of long hours, difficult
"""
```

mgoin · Apr 15 '24 16:04