
Support FP8 KV Cache

Open zhaoyang-star opened this issue 1 year ago • 12 comments

Quantizing the KV cache to fp8 reduces its memory usage and can therefore boost throughput. The implementation uses the fp8 data type for the KV cache and has been tested on V100 and A100.

The accuracy results below were measured with WizardCoder-34B; the throughput comparison uses LLaMA-7B.

| Dataset | Baseline (KV Cache FP16) | KV Cache FP8 E5M2 | KV Cache FP8 E4M3 |
|---|---|---|---|
| HumanEval-Python-EN | 68.293% | 65.854% (↓ 2.439%) | 67.683% (↓ 0.61%) |
| HumanEval-Python-CN | 59.146% | 59.146% (=) | 59.756% (↑ 0.61%) |

| LLaMA-7B | Baseline (KV Cache FP16) | KV Cache FP8 | Speedup |
|---|---|---|---|
| Offline throughput (tokens/sec) | 1514.35 | 2265.89 | 1.49x |
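
For intuition on where the gain comes from: halving the per-element size of the KV cache doubles the number of cache blocks that fit in GPU memory (e.g. 2802 -> 5605 blocks in the A100 logs below). A rough back-of-the-envelope sketch (mine, not part of the PR), assuming LLaMA-7B-like shapes:

    num_layers, num_heads, head_size = 32, 32, 128   # assumed LLaMA-7B-like shapes

    def kv_bytes_per_token(elem_bytes: int) -> int:
        # K and V, one entry per layer, head, and head dimension
        return 2 * num_layers * num_heads * head_size * elem_bytes

    fp16_bytes = kv_bytes_per_token(2)   # 524288 bytes = 512 KiB per token
    fp8_bytes = kv_bytes_per_token(1)    # 262144 bytes = 256 KiB per token
    print(fp16_bytes / fp8_bytes)        # 2.0 -> twice as many tokens fit in the same cache memory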

Usage:

    from vllm import LLM, SamplingParams
    # Sample prompts.
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    # Create a sampling params object.
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
    # Create an LLM.
    llm = LLM(model="facebook/opt-125m", kv_cache_dtype="fp8")
    # Generate texts from the prompts. The output is a list of RequestOutput objects
    # that contain the prompt, generated text, and other information.
    outputs = llm.generate(prompts, sampling_params)
    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
  • Throughput: Offline throughput increases because halving the KV cache element size doubles the number of cache blocks that fit in GPU memory. With enough concurrent online requests, online throughput improves as well.
  • Latency: It may increase the paged attention kernel latency because the cache has to be quantized/dequantized, especially with fp8 e4m3. So we use fp8 e5m2 as the default.
  • Accuracy: We used HumanEval to evaluate the impact of fp8 and found that both e5m2 and e4m3 are acceptable. In general, use e4m3 if you want higher accuracy, but be aware that e4m3 also increases latency, since it may cost more cycles than e5m2 when casting from fp16/bf16/float (a small round-trip sketch follows below).
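
For a quick feel of the e5m2 vs. e4m3 accuracy difference outside of HumanEval, here is a minimal round-trip sketch (my own illustration, not part of the PR), assuming a PyTorch build (2.1 or newer) that exposes torch.float8_e5m2 and torch.float8_e4m3fn:

    import torch

    x = torch.randn(4096).to(torch.float16)
    for fp8_dtype in (torch.float8_e5m2, torch.float8_e4m3fn):
        y = x.to(fp8_dtype).to(torch.float16)                     # quantize, then dequantize
        rel_err = ((x - y).abs() / x.abs().clamp_min(1e-3)).mean().item()
        print(fp8_dtype, f"mean relative round-trip error: {rel_err:.4f}")
    # e4m3 keeps one more mantissa bit, so its round-trip error is typically lower,
    # which matches the slightly better HumanEval scores above.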

zhaoyang-star avatar Dec 27 '23 02:12 zhaoyang-star

LGTM, I was wondering about the performance improvement. And can we run the fp8 intrinsic on Volta/Ampere/Ada arch or is it just Hopper only?

irasin avatar Dec 27 '23 03:12 irasin

I'd also like to know which of E5M2 and E4M3 we should use for better precision and performance. I guess this may depend on the specific model.

irasin avatar Dec 27 '23 09:12 irasin

This seriously looks good. Is RTN used for the kv-cache quantization?

casper-hansen avatar Dec 27 '23 11:12 casper-hansen

LGTM, I was wondering about the performance improvement. And can we run the fp8 intrinsic on Volta/Ampere/Ada arch or is it just Hopper only?

It is not limited to Hopper. Volta and Ampere both work and have been tested. The fp8 intrinsic uses a single ASM instruction for the data type conversion on Hopper, while it falls back to bit operations on pre-Hopper GPUs.

zhaoyang-star avatar Dec 29 '23 06:12 zhaoyang-star

RTN

Round-to-nearest (RTN) is not used in this implementation. It uses the CUDA fp8 intrinsics, such as __nv_cvt_fp8_to_halfraw and __nv_cvt_bfloat16raw_to_fp8. I think the CUDA fp8 intrinsics are more general than RTN since they are supported on both Hopper and pre-Hopper GPUs.
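
To illustrate why the pre-Hopper path can rely on bit operations for e5m2 (a toy illustration of the idea, not the PR's kernel; the real intrinsic rounds instead of truncating):

    import numpy as np

    x = np.array([3.1415926], dtype=np.float16)   # fp16: 1 sign, 5 exponent, 10 mantissa bits
    bits16 = x.view(np.uint16)
    e5m2 = (bits16 >> 8).astype(np.uint8)         # e5m2 shares the 5-bit exponent, so the high
                                                  # byte is already sign / exponent / 2-bit mantissa
    back = (e5m2.astype(np.uint16) << 8).view(np.float16)
    print(x[0], back[0])                          # ~3.14 -> 3.0 after dropping the low byte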

zhaoyang-star avatar Dec 29 '23 06:12 zhaoyang-star

The results below were measured on an A100-40GB:

Offline throughput:

[fp8_cache]root@50c663527862:/zy/github/remote/vllm# python3 benchmarks/benchmark_throughput.py --input-len 1024 --output-len 1024 --model /models/huggingface/LLM/llama-7B-hf/ --tokenizer /zy/llama-tokenizer/
Namespace(backend='vllm', dataset=None, dtype='auto', enforce_eager=False, hf_max_batch_size=None, input_len=1024, max_model_len=None, model='/models/huggingface/LLM/llama-7B-hf/', n=1, num_prompts=1000, output_len=1024, quantization=None, seed=0, tensor_parallel_size=1, tokenizer='/zy/llama-tokenizer/', trust_remote_code=False, use_beam_search=False)
INFO 12-29 05:45:54 llm_engine.py:74] Initializing an LLM engine with config: model='/models/huggingface/LLM/llama-7B-hf/', tokenizer='/zy/llama-tokenizer/', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, enforce_eager=False, kv_cache_dtype=None, seed=0)
INFO 12-29 05:46:12 llm_engine.py:230] # GPU blocks: 2802, # CPU blocks: 512
INFO 12-29 05:46:17 model_runner.py:403] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 12-29 05:46:17 model_runner.py:407] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode.
INFO 12-29 05:46:31 model_runner.py:449] Graph capturing finished in 14 secs.
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [22:08<00:00,  1.33s/it]
Throughput: 0.75 requests/s, 1541.35 tokens/s
[fp8_cache]root@50c663527862:/zy/github/remote/vllm# python3 benchmarks/benchmark_throughput.py --input-len 1024 --output-len 1024 --model /models/huggingface/LLM/llama-7B-hf/ --tokenizer /zy/llama-tokenizer/ --kv-cache-dtype="fp8"
Namespace(backend='vllm', dataset=None, dtype='auto', enforce_eager=False, hf_max_batch_size=None, input_len=1024, kv_cache_dtype='fp8', max_model_len=None, model='/models/huggingface/LLM/llama-7B-hf/', n=1, num_prompts=1000, output_len=1024, quantization=None, seed=0, tensor_parallel_size=1, tokenizer='/zy/llama-tokenizer/', trust_remote_code=False, use_beam_search=False)
INFO 12-29 06:16:00 llm_engine.py:74] Initializing an LLM engine with config: model='/models/huggingface/LLM/llama-7B-hf/', tokenizer='/zy/llama-tokenizer/', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, enforce_eager=False, kv_cache_dtype=torch.uint8, seed=0)
INFO 12-29 06:16:13 llm_engine.py:230] # GPU blocks: 5605, # CPU blocks: 1024
INFO 12-29 06:16:21 model_runner.py:403] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 12-29 06:16:21 model_runner.py:407] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode.
INFO 12-29 06:16:41 model_runner.py:449] Graph capturing finished in 20 secs.
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [15:03<00:00,  1.11it/s]
Throughput: 1.11 requests/s, 2265.89 tokens/s

Latency:

[fp8_cache]root@50c663527862:/zy/github/remote/vllm# python3 benchmarks/benchmark_latency.py --input-len 1024 --output-len 1024 --model /shared/models/huggingface/LLM/llama-7B-hf/ --tokenizer /zy/llama-tokenizer/
Namespace(batch_size=8, dtype='auto', enforce_eager=False, input_len=1024, kv_cache_dtype=None, model='/shared/models/huggingface/LLM/llama-7B-hf/', n=1, num_iters=3, output_len=1024, profile=False, profile_result_dir=None, quantization=None, tensor_parallel_size=1, tokenizer='/zy/llama-tokenizer/', trust_remote_code=False, use_beam_search=False)
INFO 12-29 07:01:41 llm_engine.py:74] Initializing an LLM engine with config: model='/shared/models/huggingface/LLM/llama-7B-hf/', tokenizer='/zy/llama-tokenizer/', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, enforce_eager=False, kv_cache_dtype=None, seed=0)
INFO 12-29 07:01:53 llm_engine.py:230] # GPU blocks: 2802, # CPU blocks: 512
INFO 12-29 07:01:55 model_runner.py:403] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 12-29 07:01:55 model_runner.py:407] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode.
INFO 12-29 07:02:01 model_runner.py:449] Graph capturing finished in 6 secs.
SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=True, max_tokens=1024, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True)
Warming up...
Profiling iterations: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:56<00:00, 18.78s/it]
Avg latency: 18.779154599333804 seconds
[fp8_cache]root@50c663527862:/zy/github/remote/vllm# python3 benchmarks/benchmark_latency.py --input-len 1024 --output-len 1024 --model /shared/models/huggingface/LLM/llama-7B-hf/ --tokenizer /zy/llama-tokenizer/ --kv-cache-dtype="fp8"
Namespace(batch_size=8, dtype='auto', enforce_eager=False, input_len=1024, kv_cache_dtype='fp8', model='/shared/models/huggingface/LLM/llama-7B-hf/', n=1, num_iters=3, output_len=1024, profile=False, profile_result_dir=None, quantization=None, tensor_parallel_size=1, tokenizer='/zy/llama-tokenizer/', trust_remote_code=False, use_beam_search=False)
INFO 12-29 07:13:48 llm_engine.py:74] Initializing an LLM engine with config: model='/shared/models/huggingface/LLM/llama-7B-hf/', tokenizer='/zy/llama-tokenizer/', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, enforce_eager=False, kv_cache_dtype=torch.uint8, seed=0)
INFO 12-29 07:13:55 llm_engine.py:230] # GPU blocks: 5605, # CPU blocks: 1024
INFO 12-29 07:13:57 model_runner.py:403] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 12-29 07:13:57 model_runner.py:407] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode.
INFO 12-29 07:14:02 model_runner.py:449] Graph capturing finished in 5 secs.
SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=True, max_tokens=1024, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True)
Warming up...
Profiling iterations: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:52<00:00, 17.37s/it]
Avg latency: 17.37384683514635 seconds

zhaoyang-star avatar Dec 29 '23 06:12 zhaoyang-star

@WoosukKwon @zhuohan123 The PR is ready for review. Could you please take some time to review the code? Thanks a lot.

zhaoyang-star avatar Dec 29 '23 07:12 zhaoyang-star

[screenshot] Got an error when testing on this branch; adding AT_DISPATCH_CASE(at::ScalarType::Byte, __VA_ARGS__) at csrc/dispatch_utils.h:12 may not be a good way to fix this error.

seanxcwang avatar Jan 03 '24 07:01 seanxcwang

@seanxcwang Thanks for your feedback. We need to add torch.uint8 dtype for cache ops (copy, swap). I will fix it ASAP.

zhaoyang-star avatar Jan 03 '24 08:01 zhaoyang-star

[screenshot] Got an error when testing on this branch; adding AT_DISPATCH_CASE(at::ScalarType::Byte, __VA_ARGS__) at csrc/dispatch_utils.h:12 may not be a good way to fix this error.

Fixed. @seanxcwang could you please use the latest PR to test? Thanks again.

zhaoyang-star avatar Jan 03 '24 14:01 zhaoyang-star

@zhaoyang-star I have used the new PR for testing; no other errors were found.

seanxcwang avatar Jan 04 '24 06:01 seanxcwang

@zhuohan123 @WoosukKwon The PR is ready for review. Could you please take time to review the code?

zhaoyang-star avatar Jan 09 '24 00:01 zhaoyang-star

@zhuohan123 @WoosukKwon The PR is ready for review. Could you please take time to review the code?

I hope it can be merged; this is very useful for large models.

junior-zsy avatar Jan 09 '24 03:01 junior-zsy

@tjtanaa @hongxiayang We use CUDA Math APIs such as __nv_cvt_fp8_to_halfraw to do the data type conversion, so I guess it will fail on AMD GPUs. I think there are corresponding functions in HIP. We could support it in the next PR.

zhaoyang-star avatar Jan 10 '24 02:01 zhaoyang-star

  • E4M3 is the only FP8 type commonly used (and needed) in the inference/forward path; using E5M2 in the forward path is rare.

Thanks for your review.

  1. The main reason E4M3 is not the default is that it is much slower than E5M2 on pre-Hopper GPUs. For example, with benchmarks/benchmark_latency.py and --input-len 1024 --output-len 1024 on an A100-40GB, E4M3 is about 70% slower than FP16! The E4M3->half conversion needs many more bit operations on pre-Hopper GPUs, whereas Hopper handles it with a single assembly instruction (cvt.rn.f16x2.e4m3x2). So I made E5M2 the default fp8 data type.

| LLaMA-7B | Baseline (KV Cache FP16) | KV Cache FP8-E5M2 | KV Cache FP8-E4M3 |
|---|---|---|---|
| Latency (sec) | 18.78 | 17.37 | 31.77 |

  2. Yes, E4M3 (representable range about [-448, 448]) does need a scaling parameter to avoid accuracy loss, while E5M2 can do without one (see the sketch below).
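
A minimal sketch of the second point (my own illustration, not the PR's code; it assumes a PyTorch build with torch.float8_e4m3fn and clamps explicitly before casting):

    import torch

    E4M3_MAX = 448.0                       # E4M3 max normal value
    k = torch.tensor([0.01, 1.0, 300.0, 896.0], dtype=torch.float32)

    # Without scaling: anything beyond +-448 must be saturated before the cast.
    unscaled = k.clamp(-E4M3_MAX, E4M3_MAX).to(torch.float8_e4m3fn).to(torch.float32)

    # With per-tensor scaling: map the observed max onto E4M3_MAX, cast, then rescale back.
    scale = E4M3_MAX / k.abs().max()       # 0.5 for this example
    scaled = (k * scale).to(torch.float8_e4m3fn).to(torch.float32) / scale

    print(unscaled)  # 896.0 collapses to 448.0
    print(scaled)    # 896.0 survives; the small values lose some resolution instead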

zhaoyang-star avatar Jan 10 '24 09:01 zhaoyang-star

My concern about this PR is that it will incur performance, compatibility, and interop issues compared with the FP8 serving/inference solutions from NVIDIA, AMD, etc. None of them uses E5M2 for the inference or forward path.

FYI, FasterTransformer has this feature; it is a good reference.

HaiShaw avatar Jan 11 '24 18:01 HaiShaw

My concern about this PR is that it will incur performance, compatibility, and interop issues compared with the FP8 serving/inference solutions from NVIDIA, AMD, etc. None of them uses E5M2 for the inference or forward path.

FYI, FasterTransformer has this feature; it is a good reference.

Thanks for your comment. The main reason for choosing e5m2 is that it offers both acceptable accuracy loss and better latency/throughput.

As shown in the discussion above, e4m3 has much higher latency (even higher than the fp16 baseline) due to its conversion cost on pre-Hopper GPUs; similar discussions can be found elsewhere. The latency is so high that it is unacceptable for real use cases. Unfortunately I don't have access to an H100; I guess the latency of the two data types may be close on Hopper GPUs.

zhaoyang-star avatar Jan 12 '24 07:01 zhaoyang-star

@zhaoyang-star Good that you noticed my concern. IMO I tend to reject the idea of using E5M2 without scaling (from/to wider-precision numbers: fp16/fp32/etc.), but I will try to be open and give a bit more explanation of the full picture:

  1. Both E5M2 and E4M3 have much lower resolution than FP16, as we already know, but E5M2 also has a narrower range than FP16 once subnormal values are considered, even though both have 5-bit exponents.
  2. For that reason, simply quantizing wider float numbers to E5M2 with saturation to E5M2's FMAX isn't strictly good enough (the same holds for E4M3); it isn't a proven technical route, and there is no bounded guarantee for other models or algorithms. That is why the industry uses scaling factors (or inverse scaling factors) when quantizing wider numbers to narrower ones (including FP8 E4M3/E5M2) in ML numerics; this includes AMD and NVIDIA (see FasterTransformer or TensorRT-LLM), etc.
  3. For LLMs, when the GEMM compute is done in FP8 on Tensor Cores, scaling can be applied at the whole-tensor level, the per-channel level (per hidden feature column or per sequence), or the per-block/group level, but per-tensor scaling is the mainstream requirement. When FP8 Tensor Core compute isn't available, or where it isn't yet used optimally (e.g. within MHA/MQA), the computation can be done on Tensor Cores in FP16/etc., and the inference engine would normally dequantize on the fly before calling those instructions (see the sketch below).
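
As a rough illustration of those granularities (my own sketch, not taken from any vendor library), for a fake K cache of shape [num_tokens, num_heads, head_size]:

    import torch

    FP8_MAX = 448.0                        # E4M3 max normal; for E5M2 it would be 57344.0
    k = torch.randn(16, 8, 128) * 5.0      # fake K cache: [num_tokens, num_heads, head_size]

    per_tensor_scale = FP8_MAX / k.abs().max()                        # one scalar for the whole tensor
    per_channel_scale = FP8_MAX / k.abs().amax(dim=(0, 1))            # one scale per hidden-dim column
    per_token_scale = FP8_MAX / k.abs().reshape(16, -1).amax(dim=1)   # e.g. one scale per token/block

    print(per_tensor_scale.shape, per_channel_scale.shape, per_token_scale.shape)
    # torch.Size([]) torch.Size([128]) torch.Size([16])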

Hope this is helpful to some extent.

HaiShaw avatar Jan 12 '24 08:01 HaiShaw

@HaiShaw Your concern is very important for this feature. I think we could consider adding a scaling factor in the next PR. I still insist on using e5m2, as e4m3 leads to a significant increase in latency. I think one of the main reasons e4m3 is used for the TRT-LLM fp8 KV cache is that fp8-e4m3 is the recommended format on Hopper; see fp8-hopper in trt-llm.

zhaoyang-star avatar Jan 12 '24 09:01 zhaoyang-star

@HaiShaw Your concern is very important for this feature. I think we could consider adding a scaling factor in the next PR. I still insist on using e5m2, as e4m3 leads to a significant increase in latency. I think one of the main reasons e4m3 is used for the TRT-LLM fp8 KV cache is that fp8-e4m3 is the recommended format on Hopper; see fp8-hopper in trt-llm.

@zhaoyang-star, using e4m3 in the forward path and e5m2 for gradients in the backward path has a long history, going back to IBM's HFP8 proposal around 2018~2019. NVIDIA has used the same scheme in Transformer Engine since the Hopper release, as have other vendors in their libraries for their hardware (and all of them come with scaling, rather than saturation without scaling). I realize this PR may not be able to address everything at once, so follow-up PRs seem appropriate. For the current one, @zhuohan123's suggestion looks good to me too - you can namespace the feature as fp8_e5m2 or fp8_e5m2_unscaled throughout, so it is explicit and specific for users.

HaiShaw avatar Jan 13 '24 07:01 HaiShaw

@HaiShaw Your concern is very important for this feature. I think we could consider adding a scaling factor in the next PR. I still insist on using e5m2, as e4m3 leads to a significant increase in latency. I think one of the main reasons e4m3 is used for the TRT-LLM fp8 KV cache is that fp8-e4m3 is the recommended format on Hopper; see fp8-hopper in trt-llm.

@zhaoyang-star, using e4m3 in the forward path and e5m2 for gradients in the backward path has a long history, going back to IBM's HFP8 proposal around 2018~2019. NVIDIA has used the same scheme in Transformer Engine since the Hopper release, as have other vendors in their libraries for their hardware (and all of them come with scaling, rather than saturation without scaling). I realize this PR may not be able to address everything at once, so follow-up PRs seem appropriate. For the current one, @zhuohan123's suggestion looks good to me too - you can namespace the feature as fp8_e5m2 or fp8_e5m2_unscaled throughout, so it is explicit and specific for users.

I totally agree with you. I have added the fp8_e5m2_unscaled namespace to make it explicit.

zhaoyang-star avatar Jan 13 '24 09:01 zhaoyang-star

Given that there is also https://github.com/vllm-project/vllm/pull/1507 for int8, it would be good to give a little thought to the convention going forward. Here is a possibility:

  • Set --kv-cache-dtype=fp8_e5m2 for E5M2 (e.g. on A100, if you are less sensitive about accuracy).
  • Set --kv-cache-dtype=fp8_e4m3 for E4M3 (e.g. on H100 in all cases, or on A100 if you care more about accuracy and less about performance). Given that this seems to be the industry standard for fp8 forward passes in LLMs, I have a feeling it could be aliased to fp8.
  • Set --kv-cache-dtype=int8 for int8 quantization.

After https://github.com/vllm-project/vllm/pull/1507 is merged, I would expect all of these could be combined with --kv-quant-params-path for scaling if needed (or whichever other convention ends up being used for scaling).

Is all this complexity a good idea? Code-wise, these flags can probably be supported without duplication or special cases given the right code structure. From the user perspective, ideally we will eventually live in a world where E4M3 is commonplace and users only need to think about whether they want fp8 or not. But in the meantime A100s are common, of course.

pcmoritz avatar Jan 15 '24 01:01 pcmoritz

Thanks for all the comments. The PR is ready for review again. cc @zhuohan123 @HaiShaw Since the fp8 feature is more complex than I expected, we will separate it into two or three PRs. This is the first one. The next PR could cover:

  • FP8 with scaling factor
  • Porting FP8 KV Cache to AMD GPU

zhaoyang-star avatar Jan 16 '24 01:01 zhaoyang-star

@zhaoyang-star Is there a precision problem when converting from bfloat16 to fp8, since the exponents are not the same? [image]

Shawn314 avatar Jan 16 '24 06:01 Shawn314

@Shawn314 I have tested the feature with bfloat16/half/float models and the accuracy loss is acceptable. We use the __nv_cvt_bfloat16raw_to_fp8 API from cuda_fp8.h for the bf16 -> fp8 conversion.

zhaoyang-star avatar Jan 16 '24 07:01 zhaoyang-star

@zhaoyang-star Is there a precision problem when converting from bfloat16 to fp8, since the exponents are not the same? [image]

@Shawn314 The CUDA intrinsic alone just casts wider-precision numbers into the FP8 (e4m3 or e5m2) range as-is (with resolution loss); values outside the target range are saturated to the FP8 MAXNORM or NaN; the intrinsic does not merely do a bitmask-based trim. For bfloat16, not only is the number of exponent bits different (8 vs. 5 in float16), the exponent bias is also different (127 vs. 15 for float16). So a precision problem exists even for source bfloat16 values that fall below the FP8 MAXNORM (see the earlier discussion on precision loss for fp16 => e5m2 without scaling). To address this in a general way, existing practice introduces a scaling factor S (per-tensor scaling, for example), chosen so that ScaledSource = SourceTensor * S occupies the full FP8 value range, and then to_fp8_cast(ScaledSource) is applied. Thanks!
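
To make the "precision loss even below MAXNORM" point concrete, a tiny round-trip example (my own, assuming a PyTorch build that exposes the fp8 dtypes):

    import torch

    v = torch.tensor([1.3333], dtype=torch.bfloat16)      # well inside the E5M2 range
    roundtrip = v.to(torch.float8_e5m2).to(torch.bfloat16)
    print(v.item(), roundtrip.item())                      # roughly 1.336 -> 1.25: only 2 mantissa bits survive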

HaiShaw avatar Jan 16 '24 07:01 HaiShaw

@zhaoyang-star @HaiShaw Thanks for your explanation, very clear!

Shawn314 avatar Jan 17 '24 03:01 Shawn314

@zhaoyang-star @HaiShaw Thanks for your explanation, very clear!

@Shawn314 @zhaoyang-star I just opened a FP8 discussion below, comments are welcome! https://github.com/vllm-project/vllm/discussions/2461

HaiShaw avatar Jan 17 '24 06:01 HaiShaw

I tested benchmark_throughput.py but got this error. @zhaoyang-star

 python3 benchmarks/benchmark_throughput.py --input-len 1024 --output-len 1024 --model /mnt/infra/fangxiao/models/llama13Bhuggingface/ --tokenizer /mnt/infra/fangxiao/models/llama13Bhuggingface/ --kv-cache-dtype="fp8" --num-prompts=200
Namespace(backend='vllm', dataset=None, input_len=1024, output_len=1024, model='/mnt/infra/fangxiao/models/llama13Bhuggingface/', tokenizer='/mnt/infra/fangxiao/models/llama13Bhuggingface/', quantization=None, tensor_parallel_size=1, n=1, use_beam_search=False, num_prompts=200, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto', enforce_eager=False, kv_cache_dtype='fp8')
INFO 01-18 07:33:33 config.py:296] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. But it may make slight accuray drop. Currently we only support fp8 without scaling factors and make e5m2 as a default format.
INFO 01-18 07:33:33 llm_engine.py:70] Initializing an LLM engine with config: model='/mnt/infra/fangxiao/models/llama13Bhuggingface/', tokenizer='/mnt/infra/fangxiao/models/llama13Bhuggingface/', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, enforce_eager=False, kv_cache_dtype=torch.uint8, seed=0)
INFO 01-18 07:33:50 llm_engine.py:299] # GPU blocks: 7532, # CPU blocks: 655
INFO 01-18 07:33:52 model_runner.py:512] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 01-18 07:33:52 model_runner.py:516] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 01-18 07:33:55 model_runner.py:567] Graph capturing finished in 4 secs.
Processed prompts:   0%|                                                                                                                                                                                                                     | 0/200 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/mnt/infra/fangxiao/github/fp8-vllm/benchmarks/benchmark_throughput.py", line 328, in <module>
    main(args)
  File "/mnt/infra/fangxiao/github/fp8-vllm/benchmarks/benchmark_throughput.py", line 207, in main
    elapsed_time = run_vllm(requests, args.model, args.tokenizer,
  File "/mnt/infra/fangxiao/github/fp8-vllm/benchmarks/benchmark_throughput.py", line 109, in run_vllm
    llm._run_engine(use_tqdm=True)
  File "/mnt/infra/fangxiao/github/fp8-vllm/vllm/entrypoints/llm.py", line 185, in _run_engine
    step_outputs = self.llm_engine.step()
  File "/mnt/infra/fangxiao/github/fp8-vllm/vllm/engine/llm_engine.py", line 731, in step
    all_outputs = self._run_workers(
  File "/mnt/infra/fangxiao/github/fp8-vllm/vllm/engine/llm_engine.py", line 901, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/infra/fangxiao/github/fp8-vllm/vllm/worker/worker.py", line 200, in execute_model
    output = self.model_runner.execute_model(seq_group_metadata_list,
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/infra/fangxiao/github/fp8-vllm/vllm/worker/model_runner.py", line 472, in execute_model
    output = self.model.sample(
  File "/mnt/infra/fangxiao/github/fp8-vllm/vllm/model_executor/models/llama.py", line 295, in sample
    next_tokens = self.sampler(self.lm_head.weight, hidden_states,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/infra/fangxiao/github/fp8-vllm/vllm/model_executor/layers/sampler.py", line 63, in forward
    do_min_p) = SamplingTensors.from_sampling_metadata(
  File "/mnt/infra/fangxiao/github/fp8-vllm/vllm/model_executor/sampling_metadata.py", line 137, in from_sampling_metadata
    sampling_tensors = SamplingTensors.from_lists(
  File "/mnt/infra/fangxiao/github/fp8-vllm/vllm/model_executor/sampling_metadata.py", line 167, in from_lists
    temperatures_t = torch.tensor(
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Processed prompts:   0%|

Shawn314 avatar Jan 18 '24 07:01 Shawn314

@Shawn314 I tested it and found no error. Could you pull the latest PR?

[fp8_cache]root@50c663527862:/bigdata/zhaoyang/github/remote/vllm# python3 benchmarks/benchmark_throughput.py --input-len 1024 --output-len 1024 --model /bigdata/shared/models/huggingface/LLM/llama-13b-hf/ --tokenizer /bigdata/zhaoyang/llama-tokenizer/ --kv-cache-dtype fp8 --num-prompts 200
Namespace(backend='vllm', dataset=None, dtype='auto', enforce_eager=False, hf_max_batch_size=None, input_len=1024, kv_cache_dtype='fp8', max_model_len=None, model='/bigdata/shared/models/huggingface/LLM/llama-13b-hf/', n=1, num_prompts=200, output_len=1024, quantization=None, seed=0, tensor_parallel_size=1, tokenizer='/bigdata/zhaoyang/llama-tokenizer/', trust_remote_code=False, use_beam_search=False)
INFO 01-18 11:24:33 config.py:296] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. But it may make slight accuray drop. Currently we only support fp8 without scaling factors and make e5m2 as a default format.
INFO 01-18 11:24:34 llm_engine.py:70] Initializing an LLM engine with config: model='/bigdata/shared/models/huggingface/LLM/llama-13b-hf/', tokenizer='/bigdata/zhaoyang/llama-tokenizer/', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, enforce_eager=False, kv_cache_dtype=torch.uint8, seed=0)
INFO 01-18 11:26:43 llm_engine.py:299] # GPU blocks: 1656, # CPU blocks: 655
INFO 01-18 11:26:51 model_runner.py:512] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 01-18 11:26:51 model_runner.py:516] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 01-18 11:27:03 model_runner.py:567] Graph capturing finished in 12 secs.
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [08:20<00:00,  2.50s/it]
Throughput: 0.40 requests/s, 818.69 tokens/s

zhaoyang-star avatar Jan 18 '24 12:01 zhaoyang-star