[Usage]: Acceptance rate for Speculative Decoding

itsdaniele opened this issue 1 year ago

I have been running the scripts from https://docs.vllm.ai/en/latest/models/spec_decode.html on how to do speculative decoding with vLLM.

However, it seems that the acceptance rate is not shown or logged anywhere. Is there any way to compute or access it?

itsdaniele · Aug 08 '24

Do you see any stats from the engine? You should see something like:

Speculative metrics: Draft acceptance rate: 0.607, System efficiency: 0.510, Number of speculative tokens: 4, Number of accepted tokens: 32594, Number of draft tokens: 53716, Number of emitted tokens: 34244.
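For reference, the two ratios in that line can be reproduced from the raw counts. The snippet below is a minimal sketch assuming the usual definitions (acceptance rate as accepted over proposed draft tokens; system efficiency as emitted tokens over the theoretical maximum of k + 1 tokens per scoring step); the authoritative formulas live in vllm/spec_decode/metrics.py.

# Sanity-check how the numbers in the log line above relate to each other.
# Assumed definitions (check vllm/spec_decode/metrics.py for the real ones):
#   draft acceptance rate = accepted draft tokens / proposed draft tokens
#   system efficiency     = emitted tokens / max emittable tokens,
#                           where each scoring step can emit at most k + 1 tokens.
k = 4                # number of speculative tokens per step
accepted = 32594     # number of accepted tokens
draft = 53716        # number of draft tokens
emitted = 34244      # number of emitted tokens

acceptance_rate = accepted / draft              # ~0.607
num_steps = draft / k                           # scoring steps implied by the draft count
efficiency = emitted / (num_steps * (k + 1))    # ~0.510

print(f"acceptance rate ~= {acceptance_rate:.3f}, system efficiency ~= {efficiency:.3f}")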

cadedaniel · Aug 08 '24

> Do you see any stats from the engine? You should see something like:
>
> Speculative metrics: Draft acceptance rate: 0.607, System efficiency: 0.510, Number of speculative tokens: 4, Number of accepted tokens: 32594, Number of draft tokens: 53716, Number of emitted tokens: 34244.

Sorry, I should have been clearer: I am trying to run this using offline inference, as in the spec decoding tutorial.

My code is the following:

from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="facebook/opt-6.7b",
    tensor_parallel_size=1,
    speculative_model="facebook/opt-125m",
    num_speculative_tokens=5,
    use_v2_block_manager=True,
)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

And I can't seem to find the acceptance rate anywhere in the output.

I have also tried running the OpenAI API example from the tutorial and then going to the /metrics endpoints, but I don't see the acceptance rate there either.
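For the server case, a quick way to see what the Prometheus /metrics endpoint actually exposes is to dump every line that mentions spec_decode. The sketch below assumes the server is listening on the default http://localhost:8000; the exact metric names vary across vLLM versions, so it just filters rather than looking for a specific name.

# Dump speculative-decoding metrics from the OpenAI-compatible server's
# Prometheus endpoint. Assumes the default http://localhost:8000 address.
import urllib.request

with urllib.request.urlopen("http://localhost:8000/metrics") as resp:
    body = resp.read().decode("utf-8")

for line in body.splitlines():
    # Skip Prometheus HELP/TYPE comment lines; print any spec_decode samples.
    if "spec_decode" in line and not line.startswith("#"):
        print(line)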

itsdaniele · Aug 09 '24

The acceptance rate stats will print every 5 seconds; try this:

#!/usr/bin/env python3

import time

from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="facebook/opt-6.7b",
    tensor_parallel_size=1,
    speculative_model="facebook/opt-125m",
    num_speculative_tokens=5,
    use_v2_block_manager=True,
    disable_log_stats=False,  # keep engine stat logging on so the metrics are emitted
)

outputs = llm.generate(prompts, sampling_params)

# The speculative metrics are logged at most every 5 seconds, so wait past the
# interval and run another generate() call to trigger the log line.
time.sleep(5)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

This can obviously be improved (maybe a flag to customize the interval?). LMK if you are interested in this and I can give you code pointers.

cadedaniel · Aug 09 '24

Thank you! This worked great.

itsdaniele · Aug 13 '24

> This can obviously be improved (maybe a flag to customize the interval?). LMK if you are interested in this and I can give you code pointers.

Thanks, would be great to have some pointers.

itsdaniele · Aug 13 '24

  • Metrics are generated here https://github.com/vllm-project/vllm/blob/2ecf7b175703de020943b33532baaf6a31f69d3a/vllm/model_executor/layers/spec_decode_base_sampler.py#L125-L127
  • Copied to CPU periodically here https://github.com/vllm-project/vllm/blob/2ecf7b175703de020943b33532baaf6a31f69d3a/vllm/spec_decode/metrics.py#L82-L96
  • Copied to LLM engine here https://github.com/vllm-project/vllm/blob/2ecf7b175703de020943b33532baaf6a31f69d3a/vllm/spec_decode/spec_decode_worker.py#L741-L745 and https://github.com/vllm-project/vllm/blob/2ecf7b175703de020943b33532baaf6a31f69d3a/vllm/engine/llm_engine.py#L1483-L1489
  • Printed here https://github.com/vllm-project/vllm/blob/2ecf7b175703de020943b33532baaf6a31f69d3a/vllm/engine/metrics.py#L432-L441

The metrics are currently cumulative over the lifetime of the server.
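To make the flow in those pointers concrete, here is a stripped-down sketch of the pattern they describe: the sampler increments cumulative counters on every speculative step, and a collector only hands out a snapshot once the logging interval has elapsed. Names and structure are simplified for illustration and do not match the actual vLLM classes.

import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpecMetricsSnapshot:
    accepted: int
    draft: int
    emitted: int

    @property
    def acceptance_rate(self) -> float:
        return self.accepted / self.draft if self.draft else 0.0


class ToyMetricsCollector:
    """Simplified stand-in for the periodic collection described in the pointers above."""

    def __init__(self, interval_s: float = 5.0) -> None:
        self.interval_s = interval_s
        self._last_collect = time.time()
        # Cumulative counters: they grow for the lifetime of the "server".
        self.accepted = 0
        self.draft = 0
        self.emitted = 0

    def record_step(self, accepted: int, draft: int, emitted: int) -> None:
        """Called after every speculative step with that step's counts."""
        self.accepted += accepted
        self.draft += draft
        self.emitted += emitted

    def maybe_collect(self) -> Optional[SpecMetricsSnapshot]:
        """Return a snapshot only if the logging interval has elapsed, else None."""
        now = time.time()
        if now - self._last_collect < self.interval_s:
            return None
        self._last_collect = now
        return SpecMetricsSnapshot(self.accepted, self.draft, self.emitted)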

cadedaniel · Aug 14 '24

@cadedaniel I still can't see the acceptance rate metrics, even after sleeping for 5 s or more. My code is the following:

import time

from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="/root/autodl-tmp/Qwen-7B",
    tensor_parallel_size=1,
    speculative_model="/root/autodl-tmp/Qwen-1_8B",
    num_speculative_tokens=1,
    use_v2_block_manager=True,
    disable_log_stats=False,
    trust_remote_code=True,
    max_model_len=2048,
)

outputs = llm.generate(prompts, sampling_params)

time.sleep(50)
print("after 50s later \n")

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

And here is the CLI output:


INFO 08-19 15:23:12 config.py:1450] Downcasting torch.float32 to torch.float16.
INFO 08-19 15:23:12 config.py:1450] Downcasting torch.float32 to torch.float16.
INFO 08-19 15:23:12 llm_engine.py:174] Initializing an LLM engine (v0.5.4) with config: model='/root/autodl-tmp/Qwen-7B', speculative_config=SpeculativeConfig(draft_model='/root/autodl-tmp/Qwen-1_8B', num_spec_tokens=1), tokenizer='/root/autodl-tmp/Qwen-7B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/root/autodl-tmp/Qwen-7B, use_v2_block_manager=True, enable_prefix_caching=False)
/root/miniconda3/envs/llama_index/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
WARNING 08-19 15:23:12 tokenizer.py:129] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO 08-19 15:23:12 spec_decode_worker.py:156] Configuring SpecDecodeWorker with proposer=<class 'vllm.spec_decode.multi_step_worker.MultiStepWorker'>
INFO 08-19 15:23:12 spec_decode_worker.py:170] Configuring SpecDecodeWorker with sampler=<class 'vllm.model_executor.layers.rejection_sampler.RejectionSampler'>
INFO 08-19 15:23:13 model_runner.py:720] Starting to load model /root/autodl-tmp/Qwen-7B...
Loading safetensors checkpoint shards:   0% Completed | 0/8 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  12% Completed | 1/8 [00:00<00:03,  1.90it/s]
Loading safetensors checkpoint shards:  25% Completed | 2/8 [00:01<00:03,  1.72it/s]
Loading safetensors checkpoint shards:  38% Completed | 3/8 [00:01<00:02,  1.68it/s]
Loading safetensors checkpoint shards:  50% Completed | 4/8 [00:02<00:02,  1.68it/s]
Loading safetensors checkpoint shards:  62% Completed | 5/8 [00:02<00:01,  1.66it/s]
Loading safetensors checkpoint shards:  75% Completed | 6/8 [00:03<00:01,  1.67it/s]
Loading safetensors checkpoint shards:  88% Completed | 7/8 [00:03<00:00,  1.86it/s]
Loading safetensors checkpoint shards: 100% Completed | 8/8 [00:04<00:00,  1.83it/s]
Loading safetensors checkpoint shards: 100% Completed | 8/8 [00:04<00:00,  1.76it/s]

INFO 08-19 15:23:18 model_runner.py:732] Loading model weights took 14.3919 GB
INFO 08-19 15:23:18 model_runner.py:720] Starting to load model /root/autodl-tmp/Qwen-1_8B...
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:00<00:00,  2.74it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00,  2.67it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00,  2.68it/s]

INFO 08-19 15:23:18 model_runner.py:732] Loading model weights took 3.4594 GB
INFO 08-19 15:23:19 gpu_executor.py:102] # GPU blocks: 190, # CPU blocks: 512
INFO 08-19 15:23:21 model_runner.py:1024] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 08-19 15:23:21 model_runner.py:1028] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 08-19 15:23:31 model_runner.py:1225] Graph capturing finished in 10 secs.
INFO 08-19 15:23:31 model_runner.py:1024] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 08-19 15:23:31 model_runner.py:1028] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 08-19 15:23:40 model_runner.py:1225] Graph capturing finished in 8 secs.
WARNING 08-19 15:23:40 multi_step.py:57] Prompt logprob is not supported by multi step workers. (e.g., speculative decode uses multi step workers).
Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  2.05it/s, est. speed input: 10.28 toks/s, output: 32.88 toks/s]
after 50s later 

Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  3.73it/s, est. speed input: 18.67 toks/s, output: 59.75 toks/s]
Prompt: 'The future of AI is', Generated text: ' exciting, and I am thrilled to be a part of this journey. With the'

xdtyjwj · Aug 19 '24

Can you add a print here to verify that the acceptance rate metrics are being collected?

https://github.com/vllm-project/vllm/blob/c6af027a35b657b20ec60adac77cb75264b65a98/vllm/spec_decode/metrics.py#L84-L98

They should be printed here: https://github.com/vllm-project/vllm/blob/c6af027a35b657b20ec60adac77cb75264b65a98/vllm/engine/metrics.py#L386-L392

cadedaniel · Aug 20 '24

> Can you add a print here to verify that the acceptance rate metrics are being collected?
>
> https://github.com/vllm-project/vllm/blob/c6af027a35b657b20ec60adac77cb75264b65a98/vllm/spec_decode/metrics.py#L84-L98
>
> They should be printed here:
>
> https://github.com/vllm-project/vllm/blob/c6af027a35b657b20ec60adac77cb75264b65a98/vllm/engine/metrics.py#L386-L392

Thanks, I have solved this problem just by switching to the latest vLLM version.

xdtyjwj · Aug 21 '24

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions[bot] · Nov 20 '24

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

github-actions[bot] · Dec 21 '24

Hi @cadedaniel, I wonder whether it is possible to update the draft model weights once the vLLM server has started?

Neo9061 · Jan 16 '25

@cadedaniel Sorry, another question: I am able to print out multiple speculative metrics like the ones below. I suppose those are cumulative, and hence I only need to read the last speculative metrics line for each example?

INFO 02-06 03:05:53 metrics.py:475] Speculative metrics: Draft acceptance rate: 0.200, System efficiency: 0.167, Number of speculative tokens: 5, Number of accepted tokens: 1, Number of draft tokens: 5, Number of emitted tokens: 1.
INFO 02-06 03:18:09 metrics.py:475] Speculative metrics: Draft acceptance rate: 0.067, System efficiency: 0.167, Number of speculative tokens: 5, Number of accepted tokens: 1, Number of draft tokens: 15, Number of emitted tokens: 3.
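Since the counts are cumulative (see the earlier comment in this thread), per-interval numbers can be recovered by differencing consecutive log lines. A small sketch using the two lines above; the helper is illustrative, not part of vLLM.

# Difference two cumulative snapshots to get the acceptance rate for just the
# work done between them. Counts are taken from the two log lines above.
def interval_acceptance_rate(prev: dict, curr: dict) -> float:
    accepted = curr["accepted"] - prev["accepted"]
    draft = curr["draft"] - prev["draft"]
    return accepted / draft if draft else 0.0

snap1 = {"accepted": 1, "draft": 5, "emitted": 1}    # first log line
snap2 = {"accepted": 1, "draft": 15, "emitted": 3}   # second log line

# 0.0: no new tokens were accepted out of the 10 additional draft tokens.
print(interval_acceptance_rate(snap1, snap2))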

Neo9061 · Feb 06 '25