Model built with ReDrafter produces substantially lower quality outputs
System Info
- CPU architecture: x86_64
- CPU/Host memory size: >1TiB
- GPU properties:
- GPU name: NVIDIA H100
- GPU memory size: 80GiB
- Libraries:
- Container used: nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3
- NVIDIA driver version: 535.161.08
- OS: Ubuntu 20.04.6 LTS
Who can help?
No response
Information
- [x] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [x] My own task or dataset (give details below)
Reproduction
- Downloaded `lmsys/vicuna-7b-v1.3` from HuggingFace
- Trained the drafting model following the instructions from the original repo
- Built the engine using the commands provided in the example
```shell
python3 examples/redrafter/convert_checkpoint.py --model_dir $HF_MODEL_DIR --drafter_model_dir $DRAFTER_DIR --tp_size 1 --dtype float16 --redrafter_num_beams 64 --redrafter_draft_len_per_beam 5 --output_dir $CKPT_MODEL_DIR
trtllm-build --checkpoint_dir $CKPT_MODEL_DIR --output_dir $ENGINE_DIR --gemm_plugin float16 --speculative_decoding_mode explicit_draft_tokens --max_batch_size 1
```
- Ran the model on an example prompt (using `run.py` from the examples folder)
I am using the same prompts which were used for experiments in the original paper, which can be accessed on HuggingFace
```shell
python examples/run.py --max_output_len 1024 --input_text "$PROMPT" --temperature 1 --top_k 1 --engine_dir $ENGINE_DIR --tokenizer_dir $HF_MODEL_DIR
```
Expected behavior
Using ReDrafter does not affect the generated text whatsoever, or at least does not lead to quality drops
Actual behavior
In my case the prompt was
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: Please describe the concept of machine learning. Could you elaborate on the differences between supervised, unsupervised, and reinforcement learning? Provide real-world examples of each. ASSISTANT:
The script returned a JSON with the following `text_output` (the final tokens appear to be random):
Machine learning is a subset of artificial intelligence that involves the use of algorithms to enable a system to improve its performance on a specific task over time. The system learns from data and experiences, enabling it to make predictions, classify data, or take actions based on the input it receives.\n\nThere are three main types of machine learning: supervised learning, unsupervised learning, and reinforcement learning.\n\n1. Supervised learning: In supervised learning, the system is trained on a labeled dataset, which means that the data is already labeled with the correct output. The system learns to predict the output for new, unseen data by finding a relationship between the input and output. For example, a supervised learning algorithm could be trained on a dataset of images labeled with their corresponding object categories, such as dogs, cats, and cars. Once trained, the algorithm can be used to predict the category of a new image based on its features.\n\nReal-world example: Image classification is a common application of supervised learning. For example, facial recognition systems use supervised learning to identify faces in images and videos.\n\n1. Unsupervised learning: In unsupervised learning, the system is not given any labeled data. Instead, the system must find patterns and relationships in the data on its own. The goal is to identify underlying structures or groups within the data. For example, an unsupervised learning algorithm could be used to group similar images of faces together based on their features, even if they have not been labeled with their corresponding object categories.\n\nReal-world example: Clustering is a common application of unsupervised learning. For example, social media platforms use unsupervised learning to group similar posts together based on their content and sentiment.\n\n1. 
Reinforcement learning: In reinforcement learning, the system learns by interacting with an environment and receiving feedback in the form of rewards or penalties. The system learns to take actions that maximize the rewards and minimize the penalties. For example, a reinforcement learning algorithm could be used to train a robot to navigate a maze by receiving a reward for reaching the end of the maze and a penalty for colliding with walls.\n\nReal-world example: Game-playing agents are a common application of reinforcement learning. For example, AlphaGo, a computer program developed by DeepMind, used reinforcement learning to learn how to play the board game Go by playing against human opponents and receiving rewards for winning games.\n\nIn summary, machine learning involves using algorithms to enable a system to improve its performance on a specific task over time. There are three main types of machine learning: supervised learning, unsupervised learning, and reinforcement learning. Supervised learning involves training a system on labeled data, while unsupervised learning involves finding patterns and relationships in unlabeled data. Reinforcement learning involves learning by interacting with an environment and receiving feedback in\n\n번역결과 \n기계 학습은 입력을 사용하여 특정 작업에 대한 성능을 시간이 지남에 따라 개선할 수 있는 인공 지능의 일종으로, 알고리즘을 사용하여 시스템이 데이터에서 관계를 찾아 새로운 데이터에 대한 출력을 결정하거나 데이터를 분6레이션하거나 행동을 결정하는 등 특정 작업을 수행할 수 있습니다.\n기계 학습에는 세 가지 주요 유형이 있습니다.\n1. 승리 학습: 승리 학습에서는 시스템은 레이블이 없는 데이터에 대해 학습하지 않습니다. 대신 시스�6레이��6레이션 관계를 찾고 새로운 데이터에 대한 출력을 결정합니다. 목적은 데이터의 기능에서 발견된 구조나 관계를 찾는 것입니다. 예를 들어 승리 학습 알고리즘은 이미지에 대한 출력을 예측하기6레이션 학습을 사용하여 새로운 이미지에 대해 출력을 결정하는 것과 같이 이미지 분류의 일종입니다.\n실제 사례: 이미지 분류는 승리 학습의 일종입니다. 예를 들어 얼굴 분류 시스템은 얼굴의 출력을 예측하기 위해 이미지의 기능과 같이 새로운 이미지에 대해 출력을 결정하는 것입니다.\n1. 비슷 학습: 비슷 학습에서는 시스템은 레이블이 없는 데이터에 대해 학습하지 않습니다. 대신 시스템은 새로운 데이터에서 유사한 구조나 그룹을 찾아 새로운 데이터에 대해6레이션 출력을 결정합니다. 목적은 데이터의 기능에서 발견된 구조나 관계를 찾는 것입니다.\n실제 사례: 그룹화는 비슷 학습의 일종입니다. 
예를 들어 소셜미스러운 데이터에서 유사한 이미지를 찾아 새로운 이미지에 대해 출력을 결정하는 것입니다.\n1. 환경 학습: 환경 학습에서는 시스템은 환경과 상варвар gergerger uper uperuperuperuperuperuperuperuperuperuperuperuperuperuperuperuper Fr Fr Fr Fr Fr Fr Fr Fr Fr Fr Fr Fr Fr Fr Fr Fr Frger injury injuryvegulesulesulesulesulesьяьяьяья说endonasonasonasonasnotifyallyallyfrefrefrefrefre Rein Rein Rein ReindesdistWindowWindowWindowazzazzazziatiatallyallyuperuperuperuperuperuperuperuperuperuperuperuperuperuperuperuperuperuperuperériériériériériériériériazzazzazzazzuperuperuperuperuperuperériéri Ville Ville Villeufenufenériériéri ker laeskeskeskeskeskeskeskeskeskeskeskeskeskeskeskeskesk ind L in
Building the model without ReDrafter with the following commands

```shell
python3 examples/llama/convert_checkpoint.py --model_dir $HF_MODEL_DIR --tp_size 1 --dtype float16 --output_dir $CKPT_MODEL_DIR
trtllm-build --checkpoint_dir $CKPT_MODEL_DIR --output_dir $ENGINE_DIR --gemm_plugin float16 --max_batch_size 1
```

and executing `run.py` with the same arguments returns the following:
Machine learning is a subset of artificial intelligence that involves the use of algorithms to enable a system to improve its performance on a specific task over time. The system learns from data and experiences, enabling it to make predictions, classify data, or take actions based on the input it receives.\n\nThere are three main types of machine learning: supervised learning, unsupervised learning, and reinforcement learning.\n\n1. Supervised learning: In supervised learning, the system is trained on a labeled dataset, which means that the data is already labeled with the correct output. The system learns to map the input data to the correct output by using a learning algorithm. This type of machine learning is commonly used in image recognition, speech recognition, and natural language processing. For example, a supervised learning algorithm can be trained on a dataset of images labeled with their corresponding object categories, such as dogs, cats, and cars. Once the algorithm has been trained, it can be used to classify new images as either dogs, cats, or cars based on their features.\n2. Unsupervised learning: In unsupervised learning, the system is trained on an unlabeled dataset, which means that the data does not have the correct output already labeled. The system learns to find patterns or structure in the data by using a learning algorithm. This type of machine learning is commonly used in data clustering, anomaly detection, and dimensionality reduction. For example, an unsupervised learning algorithm can be used to group similar images of faces together based on their features, such as the shape of their eyes or the curve of their lips.\n3. Reinforcement learning: In reinforcement learning, the system learns by interacting with an environment and receiving feedback in the form of rewards or penalties. The system learns to take actions that maximize the rewards it receives over time. 
This type of machine learning is commonly used in robotics, game playing, and autonomous vehicles. For example, a reinforcement learning algorithm can be used to train a robot to navigate a maze by receiving a reward for reaching the end of the maze and a penalty for colliding with walls.\n\nOverall, machine learning is a powerful tool that can be used to solve a wide range of problems in various industries, including healthcare, finance, and marketing. By using machine learning algorithms, businesses can automate processes, gain insights from data, and make predictions that can help them make better decisions.
Finally, building the engine after converting the HF model with `--redrafter_num_beams 1` instead of `--redrafter_num_beams 64` returns the following:
Machine learning is a subset of artificial intelligence that involves the use of algorithms to enable a system to improve its performance on a specific task over time. The system learns from data and experiences, enabling it to make predictions, classify data, or take actions based on the input it receives.\n\nThere are three main types of machine learning: supervised learning, unsupervised learning, and reinforcement learning.\n\n1. Supervised learning: In supervised learning, the system is trained on a labeled dataset, which means that the data is already labeled with the correct output. The system learns to predict the output for new, unseen data by finding a relationship between the input and output. For example, a supervised learning algorithm could be trained on a dataset of images labeled with their corresponding object categories, such as dogs, cats, and cars. Once trained, the algorithm can be used to predict the category of a new image based on its features.\n\nReal-world example: Image classification is a common application of supervised learning. For example, facial recognition systems use supervised learning to identify faces in images and videos.\n\n1. Unsupervised learning: In unsupervised learning, the system is not given any labeled data. Instead, the system must find patterns and relationships in the data on its own. The goal is to identify underlying structures or groups within the data. For example, an unsupervised learning algorithm could be used to group similar images of faces together based on their features, even if they have not been labeled with their corresponding object categories.\n\nReal-world example: Clustering is a common application of unsupervised learning. For example, social media platforms use unsupervised learning to group similar posts together based on their content and sentiment.\n\n1. 
Reinforcement learning: In reinforcement learning, the system learns by interacting with an environment and receiving feedback in the form of rewards or penalties. The system learns to take actions that maximize the rewards and minimize the penalties. For example, a reinforcement learning algorithm could be used to train a robot to navigate a maze by receiving a reward for reaching the end of the maze and a penalty for colliding with walls.\n\nReal-world example: Game-playing agents are a common application of reinforcement learning. For example, AlphaGo, a computer program developed by DeepMind, used reinforcement learning to learn how to play the board game Go by playing against human opponents and receiving rewards for winning games.\n\nIn summary, machine learning involves using algorithms to enable a system to improve its performance on a specific task over time. There are three main types of machine learning: supervised learning, unsupervised learning, and reinforcement learning. Supervised learning involves training a system on labeled data, while unsupervised learning involves finding patterns and relationships in unlabeled data. Reinforcement learning involves learning by interacting with an environment and receiving feedback in the form of rewards or penalties.
This text is better and shares a large common prefix with the very first text, but it still differs from the text generated by the model built without ReDrafter, and it contains artifacts (in this example, every list item is numbered "1." instead of following the ordered numbering in the original response).
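To quantify how far the ReDrafter output drifts from the baseline, I compare the two generations by the length of their shared leading token run. A rough sketch (whitespace splitting stands in for real tokenization here; in practice the engine's tokenizer should be used):

```python
def common_prefix_len(a_tokens, b_tokens):
    """Length of the shared leading token run between two generations."""
    n = 0
    for x, y in zip(a_tokens, b_tokens):
        if x != y:
            break
        n += 1
    return n

# Toy example: the outputs agree on the first three tokens, then diverge.
redrafter_out = "Machine learning is a subset of AI".split()
baseline_out = "Machine learning is commonly defined as".split()
print(common_prefix_len(redrafter_out, baseline_out))  # -> 3
```

With greedy sampling (`--temperature 1 --top_k 1`) speculative decoding should be lossless, so any divergence point found this way marks where the ReDrafter engine first deviates from the baseline.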
Additional notes
I can provide my trained ReDrafter weights for bug reproduction purposes upon request
@geaned
Thanks for providing the feedback.
ReDrafter was developed in collaboration with a key customer and is based on the Medusa algorithm; new speculative decoding algorithms were invented later, and we have added support for those as well.
May I ask your concrete reason for choosing to try ReDrafter? It would help me prioritize the team's support bandwidth.
Thanks, June
@juney-nvidia Thank you for your reply!
As far as I am concerned, ReDrafter is the only implementation focused not only on latency but also on throughput (at least according to the research Apple provided), which is essential for our scenario. Other implementations, including Medusa and EAGLE, focus solely on latency and require dedicating large resources to each query to attain speedups similar to those shown in the corresponding papers.
Another point is that EAGLE does not currently support FP8 quantization, and running the model in FP16/BF16 severely increases latency under high load, which makes it a non-viable solution for us.
Also, could you please tell me whether the support matrix for ReDrafter is accurate?
It states that, just like Medusa, ReDrafter supports FP8 weights for the base model, but that does not seem to be the case, judging by the lack of quantization examples in `redrafter/README.md`, the absence of quantization-related parameters in `redrafter/convert_checkpoint.py`, and the absence of ReDrafter-related parameters in `quantization/quantize.py`.
I have also run inference on a proprietary model with the Llama architecture, built using the same commands

```shell
python3 examples/redrafter/convert_checkpoint.py --model_dir $HF_MODEL_DIR --drafter_model_dir $DRAFTER_DIR --tp_size 1 --dtype float16 --redrafter_num_beams 1 --redrafter_draft_len_per_beam 5 --output_dir $CKPT_MODEL_DIR
trtllm-build --checkpoint_dir $CKPT_MODEL_DIR --output_dir $ENGINE_DIR --gemm_plugin float16 --speculative_decoding_mode explicit_draft_tokens --max_batch_size 1
```
on custom data, and noticed that the problems do not disappear: `<unk>` tokens appear at random positions in the model response (even though the model should not generate them), and in a small number of cases the last one to three tokens generated before the `<unk>` token are generated again after it, with generation continuing afterwards.
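As a sanity check on my side, I scan the generated token ids for exactly this pattern: a spurious id (assumed here to be 0 for `<unk>`; this depends on the tokenizer) immediately followed by a re-emission of the one to three tokens that preceded it. A rough sketch:

```python
def find_unk_artifacts(ids, unk_id=0, max_repeat=3):
    """Return (position, repeat_length) pairs where unk_id appears and the
    tokens just before it are emitted again right after it."""
    hits = []
    for i, tok in enumerate(ids):
        if tok != unk_id:
            continue
        for k in range(1, max_repeat + 1):
            # Compare the k tokens before the unk with the k tokens after it.
            if i >= k and ids[i - k:i] == ids[i + 1:i + 1 + k]:
                hits.append((i, k))
                break
    return hits

# Toy example: tokens 7, 8 are repeated after a spurious 0 at position 4.
print(find_unk_artifacts([5, 6, 7, 8, 0, 7, 8, 9]))  # -> [(4, 2)]
```

Every response exhibiting the bug matched this pattern, which is what makes me suspect the engine rather than the drafter weights.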
This definitely seems like a bug in the engine building process, as I was unable to replicate this behavior on the original test setup using the `generate.py` script after training the drafting RNN.
@juney-nvidia Does the development team have any plans to look into this issue? It is great to have a speculative decoding method to increase model performance, but it is much better to have one that works as expected :)
> I experience `<unk>` tokens appearing on random positions
Could you confirm that the `<unk>` token is token 0 in your vocab?
I had a similar issue where ReDrafter randomly output token 0. This PR #3278 should fix it.
@1ytic
Yes, indeed it is! I will try running inference with your modification promptly
UPD: It works!