
Integrate EAGLE with ITREX

Open siddhivelankar23 opened this issue 1 year ago • 2 comments

Type of Change

- Added a feature to use EAGLE (speculative sampling) with ITREX, as discussed with the ITREX team and Haim Barad from my team.
- Added an example script showing how to use this feature.
- Added a README with instructions.

API not changed

Description

This PR adds support for EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) to Intel Extension for Transformers. EAGLE is a speculative sampling method that improves text generation speed. Links to the EAGLE repository and the research paper are included in the README.
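
For context, the sketch below illustrates the core idea behind speculative sampling with greedy verification: a cheap draft model proposes several tokens, and the target model checks them in a single forward pass, keeping the longest agreeing prefix plus one bonus token. This is a simplified toy, not the ITREX/EAGLE implementation; `draft_next_tokens` and `target_argmax` are hypothetical stand-ins for the real models.

```python
# Toy illustration of greedy speculative decoding (not the ITREX/EAGLE code).
# `draft_next_tokens` and `target_argmax` are hypothetical stand-ins for real models.
from typing import Callable, List

def speculative_step(
    prefix: List[int],
    draft_next_tokens: Callable[[List[int], int], List[int]],  # proposes k draft tokens
    target_argmax: Callable[[List[int]], List[int]],           # greedy next token after each position
    k: int = 4,
) -> List[int]:
    draft = draft_next_tokens(prefix, k)          # k cheap guesses from the draft model
    verified = target_argmax(prefix + draft)      # one target forward over prefix + draft
    accepted = []
    for i, tok in enumerate(draft):
        # the target's prediction given prefix + draft[:i] must agree with draft token i
        if verified[len(prefix) + i - 1] == tok:
            accepted.append(tok)
        else:
            break
    # always append one token from the target so progress is guaranteed
    bonus = verified[len(prefix) + len(accepted) - 1]
    return prefix + accepted + [bonus]

if __name__ == "__main__":
    # Dummy "models": the target always predicts (last token + 1); the draft gets the third token wrong.
    target = lambda seq: [t + 1 for t in seq]
    draft = lambda seq, k: [seq[-1] + 1, seq[-1] + 2, seq[-1] + 99, seq[-1] + 4][:k]
    print(speculative_step([0, 1, 2], draft, target))  # -> [0, 1, 2, 3, 4, 5]
```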

Expected Behavior & Potential Risk

When the example_eagle.py script is run as described in the README, the generated text and the measured "tokens per second" are printed in the output.
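
As a rough point of comparison for that "tokens per second" number, a baseline throughput measurement around a standard Hugging Face generate() call could look like the sketch below. The checkpoint name and generation settings are placeholders, and the actual example_eagle.py interface may differ.

```python
# Minimal throughput sketch around a plain Hugging Face generate() call.
# The checkpoint and settings are placeholders; example_eagle.py's interface may differ.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lmsys/vicuna-7b-v1.3"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

prompt = "Explain speculative decoding in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
elapsed = time.perf_counter() - start

new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(tokenizer.decode(output[0], skip_special_tokens=True))
print(f"tokens per second: {new_tokens / elapsed:.2f}")
```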

How has this PR been tested?

Tested on Intel PVCs and CPUs

siddhivelankar23 avatar Apr 23 '24 06:04 siddhivelankar23

I don't have any questions about the PR itself, but would you mind if I ask a few questions about the algorithm? I've only had a quick look at several papers in this domain.

1. Does the high acceptance rate translate into the promised speedup? Based solely on the model structure, I would expect Medusa to be a little faster.
2. There is a lot of speed data in the paper; is there any way to compare accuracy, or should we just take the acceptance rate as the accuracy?
3. Is the attention tree structure general across models? I assume Medusa learns its structure from data, so it may differ from model to model.

wenhuach21 avatar Apr 28 '24 08:04 wenhuach21

Hello, I am Yuhui Li, the author of the EAGLE paper, and I am here to answer your question.

Does the high acceptance rate translate into the promised speedup? Based solely on the model structure, I would expect Medusa to be a little faster.

The acceptance rate determines how many tokens the target LLM generates in each forward pass. EAGLE's draft model is slower than Medusa's, but the target LLM accepts more tokens each time, so the overall acceleration ratio is higher. Using MT-bench as the test dataset to speed up Vicuna 7B, EAGLE allows Vicuna 7B to accept an average of 3.86 tokens per forward pass, significantly higher than the 2.51 tokens with Medusa. Since the target LLM (Vicuna 7B) is much larger than the draft model, the gain from the higher acceptance rate more than offsets the cost of the slower draft model, making EAGLE about 1.5x faster than Medusa.
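
To make the trade-off concrete, a back-of-the-envelope model is speedup ≈ τ / (1 + c), where τ is the average number of tokens accepted per target forward pass and c is the per-cycle draft overhead measured in target-forward equivalents. In the sketch below, only the acceptance lengths (3.86 and 2.51) come from the comment above; the overhead values are illustrative assumptions, not measured numbers.

```python
# Back-of-the-envelope speedup model: speedup ~= tau / (1 + c),
#   tau = average tokens accepted per target forward pass,
#   c   = draft overhead per cycle, in units of one target forward pass.
# Only 3.86 and 2.51 come from the discussion; the overheads are assumed.

def estimated_speedup(tau: float, draft_overhead: float) -> float:
    return tau / (1.0 + draft_overhead)

eagle  = estimated_speedup(tau=3.86, draft_overhead=0.15)  # slower autoregressive draft head (assumed cost)
medusa = estimated_speedup(tau=2.51, draft_overhead=0.05)  # near-free extra decoding heads (assumed cost)

print(f"EAGLE  ~{eagle:.2f}x over vanilla decoding")
print(f"Medusa ~{medusa:.2f}x over vanilla decoding")
print(f"EAGLE / Medusa ~{eagle / medusa:.2f}x")
```

With these assumed overheads the EAGLE-to-Medusa ratio comes out around 1.4x, in the same ballpark as the ~1.5x figure quoted above.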

There is a lot of speed data in the paper; is there any way to compare accuracy, or should we just take the acceptance rate as the accuracy?

Of course. We can use the output of the target model as the label and treat the draft model as a classifier. EAGLE's top-1 accuracy is about 0.8, while Medusa's is about 0.6.
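
A minimal sketch of that measurement, assuming both models expose next-token logits over the same vocabulary: take the target model's greedy token at each position as the label and count how often the draft model's top-1 prediction matches. `target_logits` and `draft_logits` are hypothetical stand-ins.

```python
# Sketch of draft-model top-1 accuracy against target-model labels.
# `target_logits` and `draft_logits` are hypothetical stand-ins returning
# next-token logits of shape (seq_len, vocab_size) for a token sequence.
import torch

def draft_top1_accuracy(token_ids: torch.Tensor, target_logits, draft_logits) -> float:
    labels = target_logits(token_ids).argmax(dim=-1)  # target's greedy next token at each position
    preds = draft_logits(token_ids).argmax(dim=-1)    # draft's top-1 prediction at each position
    return (preds == labels).float().mean().item()

# Toy check with random "models" over a vocabulary of 100 tokens.
vocab, seq_len = 100, 64
fake = lambda ids: torch.randn(ids.shape[-1], vocab)
print(draft_top1_accuracy(torch.randint(0, vocab, (seq_len,)), fake, fake))
```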

Is the attention tree structure general across models? I assume Medusa learns its structure from data, so it may differ from model to model.

Yes, the tree structure is general. Undoubtedly, tuning the tree structure for each model would give the best results, but EAGLE already achieves quite good results with a single general tree structure.
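
To illustrate what a static draft tree looks like in practice, the sketch below encodes a small tree as a parent-index list and builds the corresponding tree-attention mask, where each draft node may attend to itself and its ancestors. The particular shape here is made up for illustration; it is not the tree EAGLE ships with.

```python
# Illustrative static draft tree and its tree-attention mask.
# parents[i] is the parent node of draft node i (-1 = the current context / root).
# This particular shape is made up; it is not the tree EAGLE actually uses.
import torch

parents = [-1, -1, -1,   # three candidates for the first draft step
            0,  0,  1,   # children expanding the two most promising first-step candidates
            3]           # one deeper node under the best path

n = len(parents)
mask = torch.zeros(n, n, dtype=torch.bool)
for i in range(n):
    j = i
    while j != -1:       # walk up to the root so node i attends to every ancestor
        mask[i, j] = True
        j = parents[j]

print(mask.int())
```

Because sibling drafts share their ancestors through this mask, the target model can verify the entire candidate tree in a single forward pass.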

Liyuhui-12 avatar Apr 30 '24 15:04 Liyuhui-12