Cannot reproduce the speedup results reported in the paper
Hi, thanks for your great work! I'm interested in it and am trying to reproduce the results reported in the paper. However, even though I used your open-source code and model checkpoint, I still couldn't reproduce them.
I downloaded yuhuili/EAGLE-LLaMA3-Instruct-8B and followed the instructions in the README, running gen_baseline_answer_llama3chat.py, gen_ea_answer_llama3chat.py, and speed.py in sequence on an A100-SXM4-80GB device.
Here is the speedup ratio I obtained.
Reported speedup: 3.46x
My result: 2.634x
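To put a number on the gap, here is a minimal sketch. It assumes that speedup means the ratio of decoding throughput (new tokens per second) between the EAGLE run and the baseline run. The record fields `new_tokens` and `wall_time` are hypothetical names chosen for illustration, and this is not the repository's speed.py.

```python
from typing import Mapping, Sequence

Record = Mapping[str, float]


def throughput(records: Sequence[Record]) -> float:
    """Tokens generated per second, aggregated over all records.

    Assumes each record has the hypothetical fields
    'new_tokens' and 'wall_time' (seconds).
    """
    total_tokens = sum(r["new_tokens"] for r in records)
    total_time = sum(r["wall_time"] for r in records)
    if total_time <= 0:
        raise ValueError("total wall time must be positive")
    return total_tokens / total_time


def speedup(ea_records: Sequence[Record], baseline_records: Sequence[Record]) -> float:
    """Ratio of EAGLE throughput to baseline throughput."""
    return throughput(ea_records) / throughput(baseline_records)


def relative_gap(measured: float, reported: float) -> float:
    """Fraction by which the measured value falls short of the reported one."""
    return (reported - measured) / reported


if __name__ == "__main__":
    # Numbers taken from this issue: reported 3.46x, measured 2.634x.
    gap = relative_gap(measured=2.634, reported=3.46)
    print(f"Measured speedup is {gap:.1%} below the reported value")
```

With the numbers above, the measured speedup is roughly 24% below the reported one.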
I understand that the actual speedup ratio depends on the runtime environment, but the discrepancy between this result and the one reported in the paper is still too large. Moreover, I have read the discussion in #5, and I am pretty sure that no other programs were running on the GPU during my test.
Could you provide some suggestions for reproducing the results from the paper?