hongyinrong
The Vicuna-13B-v1.3 model has a smaller vocabulary, so its head overhead is relatively small. Models like LLaMA3.1-Instruct 8B, LLaMA3.3-Instruct 70B, and DeepSeek-R1-Distill-LLaMA 8B have larger vocabularies, resulting in greater head overhead.
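For a rough sense of scale, the LM-head cost grows with `hidden_size × vocab_size`. The sketch below uses the commonly published vocabulary sizes and hidden dimensions for these models (assumed standard config values, not figures taken from this thread):

```python
# Back-of-the-envelope estimate of LM-head cost per decoded token.
# The head matmul is roughly hidden_size x vocab_size multiply-adds per token.
# Configs are the standard published values (an assumption, not numbers
# from this discussion).
configs = {
    "Vicuna-13B-v1.3":              {"hidden": 5120, "vocab": 32_000},
    "LLaMA3.1-Instruct-8B":         {"hidden": 4096, "vocab": 128_256},
    "LLaMA3.3-Instruct-70B":        {"hidden": 8192, "vocab": 128_256},
    "DeepSeek-R1-Distill-LLaMA-8B": {"hidden": 4096, "vocab": 128_256},
}

for name, c in configs.items():
    head_params = c["hidden"] * c["vocab"]        # weight matrix entries
    flops_per_token = 2 * head_params             # one multiply + one add each
    print(f"{name:30s} head params = {head_params / 1e6:7.1f}M, "
          f"~{flops_per_token / 1e9:.2f} GFLOPs/token")
```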
```python
for idx, decoder_layer in enumerate(self.layers):
    if idx == len(self.layers) - 3 or idx == len(self.layers) // 2 or idx == 2:
        all_hidden_states += (hidden_states,)
```

(EAGLE/eagle/model/modeling_llama_kv.py, line 1138)
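In other words, this loop keeps the hidden states of a low, a middle, and a near-final decoder layer as the features fed to the EAGLE-3 draft head. A minimal standalone sketch of the same index selection (the helper name `select_feature_layers` is hypothetical, not part of the EAGLE codebase):

```python
def select_feature_layers(num_layers: int) -> set:
    """Return the decoder-layer indices whose hidden states are kept:
    a low layer, the middle layer, and a near-final layer."""
    return {2, num_layers // 2, num_layers - 3}

# Example: a 32-layer model (e.g. an 8B LLaMA) keeps layers {2, 16, 29}.
print(select_feature_layers(32))
```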
Theoretically, EAGLE-3 can accomplish all of these tasks, but its acceleration performance might not be optimal. We have conducted further work based on EAGLE-3. By employing mathematical modeling and considering...
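For reference, the usual starting point for this kind of modeling is the standard speculative-decoding speedup estimate (Leviathan et al., 2023). The sketch below is only that generic estimate, not necessarily the modeling used in the follow-up work mentioned above:

```python
# Generic speculative-decoding speedup model (Leviathan et al., 2023).
# alpha = per-token acceptance rate, gamma = draft length,
# c = cost of one draft step relative to one target-model step.
def expected_accepted(alpha: float, gamma: int) -> float:
    """Expected number of target tokens produced per verification step."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def speedup(alpha: float, gamma: int, c: float) -> float:
    """Walltime speedup over plain autoregressive decoding."""
    return expected_accepted(alpha, gamma) / (gamma * c + 1)

# Example: 80% acceptance, 5 draft tokens, draft step costing 5% of a target step.
print(round(speedup(alpha=0.8, gamma=5, c=0.05), 2))
```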
> Thank you for the response. If I were to experiment with the setups I mentioned, which method of doing so would you recommend?
>
> Also, the work you...