
Use Dilated Attention as the core mechanism instead of vanilla Attention with the Llama model

younesselbrag opened this issue 2 years ago · 0 comments

I want to ask whether I can replace the vanilla attention used in the base model with Dilated Attention and then do the fine-tuning. The idea is to reduce the complexity of attention and increase the context window. Does DeepSeek use Llama 2 as its base model, i.e. the same architecture? If so, can I load the checkpoint weights of layers such as the norm layers and the feed-forward blocks, or do I need to refactor the LLM from scratch? Or is there any method to adapt the weights or share weights?
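For illustration, here is a minimal sketch (not from the DeepSeek-Coder repo) of what swapping the attention module while keeping the pretrained norm/MLP weights might look like, assuming DeepSeek-Coder follows the Llama layer layout in `transformers`. `DilatedAttention` is a hypothetical module you would have to implement yourself; the checkpoint name and attribute paths are assumptions based on the public Llama-style implementation.

```python
# Sketch only: replace each layer's self-attention with a custom dilated-attention
# module while reusing the rest of the pretrained checkpoint, then fine-tune.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM

class DilatedAttention(nn.Module):
    """Hypothetical placeholder for a dilated-attention implementation (e.g. LongNet-style)."""
    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.q_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.k_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.v_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.o_proj = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, hidden_states, **kwargs):
        # ... sparse/dilated attention computation would go here ...
        raise NotImplementedError

# Assumed checkpoint; any Llama-architecture model should work the same way.
model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/deepseek-coder-6.7b-base", torch_dtype=torch.bfloat16
)

cfg = model.config
for layer in model.model.layers:
    new_attn = DilatedAttention(cfg.hidden_size, cfg.num_attention_heads)
    # Optionally reuse the pretrained q/k/v/o projection weights so that only the
    # attention *pattern* changes, not the learned projections.
    new_attn.q_proj.load_state_dict(layer.self_attn.q_proj.state_dict())
    new_attn.k_proj.load_state_dict(layer.self_attn.k_proj.state_dict())
    new_attn.v_proj.load_state_dict(layer.self_attn.v_proj.state_dict())
    new_attn.o_proj.load_state_dict(layer.self_attn.o_proj.state_dict())
    layer.self_attn = new_attn

# Embeddings, RMSNorm layers and MLP blocks keep their checkpoint weights;
# fine-tuning would then adapt the model to the new attention pattern.
```

This is one way to reuse the existing checkpoint without refactoring the whole model from scratch, but whether the pretrained weights transfer usefully to a dilated attention pattern would have to be verified by fine-tuning.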

younesselbrag · Jan 05 '24