[NPUW] Add Eagle3 (top-1) pipeline support with new I/O
Details:
This PR introduces support for Eagle3 (top-1) speculative decoding in the NPUW. The main changes include:
- Added a new `llm_eagle3_extension` module to handle Eagle3-specific input/output logic, including model role detection (Draft/Target), input padding, and chunked processing.
- Updated `LLMInferRequest` to automatically detect Eagle3 models and manage Eagle3 input/output tensors during the prefill and generate stages.
- Modified the model reshaping and output redirection functions to support the new Eagle3 layer names and shapes.
These changes enable integration of Eagle3 (top-1) speculative decoding models with the NPU plugin.
Tickets:
build_jenkins
Hello Dear @GuoliangShiIntel ! Thanks a lot for such a great contribution!!
I left my comments below. One of my main concerns is fail-safe execution of `Eagle3Extension`. Should we explicitly forbid all code branches that don't fit `Eagle3` requirements but are marked as `Eagle3` execution? I think fail-safe execution can hide issues in the future.
@AsyaPronina Thanks for your detailed review and valuable suggestions. I fully understand your concerns about fail-safe execution. I've added static assertions to all `Eagle3Extension` public functions to address this.
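For illustration, here is a minimal Python sketch of the fail-fast pattern discussed above. The real `Eagle3Extension` lives in C++ inside NPUW; the class and method names here are only analogues, and the body is a placeholder, not the actual implementation:

```python
class Eagle3Extension:
    """Toy analogue of the guarded-entry-point pattern: every Eagle3-specific
    public method asserts that the model was actually detected as Eagle3, so
    a misrouted call fails loudly instead of silently doing the wrong thing."""

    def __init__(self, is_eagle3_model: bool):
        self._is_eagle3 = is_eagle3_model

    def is_eagle3_model(self) -> bool:
        return self._is_eagle3

    def store_hidden_state_inputs(self, inputs):
        # Guard: forbid any Eagle3-marked branch when the model is not Eagle3.
        assert self._is_eagle3, "Eagle3Extension called on a non-Eagle3 model"
        return list(inputs)  # placeholder for the real tensor bookkeeping
```

The point of the pattern is that misuse surfaces immediately at the call site rather than being absorbed by a fail-safe fallback.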
build_jenkins
@AsyaPronina @AlexanderKalistratov
Today, I aligned with the GPU pipeline and identified some model structure changes aimed at improving performance. Consequently, we will implement similar changes for the NPU pipeline, as we will be sharing the same model transformation.
The changes are:
- Move the fully connected layer from the draft model to the target model. The target model will concatenate three hidden states along the last dimension and then use a fully connected layer to convert the result back to the original shape. Previously, the target output shape was `[1, token_len, 3*embedding_size]`; now the output shape is `[1, token_len, embedding_size]`.
- Remove the `internal_hidden_states` input from the draft model. After the first change, the `hidden_states` input has the same shape as `internal_hidden_states`. Additionally, since only one of these inputs is active at any stage while the other remains zero, and they are summed together, we can keep a single input and control which value is passed in through the pipeline.
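The two changes above can be sketched in a few lines of Python. This is only a shape-level illustration under the assumptions stated in the list (function names and the boolean flag are hypothetical, not part of the actual transformation code):

```python
def target_output_shape(token_len: int, embedding_size: int, fc_in_target: bool):
    """Shape of the target model's hidden-state output: three hidden states
    are concatenated along the last dimension; once the fully connected layer
    is moved into the target model, the concatenation is projected back to
    embedding_size."""
    if fc_in_target:
        return (1, token_len, embedding_size)      # new behavior
    return (1, token_len, 3 * embedding_size)      # previous behavior


def merged_hidden_state(hidden_states, internal_hidden_states):
    """The draft model used to sum two inputs of which exactly one is
    non-zero at any stage, so the sum always equals whichever input is
    active; a single input carrying the active value is equivalent."""
    return [h + i for h, i in zip(hidden_states, internal_hidden_states)]
```

Because the sum degenerates to a selection, the pipeline can drop `internal_hidden_states` and feed the active value through the single remaining input.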
The corresponding NPUW change is in this commit.
Hi @AsyaPronina @AlexanderKalistratov I've addressed all review comments. What's the next step? Do we need additional reviews, or should we wait for the GenAI Pipeline PR to be ready before merging?
build_jenkins
build_jenkins
build_jenkins
Hello! Validation results are all successful! PR can be merged!
However, I would like to ask you to make a follow-up PR where `m_eagle3_ext.store_hidden_state_inputs(*this, inputs);` in `llm_infer_request.cpp` will also be wrapped under the `if (m_eagle3_ext.is_eagle3_model())` condition. It is not required per the code logic, but it would be great to have all usages of `Eagle3Extension` in `LLMInferRequest` aligned!