
[NPUW] Add Eagle3 (top-1) pipeline support with new I/O

Open GuoliangShiIntel opened this issue 2 months ago • 6 comments

Details:

This PR introduces support for Eagle3 (top-1) speculative decoding in NPUW. The main changes include:

  1. Added a new llm_eagle3_extension module to handle Eagle3-specific input/output logic, including model role detection (Draft/Target), input padding, and chunked processing.
  2. Updated LLMInferRequest to automatically detect Eagle3 models and manage Eagle3 input/output tensors during prefill and generate stages.
  3. Modified model reshaping and output redirection functions to support new Eagle3 layer names and shapes.

These changes enable integration of Eagle3 (top-1) speculative decoding models with the NPU plugin.
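The role-detection step in change 1 above can be sketched as follows. This is a hypothetical illustration, not the actual `llm_eagle3_extension` code: the tensor names (`hidden_states`, `last_hidden_state`) and the detection heuristic are assumptions, since the PR only states that the extension detects the Draft/Target role from the model's I/O.

```cpp
#include <string>
#include <vector>

// Illustrative sketch only: infer the Eagle3 model role from tensor names.
// The real extension inspects the compiled model's ports; these names are assumptions.
enum class Eagle3Role { None, Draft, Target };

Eagle3Role detect_role(const std::vector<std::string>& input_names,
                       const std::vector<std::string>& output_names) {
    auto has = [](const std::vector<std::string>& names, const std::string& n) {
        for (const auto& s : names) {
            if (s == n) return true;
        }
        return false;
    };
    // A draft model consumes hidden states produced by the target model.
    if (has(input_names, "hidden_states")) return Eagle3Role::Draft;
    // A target model exposes hidden states for the draft model to consume.
    if (has(output_names, "last_hidden_state")) return Eagle3Role::Target;
    return Eagle3Role::None;
}
```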

Tickets:

image

GuoliangShiIntel avatar Oct 21 '25 03:10 GuoliangShiIntel

build_jenkins

AsyaPronina avatar Dec 01 '25 16:12 AsyaPronina

Hello dear @GuoliangShiIntel! Thanks a lot for such a great contribution!!

I left my comments below. One of my main concerns is fail-safe execution of Eagle3Extension. Should we explicitly forbid all code branches that don't fit Eagle3 requirements but are marked as Eagle3 execution? I think fail-safe execution can hide issues in the future.

@AsyaPronina Thanks for your detailed review and valuable suggestions. I fully understand your concerns about fail-safe execution. I've added static assertions to all Eagle3Extension public functions to address this.
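The guard pattern discussed above could look like the following. This is a minimal sketch, assuming a runtime check rather than the actual mechanism used in the PR; the class shape and member names (`mark_eagle3`, `update_inputs`, `require_eagle3`) are illustrative, not the real NPUW API.

```cpp
#include <stdexcept>
#include <string>

// Illustrative sketch: each public Eagle3 entry point verifies it is running on
// an Eagle3 model instead of silently falling through, so misuse fails loudly.
class Eagle3Extension {
public:
    bool is_eagle3_model() const { return m_is_eagle3; }
    void mark_eagle3() { m_is_eagle3 = true; }

    void update_inputs(/* ... */) {
        require_eagle3("update_inputs");
        // ... Eagle3-specific input handling would go here ...
    }

private:
    void require_eagle3(const char* where) const {
        if (!m_is_eagle3) {
            throw std::runtime_error(std::string(where) + " called on a non-Eagle3 model");
        }
    }
    bool m_is_eagle3 = false;
};
```

Failing fast like this surfaces mismatched execution paths immediately instead of hiding them behind fail-safe defaults.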

GuoliangShiIntel avatar Dec 02 '25 05:12 GuoliangShiIntel

build_jenkins

AsyaPronina avatar Dec 02 '25 13:12 AsyaPronina

@AsyaPronina @AlexanderKalistratov

Today, I aligned with the GPU pipeline and identified some model structure changes aimed at improving performance. Consequently, we will implement similar changes for the NPU pipeline, as we will be sharing the same model transformation.

The changes are:

  1. Move the fully connected layer from the draft model to the target model. The target model will concatenate three hidden states along the last dimension and then use a fully connected layer to project the result back to the original shape. Previously, the target output shape was [1, token_len, 3*embedding_size]; now it is [1, token_len, embedding_size].

  2. Remove the internal_hidden_states input from the draft model. After the first change, the hidden_states input has the same shape as internal_hidden_states. Moreover, since only one of these inputs is active at any stage while the other remains zero and they are summed together, we can keep a single input and control which value is passed in through the pipeline.
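The shape arithmetic behind change 1 can be checked with a small sketch. Dimension values here are assumptions for illustration; the PR only specifies the symbolic shapes [1, token_len, 3*embedding_size] before and [1, token_len, embedding_size] after.

```cpp
#include <array>
#include <cstddef>

// Illustrative shape math for the new Eagle3 target model I/O:
// three hidden states [1, token_len, E] are concatenated along the last axis,
// then a fully connected layer projects [1, token_len, 3*E] back to [1, token_len, E].
using Shape = std::array<std::size_t, 3>;

// Concatenating n tensors of shape s along the last axis multiplies that dim by n.
Shape concat_last_axis(const Shape& s, std::size_t n) {
    return {s[0], s[1], s[2] * n};
}

// A fully connected layer replaces the last dim with out_features.
Shape fully_connected(const Shape& s, std::size_t out_features) {
    return {s[0], s[1], out_features};
}
```

With, say, embedding_size = 4096 and token_len = 7, the concatenation yields [1, 7, 12288] and the FC layer restores [1, 7, 4096], matching the shapes described above.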

The corresponding NPUW change is in this commit.

GuoliangShiIntel avatar Dec 05 '25 09:12 GuoliangShiIntel

Hi @AsyaPronina @AlexanderKalistratov I've addressed all review comments. What's the next step? Do we need additional reviews, or should we wait for the GenAI Pipeline PR to be ready before merging?

GuoliangShiIntel avatar Dec 10 '25 07:12 GuoliangShiIntel

build_jenkins

AsyaPronina avatar Dec 11 '25 12:12 AsyaPronina

build_jenkins

AsyaPronina avatar Dec 16 '25 12:12 AsyaPronina

build_jenkins

AsyaPronina avatar Dec 16 '25 23:12 AsyaPronina

Hello! Validation results are all successful! The PR can be merged! However, I would like to ask you to make a follow-up PR where m_eagle3_ext.store_hidden_state_inputs(*this, inputs); in llm_infer_request.cpp is also wrapped under an if (m_eagle3_ext.is_eagle3_model()) condition. It is not required by the code logic, but it would be great to have all usages of Eagle3Extension in LLMInferRequest aligned!
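The requested follow-up pattern can be sketched as below. The types and member names are illustrative stand-ins for the real LLMInferRequest/Eagle3Extension code: the point is only that the remaining call is guarded the same way as the other Eagle3Extension usages.

```cpp
// Illustrative sketch: the store_hidden_state_inputs() call is wrapped under
// is_eagle3_model(), so non-Eagle3 requests skip the extension entirely.
struct Eagle3Ext {
    bool eagle3 = false;
    int stored_calls = 0;
    bool is_eagle3_model() const { return eagle3; }
    void store_hidden_state_inputs() { ++stored_calls; }  // stand-in for the real signature
};

void infer_prefill(Eagle3Ext& ext) {
    // Aligned usage: the extension is only touched for Eagle3 models.
    if (ext.is_eagle3_model()) {
        ext.store_hidden_state_inputs();
    }
}
```

The guard is redundant per the code logic (the call is harmless for non-Eagle3 models), but it keeps every Eagle3Extension usage visually consistent.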

AsyaPronina avatar Dec 17 '25 11:12 AsyaPronina