
Support for Hybrid Acceleration on CPU, GPU, and NPU

Open Ursue opened this issue 11 months ago • 3 comments

Hello,

I am interested in whether ipex-llm currently supports hybrid acceleration across multiple hardware accelerators. Specifically, I would like to know if it is possible to assign different tasks to different compute units, for example:

  1. Image processing assigned to the GPU
  2. Embedding assigned to the NPU
  3. Other tasks assigned to the CPU

While trying to install libraries for both NPU and GPU, I encountered dependency conflicts. This makes it difficult to configure an environment where both accelerators can be used simultaneously.

Looking forward to your response. Thank you!

Ursue · Feb 10 '25 06:02

Hi @Ursue

A few GPU-CPU hybrid features are already supported. For example, when cpu_embedding is set to True, the embedding layer is assigned to the CPU while the other layers are placed on the GPU.
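
For reference, a minimal sketch of that GPU-CPU split (the model id is a placeholder, and this assumes the ipex-llm GPU ("xpu") package is installed):

```python
from ipex_llm.transformers import AutoModelForCausalLM

# cpu_embedding=True keeps the embedding layer on the CPU;
# the remaining layers are moved to the Intel GPU below.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",  # placeholder model id
    load_in_4bit=True,
    cpu_embedding=True,
)
model = model.to('xpu')
```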

We also evaluated an NPU-CPU hybrid, but the results were not good enough.

The main blocking issues for hybrid solutions:

  1. IO across different hardware is expensive (except with unified memory, e.g., on Lunar Lake); see the rough measurement sketch below.
  2. Hybrid hardware requires a complex scheduler and per-device configurations, because different devices have different throughput and latency.
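
To make point 1 concrete, here is a rough sketch of how one might measure the host-device copy overhead; it assumes a PyTorch build with Intel GPU ("xpu") support (e.g., via intel_extension_for_pytorch), and the tensor size is arbitrary:

```python
import time
import torch

x = torch.randn(4096, 4096)  # ~64 MB of fp32 data

t0 = time.perf_counter()
y = x.to('xpu')              # host -> GPU copy
torch.xpu.synchronize()      # wait for the transfer to complete
t1 = time.perf_counter()

z = y.to('cpu')              # GPU -> host copy back
torch.xpu.synchronize()
t2 = time.perf_counter()

print(f"H2D: {(t1 - t0) * 1e3:.2f} ms, D2H: {(t2 - t1) * 1e3:.2f} ms")
```

Every cross-device layer boundary in a hybrid pipeline pays a cost like this, which is why it only goes away on unified-memory designs.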

qiyuangong · Feb 19 '25 03:02

That's exactly the problem! Both Intel and the PC manufacturers promote AI capability numbers (such as 99 TOPS or 115 TOPS), but that figure is actually the total of "CPU+NPU+GPU"! When you actually use it, you can only pick one at a time, so embarrassing...

baicl123 · Apr 17 '25 00:04

That's not a problem for most AI applications. :)

These AI PC platforms are designed with real-world AI applications in mind—like computer vision, speech recognition, and OCR—where workloads can be effectively split across devices. For example, the NPU might handle speech processing, the GPU runs lightweight vision models, and the CPU takes care of text-based tasks like OCR. Because these models are usually small, running them in parallel across devices works well and offers a good user experience.
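
As an illustration of that pattern, here is a minimal sketch using OpenVINO (a separate Intel runtime that exposes CPU, GPU, and NPU as targets); the model files and input shapes are hypothetical placeholders:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import openvino as ov

core = ov.Core()
# One small model per device (the .xml paths are placeholders).
speech = core.compile_model("speech_asr.xml", "NPU")
vision = core.compile_model("object_det.xml", "GPU")
ocr = core.compile_model("text_rec.xml", "CPU")

# Placeholder inputs with made-up shapes.
audio = np.zeros((1, 16000), dtype=np.float32)
image = np.zeros((1, 3, 224, 224), dtype=np.float32)
crops = np.zeros((1, 3, 32, 128), dtype=np.float32)

# Each model stays on its own device, so the three inferences can run
# in parallel with no cross-device tensor traffic between them.
with ThreadPoolExecutor(max_workers=3) as pool:
    f1 = pool.submit(speech, audio)
    f2 = pool.submit(vision, image)
    f3 = pool.submit(ocr, crops)
    asr_out, det_out, ocr_out = f1.result(), f2.result(), f3.result()
```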

When it comes to large language models (LLMs), the situation is different. LLMs are both compute- and memory-intensive. Splitting the workload across multiple devices often results in higher overhead and lower efficiency due to IO bottlenecks and device synchronization. In these cases, relying on a single high-throughput device like NPU/GPU is usually the better choice for performance.
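
For completeness, the single-device approach with ipex-llm looks roughly like this (the model id is again a placeholder; everything, including generation, stays on the GPU):

```python
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

model_id = "meta-llama/Llama-2-7b-chat-hf"  # placeholder model id
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True)
model = model.to('xpu')  # the whole model lives on one device
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("What is an AI PC?", return_tensors="pt").to('xpu')
with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```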

qiyuangong · Apr 18 '25 14:04