
Unable to Profile DMATask on NPU40XX

Open ColorsWind opened this issue 10 months ago • 10 comments

I am investigating the per-task time breakdown within a single model, but I cannot profile the DMATask. I have spent a week looking for anything I might have missed, without success. Below are the minimal steps to reproduce the issue. The platform is an Intel Ultra 258V laptop with openvino-nightly and numpy installed.

Step 1: Create the Sigmoid Model

import numpy as np
import openvino as ov
from openvino.runtime import op, opset1

n_len = 128
ov_type = ov.Type.f16

A = op.Parameter(ov_type, ov.Shape([n_len]))
C = opset1.sigmoid(A)
model = ov.Model(C, [A])
ov.save_model(model, "sigmoid.xml")

Step 2: Compile the Sigmoid Model with DMA Profiling Flags

vpux-translate --vpu-arch=NPU40XX \
    --vpux-profiling \
    --mlir-print-debuginfo \
    --import-IE sigmoid.xml -o sigmoid.mlir
    
vpux-opt --vpu-arch=NPU40XX \
    --default-hw-mode="profiling=true dma-profiling=true" \
    --lower-VPUIP-to-ELF sigmoid.mlir \
    -o sigmoid_out.mlir
vpux-translate --vpu-arch=NPU40XX --export-ELF sigmoid_out.mlir -o sigmoid.blob

Step 3: Run the Model

import os
os.environ['ZE_INTEL_NPU_LOGLEVEL'] = 'ERROR'

import openvino as ov
import numpy as np

core = ov.Core()
core.set_property('NPU', {
    'PERF_COUNT': True,
})
with open('sigmoid.blob', 'rb') as f:
    blob = f.read()
model = core.import_model(blob, device_name='NPU')
req = model.create_infer_request()
req.infer(np.random.random(128).astype(np.float16))
prof_info = req.profiling_info[0]
print('status', prof_info.status)
print('real_time', prof_info.real_time)
print('cpu_time', prof_info.cpu_time)
print('node_name', prof_info.node_name)
print('exec_type', prof_info.exec_type)
print('node_type', prof_info.node_type)

You will encounter the following error:

NPU_LOG: ERROR [compiler.cpp:289] Failed to get decoded profiling data in compiler

In fact, when I submit tasks directly through the Level Zero API and read the profiling results through the Level Zero NPU extension API, the DMA profiling output data is all zeros, while the DPU and SW-Kernel profiling outputs are valid. How can I obtain the DMA task profiling results?
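One way to check which task types actually report timing through the OpenVINO API is to aggregate every entry of `req.profiling_info` rather than only the first. The sketch below assumes each entry exposes `exec_type` and `real_time` (taken to be a `datetime.timedelta`), as in the script above. The helper name `summarize_by_exec_type` is hypothetical, and the sample entries are made up for illustration; they are not real NPU output.

```python
from collections import defaultdict
from datetime import timedelta
from types import SimpleNamespace


def summarize_by_exec_type(entries):
    """Sum real_time per exec_type and count the entries in each group.

    `entries` is any iterable of objects with `exec_type` and `real_time`
    attributes (assumed to match OpenVINO's ProfilingInfo objects).
    """
    totals = defaultdict(lambda: {"count": 0, "total_us": 0.0})
    for entry in entries:
        group = totals[entry.exec_type]
        group["count"] += 1
        # real_time is assumed to be a datetime.timedelta
        group["total_us"] += entry.real_time / timedelta(microseconds=1)
    return dict(totals)


# Made-up stand-ins for req.profiling_info, for illustration only.
sample = [
    SimpleNamespace(exec_type="DPU", real_time=timedelta(microseconds=12)),
    SimpleNamespace(exec_type="DPU", real_time=timedelta(microseconds=8)),
    SimpleNamespace(exec_type="DMA", real_time=timedelta(0)),
]
summary = summarize_by_exec_type(sample)
for exec_type, stats in summary.items():
    print(exec_type, stats["count"], stats["total_us"])
```

With a real request, `req.profiling_info` would be passed in place of `sample`. The `exec_type` strings are whatever the plugin reports, so the labels used here are placeholders.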

ColorsWind avatar Jan 06 '25 16:01 ColorsWind

I sent this issue to the OpenVINO repository. https://github.com/openvinotoolkit/openvino/issues/28285

ColorsWind avatar Jan 07 '25 02:01 ColorsWind

Hi, @ColorsWind. How did you get vpux-translate? I tried setting ov_option(ENABLE_DEVELOPER_BUILD "Enable developer build with extra validation/logging functionality" ON), but vpux-translate is not among the build outputs. I built this repo with cmake -B build -S . -DENABLE_NPU_COMPILER_BUILD=ON, following the instructions in https://github.com/intel/linux-npu-driver

Kepontry avatar Mar 28 '25 09:03 Kepontry

Hi @Kepontry. I can share my build script with you:

export OPENVINO_HOME=/home/user/openvino
export NPU_PLUGIN_HOME=/home/user/npu_compiler
cd $OPENVINO_HOME
mkdir build
cd build
cmake \
    -D CMAKE_VERBOSE_MAKEFILE=OFF \
    -D ENABLE_DEVELOPER_BUILD=ON \
    -D CMAKE_BUILD_TYPE=Release \
    -D ENABLE_MLIR_COMPILER=ON \
    -D BUILD_SHARED_LIBS=ON \
    -D OPENVINO_EXTRA_MODULES=$NPU_PLUGIN_HOME \
    -D ENABLE_LTO=ON \
    -D ENABLE_FASTER_BUILD=ON \
    -D ENABLE_CPPLINT=OFF \
    -D ENABLE_TESTS=OFF \
    -D ENABLE_FUNCTIONAL_TESTS=OFF \
    -D ENABLE_SAMPLES=OFF \
    -D ENABLE_JS=OFF \
    -D ENABLE_PYTHON=OFF \
    -D ENABLE_PYTHON_PACKAGING=OFF \
    -D ENABLE_WHEEL=OFF \
    -D ENABLE_OV_ONNX_FRONTEND=ON \
    -D ENABLE_OV_PYTORCH_FRONTEND=ON \
    -D ENABLE_OV_PADDLE_FRONTEND=OFF \
    -D ENABLE_OV_TF_FRONTEND=OFF \
    -D ENABLE_OV_TF_LITE_FRONTEND=OFF \
    -D ENABLE_OV_JAX_FRONTEND=OFF \
    -D ENABLE_OV_IR_FRONTEND=ON \
    -D THREADING=TBB \
    -D ENABLE_TBBBIND_2_5=OFF \
    -D ENABLE_SYSTEM_TBB=OFF \
    -D ENABLE_TBB_RELEASE_ONLY=OFF \
    -D ENABLE_HETERO=OFF \
    -D ENABLE_MULTI=OFF \
    -D ENABLE_AUTO=OFF \
    -D ENABLE_AUTO_BATCH=OFF \
    -D ENABLE_TEMPLATE=OFF \
    -D ENABLE_PROXY=OFF \
    -D ENABLE_INTEL_CPU=OFF \
    -D ENABLE_INTEL_GPU=OFF \
    -D ENABLE_NPU_PLUGIN_ENGINE=ON \
    -D ENABLE_ZEROAPI_BACKEND=OFF \
    -D ENABLE_DRIVER_COMPILER_ADAPTER=OFF \
    -D ENABLE_INTEL_NPU_INTERNAL=OFF \
    -D ENABLE_INTEL_NPU_PROTOPIPE=OFF \
    -D BUILD_COMPILER_FOR_DRIVER=ON \
    -D ENABLE_PRIVATE_TESTS=OFF \
    -D ENABLE_NPU_LSP_SERVER=ON \
    ..
cmake --build . --config Release --target vpux-opt vpux-translate -j16

You will find vpux-translate in $OPENVINO_HOME/bin.

ColorsWind avatar Mar 29 '25 11:03 ColorsWind

Hi @ColorsWind, thanks for your assistance. I found that the target vpux-translate depends on npu_translate_utils_static, and that both the BUILD_SHARED_LIBS and ENABLE_MLIR_COMPILER flags must be set to enable that target:

if(BUILD_SHARED_LIBS AND ENABLE_MLIR_COMPILER)
    add_subdirectory(vpux_translate_utils)
endif()

After manually setting BUILD_SHARED_LIBS=ON, I was able to build vpux-translate. I also noticed that this dependency is removed in the UD2025.12 version.

By the way, I also reproduced the issue on an Intel Ultra 258v system using OpenVINO built from the December 4th version.

I also opened an issue in the driver repo about the metrics used for profiling. It seems that only the first metric in the NOC metric groups works and can be used for DMA bandwidth monitoring. Btw, could you share what the DPU-related metrics mentioned earlier look like and how to collect them? I couldn't find any relevant documentation.

Kepontry avatar Mar 30 '25 08:03 Kepontry

@Kepontry

The hardware outputs a buffer as profiling output, with some bits related to the execution times of DPU, DMA, and SHAVE tasks. By parsing these bits, you can determine the start and end times of each task. The issue I discovered is that all bits related to DMA are zero, which seems to be a bug in either the hardware or firmware.
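The check described above amounts to testing whether a byte range of the raw profiling buffer is entirely zero. A minimal sketch follows. The section names, offsets, and sizes are hypothetical placeholders, since the real buffer layout is defined by the compiler and is not given in this thread, and the buffer itself is fabricated.

```python
def zero_sections(buffer, sections):
    """Return the names of sections whose bytes are all zero.

    `buffer` is a bytes-like object holding the raw profiling output.
    `sections` maps a name to an (offset, size) pair. The layout used
    below is HYPOTHETICAL; the real one comes from the compiled blob.
    """
    view = memoryview(buffer)
    empty = []
    for name, (offset, size) in sections.items():
        chunk = view[offset:offset + size]
        if len(chunk) != size:
            raise ValueError(f"section {name!r} exceeds the buffer")
        if not any(chunk):
            empty.append(name)
    return empty


# Fabricated 48-byte buffer: DMA region left at zero, others filled.
raw = bytearray(48)
raw[16:32] = b"\x01" * 16   # pretend DPU timestamps
raw[32:48] = b"\x02" * 16   # pretend SW-kernel timestamps
layout = {"dma": (0, 16), "dpu": (16, 16), "sw": (32, 16)}  # hypothetical
empty_sections = zero_sections(raw, layout)
print(empty_sections)
```

A section reported here only shows that no data was written; it does not say whether the cause is in the hardware, the firmware, or the compiler.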

To collect this information, you can directly use the higher-level interfaces of OpenVINO. If you want to obtain more low-level information, you can use the Level Zero interface. However, I found that in the underlying buffer, all bits related to DMA are zero, meaning it is fundamentally impossible to obtain the execution times of DMA tasks. I spent a month on this issue and it seems unsolvable. Finally, I switched to the Ultra100 series hardware. It seems that this issue does not exist on some older platforms.

But if you are only concerned with hardware bandwidth, I recommend using some indirect methods. For example, you can construct a bandwidth-bound kernel (such as vector add) and record its execution time. Another method is to use VTune, which can collect some bandwidth-related information. (By the way, you need to use VTune on Windows; when I tested it on Linux two months ago, it did not support the functionality I mentioned earlier.)
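For the vector-add approach, the effective bandwidth can be estimated from the bytes moved and the measured time. The sketch below assumes the kernel is purely memory-bound and counts two input reads plus one output write per element. The function name and the sample numbers are made up for illustration and are not measurements.

```python
def vector_add_bandwidth_gbps(n_elements, bytes_per_element, elapsed_seconds):
    """Effective bandwidth of C = A + B in GB/s (1 GB = 1e9 bytes).

    Counts two input reads and one output write per element and assumes
    the kernel is memory-bound, so compute time is negligible.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    bytes_moved = 3 * n_elements * bytes_per_element
    return bytes_moved / elapsed_seconds / 1e9


# Made-up numbers: 1,000,000 f16 elements finishing in 1 ms.
bandwidth = vector_add_bandwidth_gbps(1_000_000, 2, 1e-3)
print(f"{bandwidth:.1f} GB/s")
```

The estimate is only meaningful when the vectors are large enough that fixed launch overhead is small relative to the transfer time.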

ColorsWind avatar Mar 30 '25 09:03 ColorsWind

That helps a lot, thanks! @ColorsWind

Kepontry avatar Mar 30 '25 09:03 Kepontry

Hello @Kepontry @ColorsWind! We’ll do our best to bring back vpux-translate in the upcoming UD18 release, along with the relevant tests and documentation. @DariaMityagina could you please take a look at the issue reported in the sub?

Maxim-Doronin avatar Apr 11 '25 22:04 Maxim-Doronin

The issue should be resolved with the latest update https://github.com/openvinotoolkit/npu_compiler/releases/tag/npu_ud_2025_18_rc1 @ColorsWind @Kepontry could you please verify?

Maxim-Doronin avatar May 02 '25 09:05 Maxim-Doronin

I compiled the npu_ud_2025_18_rc1 release of the NPU Compiler myself, then used the commands I described earlier to export the model and load it.

vpux-translate --vpu-arch=NPU40XX \
    --vpux-profiling \
    --mlir-print-debuginfo \
    --import-IE sigmoid.xml -o sigmoid.mlir
    
vpux-opt --vpu-arch=NPU40XX \
    --default-hw-mode="profiling=true dma-profiling=true" \
    --lower-VPUIP-to-ELF sigmoid.mlir \
    -o sigmoid_out.mlir
vpux-translate --vpu-arch=NPU40XX --export-ELF sigmoid_out.mlir -o sigmoid.blob

The issue still exists. I tested on Ubuntu 24.10 with linux-npu-driver versions 1.16 and 1.17, and the results were similar.

NPU_LOG: *ERROR* [compiler.cpp:334] Failed to get decoded profiling data in compiler
NPU_LOG: *ERROR* [compiler.cpp:334] Failed to get decoded profiling data in compiler
NPU_LOG: *ERROR* [compiler.cpp:334] Failed to get decoded profiling data in compiler
NPU_LOG: *ERROR* [compiler.cpp:334] Failed to get decoded profiling data in compiler
Traceback (most recent call last):
  File "/home/wyk/npu_tools/run_model.py", line 29, in <module>
    prof_info = req.profiling_info[0]
RuntimeError: Exception from src/inference/src/cpp/infer_request.cpp:231:
Exception from src/plugins/intel_npu/src/backend/src/zero_profiling.cpp:70:
L0 pfnProfilingQueryGetData result: ZE_RESULT_ERROR_UNKNOWN, code 0x7ffffffe - an action is required to complete the desired operation
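
Because reading `req.profiling_info` raises `RuntimeError` in the traceback above, a script that should keep running can guard that access. The following is only a sketch: `read_profiling_info` is a hypothetical helper, and `FailingRequest` is a stand-in that mimics the failure so the example runs without an NPU.

```python
def read_profiling_info(request):
    """Return (entries, error_message); entries is [] if the query fails."""
    try:
        return list(request.profiling_info), None
    except RuntimeError as exc:
        return [], str(exc)


class FailingRequest:
    """Stand-in for an infer request whose profiling query fails."""

    @property
    def profiling_info(self):
        raise RuntimeError(
            "L0 pfnProfilingQueryGetData result: ZE_RESULT_ERROR_UNKNOWN"
        )


entries, error = read_profiling_info(FailingRequest())
if error is not None:
    print("profiling unavailable:", error)
```

This only keeps the script alive and surfaces the message; it does not make DMA profiling data available.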

@Maxim-Doronin
Let me know if you need any further assistance!

ColorsWind avatar May 08 '25 09:05 ColorsWind

Hello @ColorsWind, unfortunately, the DMA profiling feature is not yet supported on LNL, and the option "dma-profiling" should be set to "false" by default. Ref: https://github.com/openvinotoolkit/npu_compiler/blob/32ea004638a65ec201cb8973a63662eb2d4f1617/src/vpux_compiler/include/vpux/compiler/NPU40XX/dialect/VPUIP/transforms/passes.hpp#L79

LeiChenIntel avatar May 30 '25 03:05 LeiChenIntel