
Head_landmarks model inference on STM32N6570-DK much slower than reported (3.6 s vs 20 ms)

Open tonyzzzzz opened this issue 1 month ago • 1 comment

I am currently running the head_landmarks model from the official STM32 Model Zoo: https://github.com/STMicroelectronics/stm32ai-modelzoo/tree/main/pose_estimation/head_landmarks

The ONNX model I used is face_landmarks_v1_192_int8_pc.onnx (downloaded directly from the GitHub repository). The model executes successfully on the STM32N6570-DK board using the NPU, and the output results are correct.

However, the inference speed is much slower than expected:

Actual inference time on N6570-DK: ~3.6 seconds per frame

Reported time in the Model Zoo README: ~20.52 milliseconds per frame
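For scale, a quick calculation from the two figures above shows how large the discrepancy is (using only the numbers already quoted; nothing here is measured anew):

```python
# Compare the observed inference time against the published benchmark.
measured_s = 3.6     # observed on the STM32N6570-DK, seconds per frame
reported_ms = 20.52  # Model Zoo README figure, milliseconds per frame

slowdown = measured_s * 1000 / reported_ms  # ~175x
print(f"~{slowdown:.0f}x slower than the published benchmark")
```

A gap of roughly two orders of magnitude is far beyond normal run-to-run variation, which is why a CPU fallback or a misconfiguration seems plausible.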

I would like to confirm:

  1. Is there any specific optimization or configuration (e.g., memory placement, quantization format, build options, or runtime parameters) required to achieve the published 20 ms performance?

  2. Could this large gap indicate that part of the model is running on the CPU instead of the NPU? A quick test suggests the NPU is being invoked, but I would like to know whether there is an "official" method to verify that the NPU is actually doing the work.

  3. Is there a way to check, from the generated ai_network_report or logs, which layers are accelerated by the NPU and which ones fall back to the CPU?

Any guidance or clarification on how to reproduce the official benchmark performance would be highly appreciated.

tonyzzzzz avatar Oct 27 '25 02:10 tonyzzzzz

Hello @tonyzzzzz,

Could you share the configuration YAML file you are using, please?

Guillaume

GRATTINSTM avatar Nov 03 '25 13:11 GRATTINSTM