Custom model has very slow performance (fps)
Hi, I tried to run a custom model and it runs very slowly compared to YOLO. I tested with examples/vision/ai_vision/nn_forward.py and my model had a forward time of ~280 ms compared to 11 ms for YOLOv8n, even though my model is 4x smaller. Actually, I'm trying to run the SuperPoint CNN.
I've exported the PyTorch model to ONNX and it runs fine; on CPU it has a comparable ~200 ms forward time. The model structure from Netron.app is attached.
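For reference, the CPU timing can be reproduced with a minimal onnxruntime sketch like the one below (this assumes onnxruntime is available and that the ONNX input tensor is named "input" with shape 1x1x480x640, as in the cvimodel dump further down; adjust the path and name to your export):
import time
import numpy as np
import onnxruntime as ort
# load the exported ONNX model on CPU
sess = ort.InferenceSession("superpoint_dynamic_simple.onnx", providers=["CPUExecutionProvider"])
# grayscale 480x640 input scaled to 0..1 (matches mean=0, scale=1/255 used in the script below)
x = np.random.randint(0, 256, (1, 1, 480, 640)).astype(np.float32) / 255.0
sess.run(None, {"input": x})  # warm-up
t0 = time.perf_counter()
for _ in range(10):
    sess.run(None, {"input": x})
print("avg forward: %.1f ms" % ((time.perf_counter() - t0) / 10 * 1000))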
Then I used this script to quantize the model to the cvitek format, setting the output tensors to the last convolution layers:
convert_model.sh
#!/bin/bash
set -e
net_name=superpoint_dynamic_simple
input_w=640
input_h=480
mkdir -p workspace
cd workspace
# convert to mlir
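# (scale 0.00392156862745098 = 1/255, i.e. the 0..255 gray input is normalized to 0..1;
#  output_names cuts the graph at the two head convolutions: semi and /convDb/Conv)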
model_transform.py \
--model_name ${net_name} \
--model_def ../${net_name}.onnx \
--input_shapes [[1,1,${input_h},${input_w}]] \
--mean "0" \
--scale "0.00392156862745098" \
--keep_aspect_ratio \
--pixel_format gray \
--channel_format nchw \
--output_names "semi,/convDb/Conv_output_0" \
--test_input ../test_image.jpg \
--test_result ${net_name}_top_outputs.npz \
--tolerance 0.99,0.99 \
--mlir ${net_name}.mlir
# export bf16 model
# don't use --quant_input, keep float32 input for easier coding
model_deploy.py \
--mlir ${net_name}.mlir \
--quantize BF16 \
--processor cv181x \
--test_input ${net_name}_in_f32.npz \
--test_reference ${net_name}_top_outputs.npz \
--model ${net_name}_bf16.cvimodel
# export int8 model
echo "calibrate for int8 model"
run_calibration.py ${net_name}.mlir \
--dataset ../calibration_images \
--input_num 200 \
-o ${net_name}_cali_table
echo "convert to int8 model"
model_deploy.py \
--mlir ${net_name}.mlir \
--quantize INT8 \
--quant_input \
--calibration_table ${net_name}_cali_table \
--processor cv181x \
--test_input ${net_name}_in_f32.npz \
--test_reference ${net_name}_top_outputs.npz \
--tolerance 0.9,0.6 \
--model ${net_name}_int8.cvimodel
Although my model has only 18 nodes compared to 80 in yolov8n, it needs an enormous amount of ION memory: 46.7 MB (CviModel Need ION Memory Size: (46.68 MB)) compared to 4.4 MB for YOLO (CviModel Need ION Memory Size: (4.40 MB)).
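A rough back-of-the-envelope check of where those 46.7 MB go, using the sizes from the cvimodel_tool dump below (assuming the ION requirement roughly covers the weight and cmdbuf sections plus the shared activation memory and the I/O tensors):
# rough ION memory estimate from the cvimodel_tool dump below
weight  = 1313680                        # weight section size
cmdbuf  = 1837776                        # cmdbuf section size
shared  = 39321600                       # shared_gmem_size = 2 x (64 x 480 x 640) int8,
                                         # i.e. two full-resolution 64-channel feature maps
io_in   = 1 * 480 * 640                  # int8 input tensor
io_out  = (256 * 60 * 80 + 65 * 60 * 80) * 4   # the two fp32 output tensors
total = weight + cmdbuf + shared + io_in + io_out
print(total / 2**20)                     # ~46.7 MiB, matching the reported ION size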
Also, the tensor map in the resulting cvimodel looks odd: it's a batch of ReLUs, whereas the ONNX model has a Conv→Relu→Conv→Relu structure.
I have read the "cvitek tpu quick start guide" and the tpumlir.org docs and didn't find any clue.
I'm definitely missing something, please help.
cvimodel_tool full dump
Cvitek Runtime (1.4.0)t4.1.0-23-gb920beb@20230910
Mlir Version: v1.14-20241231
Cvimodel Version: 1.4.0
superpoint_dynamic_simple Build at 2025-01-19 03:07:23
For cv181x chip ONLY
CviModel Need ION Memory Size: (46.68 MB)
Sections:
ID TYPE NAME SIZE OFFSET ENCRYPT COMPRESS MD5
000 weight weight 1313680 0 False False 22500857e07e66db361ac62bbc1b4780
001 cmdbuf subfunc_0 1837776 1313680 False False d7f1d41bfa3e2e7f32e0035ca91e8639
WeightMap:
ID OFFSET SIZE TYPE N C H W NAME
000 467072 576 int8 1 64 1 9 /relu/Relu_output_0_Relu_bias_packed
001 902400 576 int8 1 64 9 1 /relu/Relu_output_0_Relu_filter_reordered
002 467648 576 int8 1 64 1 9 /relu_1/Relu_output_0_Relu_bias_packed
003 865536 36864 int8 1 64 9 64 /relu_1/Relu_output_0_Relu_filter_reordered
004 942160 576 int8 1 64 1 9 /relu_2/Relu_output_0_Relu_bias_packed
005 902976 36864 int8 1 64 9 64 /relu_2/Relu_output_0_Relu_filter_reordered
006 939840 576 int8 1 64 1 9 /relu_3/Relu_output_0_Relu_bias_packed
007 468224 36864 int8 1 64 9 64 /relu_3/Relu_output_0_Relu_filter_reordered
008 940416 1152 int8 1 128 1 9 /relu_4/Relu_output_0_Relu_bias_packed
009 942736 73728 int8 1 128 9 64 /relu_4/Relu_output_0_Relu_filter_reordered
010 1165072 1152 int8 1 128 1 9 /relu_5/Relu_output_0_Relu_bias_packed
011 1166224 147456 int8 1 128 9 128 /relu_5/Relu_output_0_Relu_filter_reordered
012 1016464 1152 int8 1 128 1 9 /relu_6/Relu_output_0_Relu_bias_packed
013 1017616 147456 int8 1 128 9 128 /relu_6/Relu_output_0_Relu_filter_reordered
014 465920 1152 int8 1 128 1 9 /relu_7/Relu_output_0_Relu_bias_packed
015 318464 147456 int8 1 128 9 128 /relu_7/Relu_output_0_Relu_filter_reordered
016 316160 2304 int8 1 256 1 9 /relu_8/Relu_output_0_Relu_bias_packed
017 21248 294912 int8 1 256 9 128 /relu_8/Relu_output_0_Relu_filter_reordered
018 941568 585 int8 1 65 1 9 semi_Conv_bias_packed
019 2304 16640 int8 1 65 1 256 semi_Conv_filter_reordered
020 0 2304 int8 1 256 1 9 /relu_9/Relu_output_0_Relu_bias_packed
021 570624 294912 int8 1 256 9 128 /relu_9/Relu_output_0_Relu_filter_reordered
022 18944 2304 int8 1 256 1 9 /convDb/Conv_output_0_Conv_bias_packed
023 505088 65536 int8 1 256 1 256 /convDb/Conv_output_0_Conv_filter_reordered
Program #0
batch_num : 0
private_gmem_size: 0
shared_gmem_size: 39321600
inputs : input
outputs : semi_Conv_f32,/convDb/Conv_output_0_Conv_f32
routines :
#00 tpu
inputs : input
outputs : semi_Conv_f32,/convDb/Conv_output_0_Conv_f32
section : subfunc_0
tensor_map :
ID OFFSET TYPE N C H W QSCALE MEM NAME
000 0 int8 1 1 480 640 127.000000 io_mem input
001 0 int8 1 64 480 640 0.339957 shared /relu/Relu_output_0_Relu
002 19660800 int8 1 64 480 640 0.165536 shared /relu_1/Relu_output_0_Relu
003 0 int8 1 64 240 320 0.165536 shared /pool/MaxPool_output_0_MaxPool
004 4915200 int8 1 64 240 320 0.231064 shared /relu_2/Relu_output_0_Relu
005 0 int8 1 64 240 320 0.269022 shared /relu_3/Relu_output_0_Relu
006 4915200 int8 1 64 120 160 0.269022 shared /pool_1/MaxPool_output_0_MaxPool
007 0 int8 1 128 120 160 0.167438 shared /relu_4/Relu_output_0_Relu
008 2457600 int8 1 128 120 160 0.154103 shared /relu_5/Relu_output_0_Relu
009 0 int8 1 128 60 80 0.154103 shared /pool_2/MaxPool_output_0_MaxPool
010 614400 int8 1 128 60 80 0.248690 shared /relu_6/Relu_output_0_Relu
011 0 int8 1 128 60 80 0.283347 shared /relu_7/Relu_output_0_Relu
012 614400 int8 1 256 60 80 0.033683 shared /relu_8/Relu_output_0_Relu
013 2457600 int8 1 65 60 80 0.335987 shared semi_Conv
014 1228800 int8 1 256 60 80 0.177987 shared /relu_9/Relu_output_0_Relu
015 0 int8 1 256 60 80 4.332494 shared /convDb/Conv_output_0_Conv
016 0 fp32 1 256 60 80 1.000000 io_mem /convDb/Conv_output_0_Conv_f32
017 0 fp32 1 65 60 80 1.000000 io_mem semi_Conv_f32
Maybe it's because of the input size? Try changing to a smaller input size.
I tried YOLO11n at 640x640 and it takes ~11 ms. Are there any profiler tools I can use to investigate performance bottlenecks? I noticed that the toolkit used to quantize and compile the model for the TPU uses some TPU emulation, but it lacks documentation.
Maybe you can change to a different output node to debug which node spends so much time.
Your model is simple; changing the output node (e.g. setting --output_names to an intermediate tensor from the map above) and exporting either bf16 or int8 is both fast, so just try it.
And don't use the --quant_input arg if you use MaixPy.