Custom model has very slow performance (fps)
Hi, I tried to run a custom model and it runs very slowly compared to YOLO. I tested with examples/vision/ai_vision/nn_forward.py and my model had a forward time of ~280 ms compared to 11 ms for YOLOv8n, even though my model is 4x smaller. Actually, I'm trying to run the SuperPoint CNN.
I've exported the PyTorch model to ONNX and it runs fine; on CPU it has a comparable ~200 ms forward time. The model structure from Netron.app is attached.
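For reference, the CPU timing can be reproduced with a minimal onnxruntime sketch like the one below (this assumes onnxruntime is available and that the ONNX input tensor is named "input" with shape 1x1x480x640, as in the cvimodel dump further down; adjust the path and name to your export):
import time
import numpy as np
import onnxruntime as ort
# load the exported ONNX model on CPU
sess = ort.InferenceSession("superpoint_dynamic_simple.onnx", providers=["CPUExecutionProvider"])
# grayscale 480x640 input scaled to 0..1 (matches mean=0, scale=1/255 used in the script below)
x = np.random.randint(0, 256, (1, 1, 480, 640)).astype(np.float32) / 255.0
sess.run(None, {"input": x})  # warm-up
t0 = time.perf_counter()
for _ in range(10):
    sess.run(None, {"input": x})
print("avg forward: %.1f ms" % ((time.perf_counter() - t0) / 10 * 1000))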
Then I used this script to quantize the model to the cvitek format, setting the output tensors to the last convolution layers:
convert_model.sh
#!/bin/bash
set -e
net_name=superpoint_dynamic_simple
input_w=640
input_h=480
mkdir -p workspace
cd workspace
# convert to mlir
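# (scale 0.00392156862745098 = 1/255, i.e. the 0..255 gray input is normalized to 0..1;
#  output_names cuts the graph at the two head convolutions: semi and /convDb/Conv)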
model_transform.py \
--model_name ${net_name} \
--model_def ../${net_name}.onnx \
--input_shapes [[1,1,${input_h},${input_w}]] \
--mean "0" \
--scale "0.00392156862745098" \
--keep_aspect_ratio \
--pixel_format gray \
--channel_format nchw \
--output_names "semi,/convDb/Conv_output_0" \
--test_input ../test_image.jpg \
--test_result ${net_name}_top_outputs.npz \
--tolerance 0.99,0.99 \
--mlir ${net_name}.mlir
# export bf16 model
# don't use --quant_input, keep float32 input for easier coding
model_deploy.py \
--mlir ${net_name}.mlir \
--quantize BF16 \
--processor cv181x \
--test_input ${net_name}_in_f32.npz \
--test_reference ${net_name}_top_outputs.npz \
--model ${net_name}_bf16.cvimodel
# export int8 model
echo "calibrate for int8 model"
run_calibration.py ${net_name}.mlir \
--dataset ../calibration_images \
--input_num 200 \
-o ${net_name}_cali_table
echo "convert to int8 model"
model_deploy.py \
--mlir ${net_name}.mlir \
--quantize INT8 \
--quant_input \
--calibration_table ${net_name}_cali_table \
--processor cv181x \
--test_input ${net_name}_in_f32.npz \
--test_reference ${net_name}_top_outputs.npz \
--tolerance 0.9,0.6 \
--model ${net_name}_int8.cvimodel
Although my model has only 18 nodes compared to 80 in yolov8n, it needs an enormous amount of ION memory: 46.7 MB (CviModel Need ION Memory Size: (46.68 MB)) compared to 4.4 MB for YOLO (CviModel Need ION Memory Size: (4.40 MB)).
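A rough back-of-the-envelope check of where those 46.7 MB go, using the sizes from the cvimodel_tool dump below (assuming the ION requirement roughly covers the weight and cmdbuf sections plus the shared activation memory and the I/O tensors):
# rough ION memory estimate from the cvimodel_tool dump below
weight  = 1313680                        # weight section size
cmdbuf  = 1837776                        # cmdbuf section size
shared  = 39321600                       # shared_gmem_size = 2 x (64 x 480 x 640) int8,
                                         # i.e. two full-resolution 64-channel feature maps
io_in   = 1 * 480 * 640                  # int8 input tensor
io_out  = (256 * 60 * 80 + 65 * 60 * 80) * 4   # the two fp32 output tensors
total = weight + cmdbuf + shared + io_in + io_out
print(total / 2**20)                     # ~46.7 MiB, matching the reported ION size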
Also, the tensor map in the resulting cvimodel looks odd: it's a batch of ReLUs, whereas the ONNX model has a Conv→Relu→Conv→Relu structure.
I have read the "cvitek tpu quick start guide" and the tpumlir.org docs and didn't find any clue.
I'm definitely missing something, please help.
cvimodel_tool full dump
Cvitek Runtime (1.4.0)t4.1.0-23-gb920beb@20230910
Mlir Version: v1.14-20241231
Cvimodel Version: 1.4.0
superpoint_dynamic_simple Build at 2025-01-19 03:07:23
For cv181x chip ONLY
CviModel Need ION Memory Size: (46.68 MB)
Sections:
ID TYPE NAME SIZE OFFSET ENCRYPT COMPRESS MD5
000 weight weight 1313680 0 False False 22500857e07e66db361ac62bbc1b4780
001 cmdbuf subfunc_0 1837776 1313680 False False d7f1d41bfa3e2e7f32e0035ca91e8639
WeightMap:
ID OFFSET SIZE TYPE N C H W NAME
000 467072 576 int8 1 64 1 9 /relu/Relu_output_0_Relu_bias_packed
001 902400 576 int8 1 64 9 1 /relu/Relu_output_0_Relu_filter_reordered
002 467648 576 int8 1 64 1 9 /relu_1/Relu_output_0_Relu_bias_packed
003 865536 36864 int8 1 64 9 64 /relu_1/Relu_output_0_Relu_filter_reordered
004 942160 576 int8 1 64 1 9 /relu_2/Relu_output_0_Relu_bias_packed
005 902976 36864 int8 1 64 9 64 /relu_2/Relu_output_0_Relu_filter_reordered
006 939840 576 int8 1 64 1 9 /relu_3/Relu_output_0_Relu_bias_packed
007 468224 36864 int8 1 64 9 64 /relu_3/Relu_output_0_Relu_filter_reordered
008 940416 1152 int8 1 128 1 9 /relu_4/Relu_output_0_Relu_bias_packed
009 942736 73728 int8 1 128 9 64 /relu_4/Relu_output_0_Relu_filter_reordered
010 1165072 1152 int8 1 128 1 9 /relu_5/Relu_output_0_Relu_bias_packed
011 1166224 147456 int8 1 128 9 128 /relu_5/Relu_output_0_Relu_filter_reordered
012 1016464 1152 int8 1 128 1 9 /relu_6/Relu_output_0_Relu_bias_packed
013 1017616 147456 int8 1 128 9 128 /relu_6/Relu_output_0_Relu_filter_reordered
014 465920 1152 int8 1 128 1 9 /relu_7/Relu_output_0_Relu_bias_packed
015 318464 147456 int8 1 128 9 128 /relu_7/Relu_output_0_Relu_filter_reordered
016 316160 2304 int8 1 256 1 9 /relu_8/Relu_output_0_Relu_bias_packed
017 21248 294912 int8 1 256 9 128 /relu_8/Relu_output_0_Relu_filter_reordered
018 941568 585 int8 1 65 1 9 semi_Conv_bias_packed
019 2304 16640 int8 1 65 1 256 semi_Conv_filter_reordered
020 0 2304 int8 1 256 1 9 /relu_9/Relu_output_0_Relu_bias_packed
021 570624 294912 int8 1 256 9 128 /relu_9/Relu_output_0_Relu_filter_reordered
022 18944 2304 int8 1 256 1 9 /convDb/Conv_output_0_Conv_bias_packed
023 505088 65536 int8 1 256 1 256 /convDb/Conv_output_0_Conv_filter_reordered
Program #0
batch_num : 0
private_gmem_size: 0
shared_gmem_size: 39321600
inputs : input
outputs : semi_Conv_f32,/convDb/Conv_output_0_Conv_f32
routines :
#00 tpu
inputs : input
outputs : semi_Conv_f32,/convDb/Conv_output_0_Conv_f32
section : subfunc_0
tensor_map :
ID OFFSET TYPE N C H W QSCALE MEM NAME
000 0 int8 1 1 480 640 127.000000 io_mem input
001 0 int8 1 64 480 640 0.339957 shared /relu/Relu_output_0_Relu
002 19660800 int8 1 64 480 640 0.165536 shared /relu_1/Relu_output_0_Relu
003 0 int8 1 64 240 320 0.165536 shared /pool/MaxPool_output_0_MaxPool
004 4915200 int8 1 64 240 320 0.231064 shared /relu_2/Relu_output_0_Relu
005 0 int8 1 64 240 320 0.269022 shared /relu_3/Relu_output_0_Relu
006 4915200 int8 1 64 120 160 0.269022 shared /pool_1/MaxPool_output_0_MaxPool
007 0 int8 1 128 120 160 0.167438 shared /relu_4/Relu_output_0_Relu
008 2457600 int8 1 128 120 160 0.154103 shared /relu_5/Relu_output_0_Relu
009 0 int8 1 128 60 80 0.154103 shared /pool_2/MaxPool_output_0_MaxPool
010 614400 int8 1 128 60 80 0.248690 shared /relu_6/Relu_output_0_Relu
011 0 int8 1 128 60 80 0.283347 shared /relu_7/Relu_output_0_Relu
012 614400 int8 1 256 60 80 0.033683 shared /relu_8/Relu_output_0_Relu
013 2457600 int8 1 65 60 80 0.335987 shared semi_Conv
014 1228800 int8 1 256 60 80 0.177987 shared /relu_9/Relu_output_0_Relu
015 0 int8 1 256 60 80 4.332494 shared /convDb/Conv_output_0_Conv
016 0 fp32 1 256 60 80 1.000000 io_mem /convDb/Conv_output_0_Conv_f32
017 0 fp32 1 65 60 80 1.000000 io_mem semi_Conv_f32
Maybe it's because of the input size? Try changing to a smaller input size.
I tried YOLO11n at 640x640 and it takes ~11 ms. Are there any profiler tools I can use to investigate performance bottlenecks? I noticed that the toolkit used to quantize and compile the model for the TPU uses some TPU emulation, but it lacks documentation.
Maybe you can change to a different output node to debug which node spends so much time.
Your model is simple; changing the output node (e.g. setting --output_names to an intermediate tensor from the map above) and exporting either bf16 or int8 is both fast, so just try it.
And don't use the --quant_input arg if you use MaixPy.