
Make inference faster in Python

masato0412 opened this issue on Mar 10, 2020

There is a difference between the speed of lm_fpga.elf and that of the demo script in Python.

lm_fpga.elf result
-------------------------------------------------------------
Comparison: Default network test  succeeded!!!
-------------------------------------------------------------
TotalInitTime 42411,  sum:42.411ms
TotalRunTime 85897,  sum:85.897ms
..Lookup 3612,  sum:3.612ms
..QuantizedConv2D 4751,9907,2638,1261,1051,937,587,325,146,207,376,1102,201,569,2021,7841,10844,10853,  sum:55.617ms
....Convert Tensor 1511,1469,331,95,41,21,13,9,16,25,54,24,8,13,42,296,1502,1558,  sum:7.028ms
....Sync UDMABuf Input 1406,1349,470,262,137,76,46,21,46,73,133,78,22,45,137,466,1355,1303,  sum:7.425ms
....Conv2D TCA 851,6128,1560,803,799,790,491,257,42,61,94,961,123,415,1564,6122,6130,6123,  sum:33.314ms
....Sync UDMABuf Output 928,929,249,79,49,31,18,21,20,29,76,19,29,77,259,932,1822,1840,  sum:7.407ms
..Memcpy 1469,1462,439,99,55,30,14,18,17,32,92,14,29,95,391,1536,  sum:5.792ms
..ExtractImagePatches 2012,637,82,69,25,8,39,78,  sum:2.95ms
..func_ConcatOnDepth 40,3080,  sum:3.12ms
..DepthToSpace 46,163,643,2560,  sum:3.412ms
..QuantizedConv2D_ApplyScalingFactor 2905,2112,  sum:5.017ms
..BatchNorm 1908,946,  sum:2.854ms
..Add 1378,624,  sum:2.002ms
run.py result
INFO:__main__:Benchmark avg result(sec) for 20 trials: pre_process: 0.03704105  inference: 0.09313415 post_process: 0.0644722  Total: 0.1946474

A difference of about 10 ms is seen in inference. Measuring the processing in nnlib.py:

>>> flatten, 3.1120777130126953 ms
>>> cast, 3.3941268920898438 ms
>>> zeros, 0.5159378051757812 ms
>>> inference, 85.60395240783691 ms
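
The timings above can be collected with a simple wrapper; the helper below is a minimal sketch for illustration, not the actual nnlib.py code:

# Illustrative timing helper; not the actual nnlib.py code.
import time

def timed(label, func, *args, **kwargs):
    start = time.perf_counter()
    result = func(*args, **kwargs)
    print(">>> {}, {} ms".format(label, (time.perf_counter() - start) * 1000.0))
    return result

# e.g. flat = timed("flatten", tensor.flatten)
#      data = timed("cast", flat.astype, np.float32)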

This is caused by Python processing before inference.

tensor.flatten() → tensor.ravel()
flatten returns a copy of the input array, flattened to one dimension (it uses memory separate from the original array).
ravel returns a view whenever possible (it refers to the same memory as the original array).
>>> ravel, 0.02288818359375 ms
>>> np.unique(tensor.ravel() - tensor.flatten())
>>> [0.]
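
A minimal sketch of the difference (the array shape below is an arbitrary example, not the network's real input size):

import numpy as np

tensor = np.random.rand(1, 128, 128, 3).astype(np.float32)

copied = tensor.flatten()  # always allocates new memory and copies the data
view = tensor.ravel()      # returns a view (no copy) when the array is contiguous

print(view.base is tensor)           # True: same memory as the original array
print(np.array_equal(copied, view))  # True: values are identical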

Eliminated the cast to float32 before inference; instead, the array is fixed to float32 when converting from PIL to a numpy array in the resize function.

>>> pre_process: 0.03395725

Preprocessing is slightly faster.
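
A minimal sketch of the idea; the function name and PIL resampling filter are assumptions for illustration, not the actual blueoil preprocessing code:

# Illustrative only: convert to float32 at the PIL -> numpy step,
# so no separate cast is needed right before inference.
import numpy as np
from PIL import Image

def resize(image, size):
    resized = image.resize(size, Image.BILINEAR)
    return np.asarray(resized, dtype=np.float32)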

A zero array with the same shape as the output was generated on every call; instead, the buffer is allocated once and reused, so each inference overwrites the result of the previous one.
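
A minimal sketch of that change, assuming a hypothetical wrapper around the shared library (the class and method names are illustrative, not the actual nnlib.py API):

import numpy as np

class InferenceRunner:
    def __init__(self, lib, output_shape):
        self._lib = lib
        # Allocate the output buffer once instead of calling np.zeros()
        # on every run.
        self._output = np.zeros(output_shape, dtype=np.float32)

    def run(self, flat_input):
        # Hypothetical call that fills self._output in place, overwriting
        # the result of the previous inference.
        self._lib.run(flat_input, self._output)
        return self._output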

result:

INFO:__main__:Benchmark avg result(sec) for 20 trials: pre_process: 0.03395725  inference: 0.0850629 post_process: 0.0656032  Total: 0.18462335

The inference speed is now about the same as that of lm_fpga.elf.
