Reduce high execution time when using the SDK

Open fernandoFernandeSantos opened this issue 2 years ago • 8 comments

I need to execute the MNIST example on GAP8 and retrieve the output on the host computer inside a loop, so my application would look like this:

while 1:
       mark time before the execution 
       execute the command "gapy --target=gapuino_v2 --platform=board --work-dir=<path_sdk>/examples/autotiler/Mnist/BUILD/GAP8_V2/GCC_RISCV_PULPOS run --exec-prepare --exec --binary=<path_sdk>/examples/autotiler/Mnist/BUILD/GAP8_V2/GCC_RISCV_PULPOS/Mnist" 
       calculate the execution time
       post process the output from the MNIST     

However, the average execution time that I'm getting for the simple MNIST example is 4.9s. So, is this execution time correct? Is there any way to reduce this execution time knowing that I have to post-process the application's output on a host computer for each iteration?

My host computer runs Ubuntu 20.04 and is connected to the GAPUINO by USB cable.

Thanks in advance.

fernandoFernandeSantos, Apr 29 '22 12:04

I also hit this problem on SDK 4.8.0 and later; with SDK 3.8.1 there is no problem.

aqqz, Apr 29 '22 13:04

Hi @fernandoFernandeSantos and @aqqz

FYI, the execution time you measure is not the real time on the chip. The simulator does not run in real time; it simulates how many cycles an application takes on the target.

For example, suppose an application run in GVSOC reports about 5 million cycles from the performance counters. The GAP8 frequency can be set anywhere from 0 to 175 MHz, so with the chip running at 175 MHz, those cycles take 5,000,000 / 175,000,000 s, or about 28.6 ms.

However, if you measure the GVSOC execution time itself, it may take several seconds, because GVSOC needs time on your host to "simulate" all the signals. It will therefore be faster on a high-performance PC.

The difference in execution time that you notice does not mean anything has changed on the target. It is only because we have added more features to GVSOC, which makes it heavier.
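The cycle-to-time arithmetic from the example above can be sketched in a few lines. The 5 million cycle count and the 175 MHz frequency come from the example; the 50 MHz and 100 MHz rows are added here purely for illustration.

```python
def cycles_to_ms(cycles, freq_hz):
    """Convert a cycle count to milliseconds at the given clock frequency."""
    if freq_hz <= 0:
        raise ValueError("frequency must be positive")
    return cycles / freq_hz * 1e3


cycles = 5_000_000  # cycle count from the example in the text

# 175 MHz is the frequency from the text; 50 and 100 MHz are illustrative.
for freq_mhz in (50, 100, 175):
    ms = cycles_to_ms(cycles, freq_mhz * 1e6)
    print(f"{freq_mhz:>3} MHz: {ms:.1f} ms")
```

The on-chip time depends only on the cycle count and the configured frequency, so it is independent of how long the simulator takes on the host to produce that cycle count.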

Yaooooo, Apr 30 '22 11:04

Thanks for the reply, @Yaooooo. But in my case, I'm not running on GVSOC. I need to run multiple times on the GAPUINO board and communicate over USB JTAG, so in this case, are high execution times still expected?

fernandoFernandeSantos, Apr 30 '22 11:04

Thanks for your advice, @Yaooooo. On the chip, the first run is normal, but the next and later runs stop at the cluster call.

aqqz, Apr 30 '22 11:04

@fernandoFernandeSantos unfortunately yes, it's due to the printf going via JTAG -> USB, which is very slow. One way to optimize it is to use io=uart so that the printf goes via UART.

Yaooooo, May 10 '22 11:05

@aqqz can you please describe in a bit more detail how you run it? What do you mean by the first time and the next? Did you put a loop inside, or do you just rerun it with "make run"?

Yaooooo, May 10 '22 11:05

> @fernandoFernandeSantos unfortunately yes, it's due to the printf via JTAG -> usb, which is very slow. One way to optimize it is using io=uart to have the printf via uart.

Thanks, @Yaooooo. Is there a way to pass the data through USB/UART without using printf? I don't really need the printf; I just need to post-process the data from the GAPUINO quickly. It can be an array of bytes.

fernandoFernandeSantos, May 13 '22 09:05

@Yaooooo Hello, how large a model can the AIdeck run? I trained a 3 MB model that runs on GVSOC but fails on the AIdeck: it gets stuck at the cluster run. Is the model too large? I think the hyperflash is larger than 3 MB.

aqqz, Jun 01 '22 11:06