TensorFlow-Lite-Object-Detection-on-Android-and-Raspberry-Pi
Why does the Coral USB Accelerator require more time than the CPU of the RasPi4 when analyzing single images with TFLite_detection_image.py?
When using the Coral USB Accelerator on a RasPi 4 (4 GB) with Raspbian (Debian 10.11), the performance of TFLite_detection_webcam.py for analyzing webcam video is much better (12-20 FPS) than without the Coral USB Accelerator (3-4 FPS). This is the expected result.
But when I use TFLite_detection_image.py for analyzing single images, it is faster without the Coral USB Accelerator.
Is this a normal observation? What causes the performance loss?
I modified TFLite_detection_image.py a bit so that it just writes the output into a file and does not create a window:
https://github.com/christianbaun/pestdetector/blob/main/TFLite_detection_image_modified.py
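The essence of the modification is just to replace the window output with a cv2.imwrite call, roughly like this (the output path here is only an example):

# Instead of showing the annotated image in a window ...
# cv2.imshow('Object detector', image)
# cv2.waitKey(0)
# ... the modified script writes the annotated image to a file (example path):
cv2.imwrite('/home/pi/results/output.jpg', image)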
When I measure the time required to analyze an image with the command-line tool time, the real time with the Coral USB Accelerator is longer than without it.
Without the Coral USB Accelerator:
$ time python3 TFLite_detection_image_modified.py \
--modeldir=/home/pi/model_2021_07_08 \
--graph=detect.tflite \
--labels=/home/pi/model_2021_07_08/labelmap.txt \
--image=testimage.jpg
real 0m1,174s
user 0m1,236s
sys 0m0,754s
With the Coral USB Accelerator:
$ time python3 TFLite_detection_image_modified.py \
--modeldir=/home/pi/model_2021_07_08 \
--graph=detect_edgetpu.tflite \
--labels=/home/pi/model_2021_07_08/labelmap.txt \
--edgetpu \
--image=testimage.jpg
real 0m3,831s
user 0m1,118s
sys 0m0,729s
I also tried a loop over 170 images; the result was a real time of less than 2 minutes without the Coral USB Accelerator compared with more than 10 minutes when using the Coral USB Accelerator.
What causes this? Why does the Coral USB Accelerator influence the performance negatively when analyzing single images?
Is there any chance to improve the situation and get some benefit from the Coral USB Accelerator for this (non-video) purpose?
$ uname -a
Linux raspberrypi 5.10.63-v7l+ #1496 SMP Wed Dec 1 15:58:56 GMT 2021 armv7l GNU/Linux
The performance of the USB port is not the root cause of the issue.
$ lsusb -t
/: Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/4p, 5000M
|__ Port 2: Dev 4, If 0, Class=Application Specific Interface, Driver=, 5000M
/: Bus 01.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/1p, 480M
|__ Port 1: Dev 2, If 0, Class=Hub, Driver=hub/4p, 480M
The Coral USB Accelerator is connected to one of the USB 3.0 ports of the RasPi4.
Maybe an explanation for the bad performance can be found in the Google Coral examples:
https://github.com/google-coral/pycoral/blob/master/examples/classify_image.py
"Note: The first inference on Edge TPU is slow because it includes loading the model into Edge TPU memory."
This makes sense: when the Coral USB Accelerator is used on several image files (one script invocation per image), the model is loaded into the Edge TPU memory for every single image file. This overhead occurs only once when a video file or stream is processed.
If this is the root cause of the bad performance, I see only two possible solutions:
- Copy the model into the Edge TPU memory in advance and avoid loading it for every image (is this possible?), or
- Provide a stream instead of single images.
If all these assumptions are correct, is using the Coral USB Accelerator useful at all for analyzing single images?
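To illustrate the first option, here is a minimal sketch of what I have in mind (assuming the tflite_runtime API with the Edge TPU delegate; the paths and the output ordering of the detection model are only assumptions): load the model into the Edge TPU memory once, then run the inference for many images in the same process.

#!/usr/bin/python3
# Minimal sketch: create the interpreter (and load the model into the
# Edge TPU memory) once, then reuse it for every image.
import glob
import numpy as np
from PIL import Image
from tflite_runtime.interpreter import Interpreter, load_delegate

MODEL = '/home/pi/model_2021_07_08/detect_edgetpu.tflite'

interpreter = Interpreter(
    model_path=MODEL,
    experimental_delegates=[load_delegate('libedgetpu.so.1.0')])
interpreter.allocate_tensors()   # the model is transferred to the TPU here, once

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
height, width = input_details[0]['shape'][1:3]

for path in sorted(glob.glob('/home/pi/images/*.jpg')):
    image = Image.open(path).convert('RGB').resize((int(width), int(height)))
    input_data = np.expand_dims(np.asarray(image, dtype=np.uint8), axis=0)
    interpreter.set_tensor(input_details[0]['index'], input_data)
    interpreter.invoke()         # only the inference itself runs per image
    # For SSD-style detection models, output 0 is usually the bounding boxes;
    # the exact output ordering depends on the model.
    boxes = interpreter.get_tensor(output_details[0]['index'])[0]
    print(path, len(boxes))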
Thanks for investigating this! I'm not surprised that it takes more time to run a single image, but it should be faster when you run multiple images. How did you test the 170 images? You can use the --imagedir argument to point the script at a folder full of images (e.g. --imagedir=testimages), and it will loop through all of them without needing to modify the code.
Can you try putting the 170 images in a folder, running the TFLite_detection_image.py script, and showing me the resulting execution time with and without the USB Accelerator?
I tested the 170 images with the Coral USB Accelerator this way:
#!/bin/bash
NUMBER_OF_RUNS=0
for datei in $(find "/home/pi/images/" -type f | egrep -i "\.jpg|\.jpeg")
do
NUMBER_OF_RUNS=$(echo "${NUMBER_OF_RUNS} + 1" | bc)
echo "run ${NUMBER_OF_RUNS}"
python3 ../TFLite_detection_image_modified.py \
--modeldir=/home/pi/model_2021_07_08 \
--graph=detect_edgetpu.tflite \
--labels=/home/pi/model_2021_07_08/labelmap.txt \
--edgetpu \
--image="${datei}"
done
$ time ./performance_test_coral_tpu
...
real 10m7,781s
user 2m39,826s
sys 1m53,368s
And I tested the 170 images without the Coral USB Accelerator this way:
#!/bin/bash
NUMBER_OF_RUNS=0
for datei in $(find "/home/pi/images/" -type f | egrep -i "\.jpg|\.jpeg")
do
NUMBER_OF_RUNS=$(echo "${NUMBER_OF_RUNS} + 1" | bc)
echo "run ${NUMBER_OF_RUNS}"
python3 ../TFLite_detection_image_modified.py \
--modeldir=/home/pi/model_2021_07_08 \
--graph=detect.tflite \
--labels=/home/pi/model_2021_07_08/labelmap.txt \
--image="${datei}"
done
$ time ./performance_test
...
real 1m58,900s
user 2m32,782s
sys 1m41,334s
Sadly, the modification I made to your code to write the image to a file (instead of opening a window) returns a nasty error message when I use the --imagedir argument instead of --image. And sadly, I am not smart enough to fix this.
$ python3 TFLite_detection_image_modified.py --modeldir=/home/pi/model_2021_07_08 --graph=detect_edgetpu.tflite --labels=/home/pi/model_2021_07_08/labelmap.txt --edgetpu --imagedir=/home/pi/images/
/home/pi/model_2021_07_08/detect_edgetpu.tflite
Traceback (most recent call last):
File "TFLite_detection_image_modified.py", line 223, in <module>
cv2.imwrite(filename, image)
cv2.error: OpenCV(4.5.4) /tmp/pip-wheel-2c57qphc/opencv-python_86774b87799240fbaa4c11c089d08cc3/opencv/modules/imgcodecs/src/loadsave.cpp:728: error: (-2:Unspecified error) could not find a writer for the specified extension in function 'imwrite_'
Your code works perfectly, but I cannot measure the time because a window is created.
I succeeded in fixing my modified version of your script so that the --imagedir argument works again. I did test runs with the same folder of images I tested 3-4 weeks ago, and the acceleration effect of the Coral USB Accelerator is now clearly visible.
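The OpenCV error above simply means that cv2.imwrite received a filename without a recognized image extension in the --imagedir code path. A fix along these lines is sufficient (assuming, as in the original script, that image_path holds the current input file and image the annotated frame; the output directory is only an example):

import os

# Build an output filename that keeps the original image extension, so
# cv2.imwrite can select an encoder (e.g. for .jpg or .png).
output_dir = '/home/pi/results'                      # example path
os.makedirs(output_dir, exist_ok=True)
output_path = os.path.join(output_dir, os.path.basename(image_path))
if not cv2.imwrite(output_path, image):
    print('Could not write ' + output_path)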
Without the Coral USB Accelerator:
$ time python3 TFLite_detection_image_modified.py --modeldir=/home/pi/model_2021_07_08/ --graph=detect.tflite --labels=/home/pi/model_2021_07_08/labelmap.txt --imagedir=/home/pi/images/
...
real 0m58,304s
user 0m54,717s
sys 0m5,715s
With the Coral USB Accelerator:
$ time python3 TFLite_detection_image_modified.py --modeldir=/home/pi/model_2021_07_08/ --graph=detect_edgetpu.tflite --labels=/home/pi/model_2021_07_08/labelmap.txt --edgetpu --imagedir=/home/pi/images/
...
real 0m20,073s
user 0m10,619s
sys 0m5,698s
When I compare this with the measurements I did 3-4 weeks ago, it is obvious that working with folders of images performs much better than handling single images.
The results of using the Coral USB Accelerator in directory mode are approx. 6 times better than using just the CPU in single-image mode, and approx. 30 times better than using the Coral USB Accelerator in single-image mode.
It is also interesting that using just the CPU in directory mode is approx. twice as fast as using just the CPU in single-image mode. This is a strong acceleration. I had not expected such a strong effect; it is probably caused by the overhead of starting and stopping the Python interpreter for every image and by the additional context switches (process switching).
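As a rough cross-check of these factors, the per-image real times can be derived from the four measurements above (170 images per run):

# Back-of-the-envelope per-image real times from the measurements above
# (each run processed 170 images); totals converted to seconds.
measurements = {
    'single-image mode, CPU only':  1 * 60 + 58.9,  # real 1m58,900s
    'single-image mode, Coral':    10 * 60 + 7.8,   # real 10m7,781s
    'directory mode, CPU only':    58.3,            # real 0m58,304s
    'directory mode, Coral':       20.1,            # real 0m20,073s
}
for name, total_seconds in measurements.items():
    print(f'{name}: {total_seconds / 170:.2f} s per image')
# Roughly 0.70 s, 3.57 s, 0.34 s and 0.12 s per image, which gives the
# factors of about 6x, 30x and 2x mentioned above.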
It is sad that I probably cannot accelerate the workflow for single images with the Coral USB Accelerator, because in my opinion this is the most flexible use case.
I am still digging for a solution here, but up to now I have not found one. If anyone here has an idea, I would appreciate a reply here or a message.