TrainYourOwnYOLO
Multiple streams in multiple GPUs and in multiple processes finally working
Hi, @AntonMu:
After many long nights, I have finally achieved true YOLO multiprocessing using one, two, many GPUs for one, two, many completely separate YOLO instances/processes at high speed.
- By subdividing the GPU memory of one card, I can process two (and possibly more) video streams on one GPU in completely independent processes, each with its own separate model if necessary. On a 1080ti working on two streams of 1020x780 color video, I get a YOLO rate of 14-15 fps per stream, or a total of around 30 frames per second run through YOLO.
- I can run the two independent streams on two independent GPUs, each working for its own process, one per stream. This gives me 22-23 fps per stream on a 1080ti and 11-12 fps on a 1060, both running in the same machine at the same time. With two 1080ti in the machine, I get 22-23 fps on both streams, or 44-46 fps total. The cameras I'm using can't do better than 22-23 fps anyway; it's possible that it will go a few frames higher. A round trip through YOLO is 0.036 sec, which comes out to a theoretical max rate of 28 fps, minus other overhead. A 2080ti should achieve more than 25 fps.
- Combinations of the above are very much possible. Four streams at around 14-15 fps each should be doable (with a big A/C; each card draws around 200 watts...)
- This has been achieved with minimal changes to yolo.py, mainly in the initialization of the YOLO class. The main routine is a major rewrite, because it is video-centric, and because I am working with config files, one for each stream and process.
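The fractional-GPU part of the above relies on capping how much GPU memory each process may grab before loading its model. As a rough sketch of how that can be done in TensorFlow 2.x (this is not the actual yolo.py change; the 1024 MB figure is an assumption to tune per model, and older TF 2.x releases use the `tf.config.experimental.set_virtual_device_configuration` alias instead):

```python
import tensorflow as tf

# Sketch: cap the GPU memory visible to this process so several
# independent YOLO processes can share one physical card.
# 1024 MB is an assumed figure -- tune it to your model.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=1024)],
    )
```

Run once per process, before the model is built; TensorFlow otherwise claims the whole card for the first process that starts.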
I would like to donate this to the cause. How would you want me to go about it? I will probably need a week or two for cleaning up the code anyway. Please let me know.
Hi @bertelschmitt ,
That is great to hear. I'm definitely happy to have you contribute your findings to this repo.
You should write it up and then make a pull request to this repo.
Thank you!
Will do. I'm still testing. I managed to squeeze up to 13 independent YOLO processes at a time into one 1080ti. I'm also coming across some wrinkles: for instance, the 2nd GPU not working at all in an older motherboard, despite being recognized by nvidia-smi. Works great in a new mobo. Also, antique GPUs (tested with GTX 670 and 760) seem to be out.
Will come back when matters have stabilized.
B
Update: Core functionality (i.e. changes to yolo.py to accommodate multiple and fractional GPUs) appears to be solid (as much as I can claim that from testing on two machines.) Working on a sample app using the new interface to capture, classify, display and store multiple streams using multiple and/or fractional GPUs and Python multiprocessing.
@AntonMu: Which branch do you want me to use as the basis for the changes? Currently using a master from early 2020. I plan to annotate my changes with a tag like #BS071220. Is that OK, or do you want something else?
@bertelschmitt
can you please share the code?
@shahzaibraza37, I will. Give me a few days for cleanup.
pls do share.
@johnjaiharjose: Soon. Am in the middle of cleanup. Here is an appetizer:
Woah, truly amazing. Waiting badly for your code.
I see you have 2 models for each camera view. A few questions popped up in my mind. Hope you don't mind.
- Are you using 4 GPUs and 2 models?
- Is it possible to run 2 models (different camera views, like in your case) on a single GPU after memory limiting (e.g. 2 GB per model, at the same time, inference only)?
- Please briefly explain your data pipeline in the cat application.
My implementation will allow you to
- run as many INDEPENDENT processes per GPU as your GPU and system memory allow
- run on as many GPUs as you can fit into your machine
- use anything from one model shared by all processes to as many separate models as you have processes
For me, a process takes a little less than 1 GByte of GPU memory. An 11 GByte 1080ti can accommodate up to 11 processes. (The example runs its 4 processes easily on one GPU, even a cheaper one.)
Two 1080ti: 22 processes.
Each process now demands around 2.5 GByte of main memory (TensorFlow 2.X is a bit more memory-greedy than 1.X). 22 x 2.5 = 55 GByte... You'd better have the memory to go along with the GPUs.
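The capacity arithmetic above can be sanity-checked in a couple of lines. The per-process figures are the ones quoted (observed on one setup, not guarantees):

```python
# Rough capacity check for N independent YOLO processes,
# using the per-process figures quoted above (observations, not guarantees).
GPU_GB_PER_PROC = 1.0    # GPU memory per process
RAM_GB_PER_PROC = 2.5    # main memory per process (TensorFlow 2.x)

def max_processes(gpu_mem_gb, sys_mem_gb):
    """How many processes fit, limited by whichever runs out first."""
    by_gpu = int(gpu_mem_gb // GPU_GB_PER_PROC)
    by_ram = int(sys_mem_gb // RAM_GB_PER_PROC)
    return min(by_gpu, by_ram)

# Two 11 GB 1080ti cards and 64 GB of system RAM:
print(max_processes(2 * 11, 64))  # -> 22
```

With only 16 GB of RAM, the same two cards would be RAM-limited to 6 processes, which is the point of the warning above.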
The pipeline for many processes is the same as for one:
On initialization, the main module reads and parses the config file, and turns it into as many config settings as there are processes. Main launches each process with its custom settings. Main launches a Master process that acts as a communication and management hub for all processes. Main then keeps checking that all processes are still alive, and otherwise does nothing. In each process, the flow is as follows:
- Frame is captured from the video source as specified (can be different for each process)
- Frame is run through the YOLO model for inference (model can be different for each process)
- Result is displayed in a window (different for each process)
- Result can optionally (and automatically on object detection) be saved to an individual video file
- Optionally, a file with per-frame metadata can be saved (different for each process). This file can later be used for very efficient training.
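The flow above can be sketched with Python's multiprocessing. Everything here (`stream_worker`, the config fields, the placeholder sources) is hypothetical illustration, not the actual MultiDetect.py code:

```python
import multiprocessing as mp

def stream_worker(cfg):
    """One independent capture/detect/display loop (placeholder logic)."""
    import cv2  # imported in the child so each process keeps its own state
    cap = cv2.VideoCapture(cfg["source"])    # per-process video source
    # ... load this process's own YOLO model here, with its GPU memory cap ...
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # boxes = yolo.detect_image(frame)   # per-process model inference
        cv2.imshow(cfg["window"], frame)     # per-process display window
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
    cap.release()

if __name__ == "__main__":
    # Placeholder sources; in practice these come from the config file.
    configs = [{"source": "cam0.mp4", "window": "cam0"},
               {"source": "cam1.mp4", "window": "cam1"}]
    procs = [mp.Process(target=stream_worker, args=(c,)) for c in configs]
    for p in procs:
        p.start()
    # Main does little besides making sure the workers are still alive.
    for p in procs:
        p.join()
```

The real application additionally runs a master process as a communication hub between workers, which this sketch omits.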
The project comes in two parts:
- A modified yolo.py - This has already been running for months 24/7 on multiple GPUs and is stable.
- MultiDetect.py, an application that makes use of the modified yolo.py. This is turning into a monster: it requires heavy inter-process communication, it uses a massive config file for endless customization, and it uses tkinter for status and user interaction. It is 99% done, but there is always something. I hope to be finished in a few days. I believe MultiDetect.py is necessary to provide a good "out-of-the-box experience", so to speak. I want the user to be able to fire it up and see something happening.
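To give a feel for per-stream configuration, one section of such a config file might look like the fragment below. The field names are invented for illustration only; the real MultiDetect.py config will differ:

```ini
[stream_1]
video_source     = rtsp://camera1.local/stream   ; capture source for this process
model_path       = Data/Model_Weights/trained_weights_final.h5
gpu_id           = 0                             ; which physical GPU to use
gpu_memory_mb    = 1024                          ; slice of the card granted to this process
record_on_detect = yes                           ; auto-save video when an object is detected
save_metadata    = yes                           ; write per-frame metadata for later training
```

One such section per stream gives each launched process its own source, model, and GPU slice.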
Here is an image of 18 separate processes running on two GPUs:
@bertelschmitt Thank you so much for your detailed and well explained answers! This clears so many things for me! 👍
Daijoubu ("no problem"), as we say in Japan
Downloading Pretrained Weights
Downloaded Pretrained Weights in 6.8 seconds
Detecting Cat Faces by calling:
python /content/TrainYourOwnYOLO/3_Inference/Detector.py --input_path /content/TrainYourOwnYOLO/Data/Source_Images/Test_Images --classes /content/TrainYourOwnYOLO/Data/Model_Weights/data_classes.txt --output /content/TrainYourOwnYOLO/Data/Source_Images/Test_Image_Detection_Results --yolo_model /content/TrainYourOwnYOLO/Data/Model_Weights/trained_weights_final.h5 --box_file /content/TrainYourOwnYOLO/Data/Source_Images/Test_Image_Detection_Results/Detection_Results.csv --anchors /content/TrainYourOwnYOLO/2_Training/src/keras_yolo3/model_data/yolo_anchors.txt --file_types .jpg .jpeg .png
2021-03-01 15:16:10.514250: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
Traceback (most recent call last):
File "/content/TrainYourOwnYOLO/3_Inference/Detector.py", line 21, in <module>
from keras_yolo3.yolo import YOLO, detect_video, detect_webcam
File "/content/TrainYourOwnYOLO/2_Training/src/keras_yolo3/yolo.py", line 18, in <module>
from keras.utils import multi_gpu_model
ImportError: cannot import name 'multi_gpu_model' from 'keras.utils' (/usr/local/lib/python3.7/dist-packages/keras/utils/__init__.py)
Detected Cat Faces in 2.1 seconds
This error comes up often in Colab itself. How do I solve it?
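`multi_gpu_model` was removed from Keras (it disappeared around TensorFlow/Keras 2.4), which is why current Colab images fail on that import. Until the import is dropped from yolo.py, one workaround is pinning an older TensorFlow/Keras pair before running the detector; treat the exact versions below as an assumption to adjust, not a tested recommendation:

```
# requirements pin (workaround sketch, versions are assumptions)
tensorflow==2.3.*
keras==2.3.*
```

Alternatively, since TF 2.x replaced `multi_gpu_model` with `tf.distribute.MirroredStrategy`, simply deleting the import from yolo.py also works if you don't need multi-GPU training.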
@bertelschmitt Sorry to bother you this late, but I'm curious to know whether you have hosted this challenging work anywhere for the public?
No, sorry. Just using it for myself. I have switched to Coral TPUs lately. Much cheaper, and they use way less power. Also, implemented in just a few lines of Python.