yolov2-yolov3_PyTorch
Multiple processes of training
Hello @yjh0410,
Thanks for sharing your work! I am training the YOLOv2 Darknet-19 model on my machine and have noticed two problems:
- After some time, training stalls even though the training process keeps running. I only noticed this after redirecting stdout to a log file, and I had to kill and resume training 5 times within 120 epochs to get through them.
- When the training PID is killed, the GPU memory is not released. When I run sudo fuser -v /dev/nvidia*, I see 8 extra processes alongside the parent process shown in nvtop/nvidia-smi. Do you know why 9 instances of the same process are created when the code isn't multithreaded?

I'm running Ubuntu 18.04 LTS, Python 3.7.6, and PyTorch 1.7.0 on a GeForce GTX 1080 Ti.
Hi~ Thanks for your support.
As for your issue, I think it is caused by the DataLoader, since it uses multiple worker processes (the default num_workers is 8) to preprocess the input data. In addition, I use cv2 to process images (cv2.imread, cv2.resize, and so on), and OpenCV also uses multiple threads by default.
I also encounter this problem sometimes, but I'm not sure how to fix it.
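If it helps, a common workaround for cv2-inside-DataLoader hangs (a sketch under that assumption, not code from this repo) is to disable OpenCV's internal thread pool and lower num_workers; DummyCV2Dataset below is a hypothetical stand-in for the repo's VOC/COCO dataset classes:

```python
import cv2
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset

cv2.setNumThreads(0)  # keep OpenCV single-threaded inside each worker


class DummyCV2Dataset(Dataset):
    """Hypothetical stand-in that resizes images with cv2, like the real loader."""

    def __len__(self):
        return 64

    def __getitem__(self, idx):
        img = np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8)
        img = cv2.resize(img, (416, 416))  # same kind of cv2 call as the repo
        return torch.from_numpy(img).permute(2, 0, 1).float() / 255.0


if __name__ == "__main__":
    loader = DataLoader(
        DummyCV2Dataset(),
        batch_size=8,
        shuffle=True,
        num_workers=4,   # fewer workers = fewer stray processes to clean up
        pin_memory=True,
    )
    for batch in loader:
        pass  # training step would go here
```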
Thank you for the response. Yes, I forgot that 8 workers would spawn 8 subprocesses to preprocess the input data, which explains the 9 processes. I have another question: what is the difference between Slim YOLOv2 and Tiny YOLOv2? I know that pjreddie's implementation uses the Darknet Reference Model as the backbone while you use your Darknet Tiny; can you elaborate on the main differences?
I believe you could make training even faster if you launch training for epoch n+1 while the test/eval for epoch n is still running on the CPU, so that your GPU(s) stay at 100% utilization (see the sketch below). You could also look into pipelining to speed things up further with PyTorch 1.9.0.
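To illustrate the idea, here is a minimal sketch of overlapping CPU-side evaluation with GPU training; train_one_epoch and evaluate are hypothetical stand-ins, not functions from this repository:

```python
import copy
import threading


def fit(model, num_epochs, train_one_epoch, evaluate):
    """Train epoch n+1 on the GPU while epoch n is evaluated on the CPU."""
    eval_thread = None
    for epoch in range(num_epochs):
        train_one_epoch(model, epoch)      # GPU-bound work
        if eval_thread is not None:
            eval_thread.join()             # don't let eval threads pile up
        # Evaluate a CPU snapshot so training can keep updating the live model.
        snapshot = copy.deepcopy(model).cpu().eval()
        eval_thread = threading.Thread(target=evaluate, args=(snapshot, epoch))
        eval_thread.start()
    if eval_thread is not None:
        eval_thread.join()
```

A thread (rather than a process) is enough here because PyTorch releases the GIL during tensor ops, so the CPU eval and the GPU training genuinely overlap.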
Another good addition could be supporting DP (nn.DataParallel) for multi-GPU training in case multiprocessing with torch.distributed.launch fails:
if args.cuda and torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
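One caveat worth noting with DataParallel: it wraps the model, so checkpoints should be saved from model.module.state_dict() rather than model.state_dict(), otherwise the saved keys carry a "module." prefix and won't load into an unwrapped model.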
I noticed a small typo at the end of the README: the command lines should use "--trained_model" instead of "--train_model". Also, the Google Drive folder is access-restricted.
@beyza-yildirim Thanks for your advice. I will update this project following your suggestions.