yolov2-yolov3_PyTorch
Multiple processes of training
Hello @yjh0410,
Thanks for sharing your work! I am training the YOLOv2 Darknet-19 model on my machine and have noticed two problems:
- After some time, training stalls even though the training process keeps running. I only noticed this after redirecting stdout to a log file, and I had to kill and resume training 5 times within 120 epochs to get through them.
- When the training PID is killed, the GPU memory is not released. When I run sudo fuser -v /dev/nvidia*, I see 8 extra processes alongside the parent process shown in nvtop/nvidia-smi. Do you know why 9 instances of the same process are created when the code isn't multithreaded?

I'm running Ubuntu 18.04 LTS, Python 3.7.6, and PyTorch 1.7.0 on a GeForce GTX 1080 Ti.
Hi~ Thanks for your support.
As for your issue, I think it is caused by the DataLoader, since it uses multiple worker processes (the default num_workers is 8) to preprocess the input data. In addition, I use cv2 to process images (cv2.imread, cv2.resize, and so on), and OpenCV also uses multiple threads by default.
I also encounter this problem sometimes, but I'm not sure how to fix it.
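If it helps, a common workaround for cv2-inside-DataLoader hangs (a sketch under that assumption, not code from this repo) is to disable OpenCV's internal thread pool and lower num_workers; DummyCV2Dataset below is a hypothetical stand-in for the repo's VOC/COCO dataset classes:

```python
import cv2
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset

cv2.setNumThreads(0)  # keep OpenCV single-threaded inside each worker


class DummyCV2Dataset(Dataset):
    """Hypothetical stand-in that resizes images with cv2, like the real loader."""

    def __len__(self):
        return 64

    def __getitem__(self, idx):
        img = np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8)
        img = cv2.resize(img, (416, 416))  # same kind of cv2 call as the repo
        return torch.from_numpy(img).permute(2, 0, 1).float() / 255.0


if __name__ == "__main__":
    loader = DataLoader(
        DummyCV2Dataset(),
        batch_size=8,
        shuffle=True,
        num_workers=4,   # fewer workers = fewer stray processes to clean up
        pin_memory=True,
    )
    for batch in loader:
        pass  # training step would go here
```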
Thank you for the response. Yes, I forgot that 8 workers would spawn 8 subprocesses to preprocess the input data, which explains the 9 processes. I have another question: what is the difference between Slim YOLOv2 and Tiny YOLOv2? I know that pjreddie's implementation uses the Darknet Reference Model as the backbone while you use your Darknet Tiny; can you elaborate on the main differences?
I believe you could make training even faster if you launch training for epoch n+1 while the test/eval for epoch n is still running on the CPU, so that your GPU(s) stay at 100% utilization (see the sketch below). You could also look into pipelining to speed things up further with PyTorch 1.9.0.
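To illustrate the idea, here is a minimal sketch of overlapping CPU-side evaluation with GPU training; train_one_epoch and evaluate are hypothetical stand-ins, not functions from this repository:

```python
import copy
import threading


def fit(model, num_epochs, train_one_epoch, evaluate):
    """Train epoch n+1 on the GPU while epoch n is evaluated on the CPU."""
    eval_thread = None
    for epoch in range(num_epochs):
        train_one_epoch(model, epoch)      # GPU-bound work
        if eval_thread is not None:
            eval_thread.join()             # don't let eval threads pile up
        # Evaluate a CPU snapshot so training can keep updating the live model.
        snapshot = copy.deepcopy(model).cpu().eval()
        eval_thread = threading.Thread(target=evaluate, args=(snapshot, epoch))
        eval_thread.start()
    if eval_thread is not None:
        eval_thread.join()
```

A thread (rather than a process) is enough here because PyTorch releases the GIL during tensor ops, so the CPU eval and the GPU training genuinely overlap.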
Another good addition could be supporting DP (nn.DataParallel) for multi-GPU training in case multiprocessing with torch.distributed.launch fails:
if args.cuda and torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
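One caveat worth noting with DataParallel: it wraps the model, so checkpoints should be saved from model.module.state_dict() rather than model.state_dict(), otherwise the saved keys carry a "module." prefix and won't load into an unwrapped model.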
I noticed a small typo at the end of the README: the command lines should use "--trained_model" instead of "--train_model". Also, the Google Drive folder is access-restricted.
@beyza-yildirim Thanks for your advice. I will update this project following your suggestions.