
Multiple processes of training

Open beyza-yildirim opened this issue 4 years ago • 3 comments

Hello @yjh0410,

Thanks for sharing your work! I am training the yolov2 darknet19 model on my machine, and I have noticed two problems:

  1. After some time, training stalls even though the process is still running. I only noticed this after redirecting stdout to a log file, and I had to kill and resume training 5 times within 120 epochs to keep it going.
  2. When the training PID is killed, the GPU memory is not released. When I run `sudo fuser -v /dev/nvidia*`, I see 8 additional processes created alongside the parent process shown in nvtop/nvidia-smi. Do you know why 9 instances of the same process are created when the code isn't multithreaded? (screenshot attached)

I'm running on ubuntu 18.04 lts, python 3.7.6, torch 1.7.0 and using GeForce GTX 1080 Ti.

beyza-yildirim avatar Jul 22 '21 08:07 beyza-yildirim

Hi~ Thanks for your support.

As for your issue, I think it is caused by the dataloader, which uses multiple workers (the default num_workers is 8) to preprocess the input data. In addition, I use cv2 to process images (cv2.imread, cv2.resize, and so on), and cv2 also uses multiple threads by default.
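
For reference, here is a minimal sketch of where those extra processes and threads come from; the dataset class and values below are placeholders, not the repository's actual code:

```python
import cv2
import torch
from torch.utils.data import DataLoader, Dataset

# cv2 keeps its own thread pool by default; cap it while debugging hangs.
cv2.setNumThreads(0)

class DummyDataset(Dataset):
    """Stand-in for the VOC/COCO dataset the training script actually builds."""
    def __len__(self):
        return 64
    def __getitem__(self, idx):
        return torch.zeros(3, 416, 416), torch.zeros(1, 5)

# Each DataLoader worker is a separate process, so num_workers=8 plus the
# main process accounts for the 9 PIDs holding /dev/nvidia* handles.
loader = DataLoader(
    DummyDataset(),
    batch_size=16,
    shuffle=True,
    num_workers=0,   # try 0 (or a small value) to rule out worker-related hangs
    pin_memory=True,
)
```

With `num_workers=0` all preprocessing happens in the main process, which makes stalls easier to debug, at the cost of slower data loading.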

I also encounter this problem sometimes, but I'm not sure how to fix it.

yjh0410 avatar Jul 22 '21 08:07 yjh0410

Thank you for the response. Yes, I forgot that 8 workers would spawn 8 extra processes to preprocess the input data. I have another question: what is the difference between Slim YOLOv2 and Tiny YOLOv2? I know that pjreddie's implementation uses the Darknet Reference Model as the backbone while you use your Darknet Tiny; can you elaborate on the main differences?

A few suggestions:

  1. You could make training even faster by launching training for epoch n+1 while the test/eval for epoch n is still running on the CPU, so that your GPU(s) stay at 100% utilization.
  2. You could also look into pipelining with PyTorch 1.9.0 to speed things up further.
  3. Another useful addition would be a DataParallel fallback for multi-GPU training when multiprocessing via torch.distributed.launch fails: `if args.cuda and torch.cuda.device_count() > 1: model = nn.DataParallel(model)` (see the sketch below).
  4. I noticed a small typo at the end of the README: the command lines should use `--trained_model` instead of `--train_model`. Also, the Google Drive folder is access-restricted.
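
A minimal sketch of the DataParallel fallback mentioned in point 3; the model and argument parsing below are placeholders, not the repository's actual code:

```python
import argparse
import torch
import torch.nn as nn

# Placeholder model; the real script builds the YOLOv2/YOLOv3 network here.
model = nn.Conv2d(3, 16, kernel_size=3)

parser = argparse.ArgumentParser()
parser.add_argument('--cuda', action='store_true')
args = parser.parse_args([])  # empty list so this sketch runs without CLI flags

device = torch.device('cuda' if args.cuda and torch.cuda.is_available() else 'cpu')
model = model.to(device)

# Single-process multi-GPU fallback when torch.distributed.launch is not used.
if args.cuda and torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
```

DataParallel replicates the model across GPUs within a single process, so it avoids the extra launcher processes that torch.distributed.launch spawns, at the cost of some scaling efficiency.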

beyza-yildirim avatar Jul 23 '21 13:07 beyza-yildirim

@beyza-yildirim Thanks for your advice, I will update this project following your suggestions.

yjh0410 avatar Aug 06 '21 07:08 yjh0410