HRNet-Semantic-Segmentation

Training freezes randomly with 100% GPU usage but no error

Open zhbb1989 opened this issue 4 years ago • 5 comments

Thanks for the great work! When I train HRNet+OCR on my own data, training randomly hangs: the process freezes and never recovers on its own. I have to press Ctrl+C to stop it, kill the leftover threads manually, and then resume training. I do manage to finish training and the results are good, but the problem is quite annoying because it wastes a lot of time overnight and on weekends. Does anyone have a clue? My PyTorch version is 1.1.0, and the problem occurs with both CUDA 9 and CUDA 10. I train on 8 1080 Ti GPUs, and switching to different GPUs didn't help. BTW, I added other models to this code, such as U-Net and ResNet+head, and never hit this problem when training them.
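One way to narrow down a silent hang like this is to have Python dump every thread's stack when a training step takes far too long. The following is only a sketch using the standard-library `faulthandler` module. `run_with_watchdog`, `step_fn`, and the 600-second timeout are hypothetical names and values chosen for illustration; they are not part of the HRNet code.

```python
import faulthandler
import sys

# Assumption: no single training iteration should take longer than this.
HANG_TIMEOUT_S = 600


def run_with_watchdog(step_fn, num_steps, timeout_s=HANG_TIMEOUT_S, log_file=None):
    """Call step_fn(step) num_steps times.

    If any single call runs longer than timeout_s seconds, the tracebacks
    of all Python threads are written to log_file (stderr by default).
    The process is not killed; the dump only shows where it is stuck.
    """
    if log_file is None:
        log_file = sys.stderr
    completed = 0
    try:
        for step in range(num_steps):
            # Re-arm the timer for this step; a new call replaces the old timer.
            faulthandler.dump_traceback_later(timeout_s, exit=False, file=log_file)
            step_fn(step)  # hypothetical stand-in for one training iteration
            completed += 1
    finally:
        # Always disarm the timer, even if step_fn raised.
        faulthandler.cancel_dump_traceback_later()
    return completed
```

The dump only contains Python frames, so a hang inside a CUDA or NCCL call would show up as the Python line that issued the call. That is usually enough to tell whether the process is stuck in the data loader, in the forward/backward pass, or in a distributed collective.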

zhbb1989 avatar Jun 09 '20 03:06 zhbb1989

And my U-Net freezes too... so it might be a distributed training problem.

zhbb1989 avatar Jun 10 '20 01:06 zhbb1989

@zhbb1989 Did you find the solution? I am having the same issue.

SHMCU avatar Sep 11 '20 17:09 SHMCU

@zhbb1989 Did you find the solution? I am having the same issue.

not yet... trying to live with it...

zhbb1989 avatar Sep 14 '20 08:09 zhbb1989

I have run into the same problem. Have you solved it?

raozhongyu avatar May 19 '21 07:05 raozhongyu

I don't remember clearly, but it seems to be a problem with the combination of cuDNN version, CUDA driver version, and PyTorch version. I used PyTorch 1.3 with CUDA 10.2, 10.1, or 10.0. You can try these.

SHMCU avatar May 19 '21 16:05 SHMCU
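Since the last reply points at the combination of cuDNN, CUDA, and PyTorch versions, it may help to record exactly what each machine is running before and after changing versions. A minimal sketch follows. `collect_versions` is a hypothetical helper name, and the code falls back gracefully when PyTorch is not installed.

```python
import platform


def collect_versions():
    """Return the Python version and, if PyTorch is importable,
    the PyTorch, CUDA (build), and cuDNN versions plus the GPU count."""
    info = {"python": platform.python_version()}
    try:
        import torch
    except ImportError:
        info["torch"] = None  # PyTorch is not installed in this environment
        return info
    info["torch"] = torch.__version__
    # CUDA version PyTorch was built against (None for CPU-only builds).
    info["cuda_build"] = torch.version.cuda
    # cuDNN version as an integer, or None if cuDNN is unavailable.
    info["cudnn"] = torch.backends.cudnn.version()
    # Number of GPUs visible to this process (0 without CUDA).
    info["gpu_count"] = torch.cuda.device_count()
    return info


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```

Posting this output alongside a hang report makes it easier to compare a freezing setup with one that works. Note that it reports the CUDA version PyTorch was built with, not the installed driver version, which `nvidia-smi` shows separately.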