PIDNet
Training hangs in the middle
Hi,
I've been having this issue on multiple machines: when I train a model on a custom dataset, training hangs partway through without any error. It simply stops; the GPU temperature drops and the epoch/iteration counter no longer advances.
Sometimes it happens after 20 epochs; once it reached epoch 150 before stopping. Has anyone seen something similar? I suspect it might be related to my CUDA/PyTorch versions. Which versions would you recommend?
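Since the hang produces no error, one way to see where the process is stuck would be a watchdog that dumps the Python stack of every thread when an iteration takes too long. The following is a minimal sketch using only the standard-library `faulthandler` module; it is not PIDNet code. The names `arm_watchdog`, `train_with_watchdog`, and `step_fn` are hypothetical, and `step_fn` stands in for whatever runs one training iteration.

```python
import faulthandler
import sys
import time


def arm_watchdog(timeout_s, file=sys.stderr):
    """Schedule a dump of all thread stack traces after timeout_s seconds."""
    # Calling dump_traceback_later again replaces the pending timer,
    # so re-arming once per iteration works as a heartbeat.
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=file)


def train_with_watchdog(step_fn, num_iters, timeout_s=600.0, file=sys.stderr):
    """Run step_fn(i) num_iters times, re-arming the watchdog before each step.

    step_fn is a hypothetical placeholder for one training iteration.
    Returns the number of iterations that completed.
    """
    completed = 0
    try:
        for i in range(num_iters):
            arm_watchdog(timeout_s, file=file)
            step_fn(i)
            completed += 1
    finally:
        # Disarm the timer so nothing is dumped after the loop ends.
        faulthandler.cancel_dump_traceback_later()
    return completed


if __name__ == "__main__":
    # Stand-in for a real training step (assumption for illustration only).
    def fake_step(i):
        time.sleep(0.01)

    print(train_with_watchdog(fake_step, num_iters=5, timeout_s=5.0))
```

If an iteration exceeds `timeout_s`, the dump is written to `file` while the process keeps running, so the traceback should show whether the main thread is waiting on data loading, a CUDA call, or something else. The timeout needs to be comfortably longer than the slowest normal iteration (including validation and checkpointing) to avoid false alarms.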
Thanks!