Training gets stuck at line 74: 'log_dict_train, _ = trainer.train(epoch, train_loader)'

Open · mertmerci opened this issue 3 years ago · 4 comments

Hello,

I am trying to run main.py for training, but training gets stuck at line 74, log_dict_train, _ = trainer.train(epoch, train_loader). The strange thing is that when I check GPU utilization, the GPUs are still at 100%. Also, when I debug, I can see that the losses are being calculated, but they are neither printed to the console nor written to the logger file.

I am using CUDA 11.2; maybe that is the problem, but I do not think so. Do you have any ideas or suggestions for solving this issue?
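In case it helps, this is roughly how I am trying to see where the process is blocked; a minimal sketch using Python's built-in faulthandler (nothing here is specific to this repo):

```python
# Minimal hang-diagnosis sketch (plain Python, nothing RTM3D-specific):
# periodically dump every thread's stack so the blocking call is visible.
import sys
import faulthandler

faulthandler.dump_traceback_later(60, repeat=True, file=sys.stderr)

# ...then run the training loop as usual, e.g.
# log_dict_train, _ = trainer.train(epoch, train_loader)
```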

Thank you in advance.

mertmerci avatar Mar 09 '21 11:03 mertmerci

We are also using 11.2, so that is not the issue. If you send the error, I could possibly be of some help.

sparro12 avatar Mar 15 '21 17:03 sparro12

I do not get a runtime error. However, I cannot see the losses or even the epoch progress. I attached the output of main.py below. I inserted some print lines for debugging; as can be seen, the output stops after line 74 without any losses or further epochs.
[Screenshot 2021-03-15 at 23 40 32]
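One thing I still want to rule out is a DataLoader worker deadlock, so I plan to force single-process loading. A rough sketch of the idea (train_dataset and opt.batch_size are placeholders for whatever main.py already builds, and the repo's own option names may differ):

```python
from torch.utils.data import DataLoader

# Rebuild the training loader with no worker processes. If training then
# progresses past line 74, the hang is in multi-process data loading rather
# than in the model or the loss computation.
train_loader = DataLoader(
    train_dataset,              # placeholder for the dataset main.py constructs
    batch_size=opt.batch_size,  # placeholder for the configured batch size
    shuffle=True,
    num_workers=0,              # single-process loading, just for debugging
    pin_memory=True,
)
```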

mertmerci avatar Mar 15 '21 22:03 mertmerci

Before going down the rabbit hole, my best guess would be that there is a problem with the torch version. I would reinstall torch in that repo's environment. Better yet, since you're not too far into the setup, I would re-clone the repo and make sure you select the correct torch version when you set it up. Maybe even redo the Conda environment.
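Before that, a quick sanity check that the installed torch build actually matches your CUDA install can save time; a small sketch using only standard PyTorch calls:

```python
import torch

# Report the torch build and its bundled CUDA version, then run a tiny GPU
# op; if even this hangs or fails, the problem is below the KM3D/RTM3D code.
print(torch.__version__, torch.version.cuda, torch.cuda.is_available())
x = torch.randn(1024, 1024, device="cuda")
y = x @ x
torch.cuda.synchronize()
print(y.sum().item())
```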

One thing that is noteworthy: DCNv2 failed when we ran it, so we had to re-clone the DCNv2 repo from the link provided in install.md. The DCNv2 that comes precompiled when cloning KM3D uses CUDA 8.0. If you want to run 11.2, you'll need to re-clone just the DCNv2 part, keep it in the same location as the old one, and then continue with the rest of the instructions.
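After recompiling, something like the following can confirm the extension actually runs on your CUDA install. This sketch assumes the CharlesShang-style DCNv2 layout linked in install.md, where the compiled module is importable as dcn_v2 (adjust the import if your copy differs):

```python
import torch
from dcn_v2 import DCN  # assumed import path for the recompiled extension

# Build a small deformable conv layer and run one forward pass on the GPU;
# a compile/architecture mismatch usually shows up here as an import error
# or a CUDA kernel failure rather than later during training.
layer = DCN(64, 64, kernel_size=3, stride=1, padding=1).cuda()
out = layer(torch.randn(1, 64, 32, 32, device="cuda"))
print(out.shape)  # expected: torch.Size([1, 64, 32, 32])
```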

sparro12 avatar Mar 16 '21 20:03 sparro12

Thank you for your kind response. I do not think DCNv2 contributes to the problem, because I am trying to run the training without the models that use DCNv2, just the basic resnet-18 or dla-34.

I am using Ubuntu 18, so maybe that is causing a problem. Ubuntu 16 is now installed on the machine; I will try running it there and post the result here.

mertmerci avatar Mar 28 '21 15:03 mertmerci