Collaborative-Learning-for-Weakly-Supervised-Object-Detection

Memory leaks during training!

Open Zhang-HM opened this issue 5 years ago • 2 comments

```
cross_entropy: 0.001210 lr: 0.001000 speed: 2.425s / iter
/gruntdata/disk2/hm/CLWSOD/tools/../lib/nets/network.py:569: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
  cross_entropy, total_loss = self._losses['wsddn_loss'].data[0],
/gruntdata/disk2/hm/CLWSOD/tools/../lib/nets/network.py:570: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
  self._losses['total_loss'].data[0]
iter: 2 / 200000, total loss: 1.993461 cross_entropy: 0.000035 lr: 0.001000 speed: 1.868s / iter
iter: 3 / 200000, total loss: 0.652638 cross_entropy: 0.016665 lr: 0.001000 speed: 1.693s / iter
iter: 4 / 200000, total loss: 0.393473 cross_entropy: 0.001328 lr: 0.001000 speed: 1.708s / iter
iter: 5 / 200000, total loss: 0.351719 cross_entropy: 0.000444 lr: 0.001000 speed: 1.620s / iter
iter: 6 / 200000, total loss: 0.543326 cross_entropy: 0.050670 lr: 0.001000 speed: 1.600s / iter
iter: 7 / 200000, total loss: 0.261543 cross_entropy: 0.000296 lr: 0.001000 speed: 1.564s / iter
iter: 8 / 200000, total loss: 0.783304 cross_entropy: 0.024824 lr: 0.001000 speed: 1.529s / iter
iter: 9 / 200000, total loss: 0.537496 cross_entropy: 0.011513 lr: 0.001000 speed: 1.510s / iter
iter: 10 / 200000, total loss: 0.964071 cross_entropy: 0.010650 lr: 0.001000 speed: 1.489s / iter
iter: 11 / 200000, total loss: 0.296966 cross_entropy: 0.020692 lr: 0.001000 speed: 1.472s / iter
iter: 12 / 200000, total loss: 0.546587 cross_entropy: 0.044390 lr: 0.001000 speed: 1.480s / iter
iter: 13 / 200000, total loss: 0.693391 cross_entropy: 0.001768 lr: 0.001000 speed: 1.479s / iter
iter: 14 / 200000, total loss: 0.190509 cross_entropy: 0.051802 lr: 0.001000 speed: 1.474s / iter
iter: 15 / 200000, total loss: 0.302866 cross_entropy: 0.053017 lr: 0.001000 speed: 1.476s / iter
iter: 16 / 200000, total loss: 0.468978 cross_entropy: 0.000957 lr: 0.001000 speed: 1.456s / iter
iter: 17 / 200000, total loss: 0.609222 cross_entropy: 0.007434 lr: 0.001000 speed: 1.457s / iter
iter: 18 / 200000, total loss: 0.089435 cross_entropy: 0.003355 lr: 0.001000 speed: 1.458s / iter
iter: 19 / 200000, total loss: 0.506788 cross_entropy: 0.002159 lr: 0.001000 speed: 1.464s / iter
iter: 20 / 200000, total loss: 0.507251 cross_entropy: 0.020046 lr: 0.001000 speed: 1.464s / iter
iter: 21 / 200000, total loss: 0.365586 cross_entropy: 0.113681 lr: 0.001000 speed: 1.455s / iter
iter: 22 / 200000, total loss: 0.184315 cross_entropy: 0.084765 lr: 0.001000 speed: 1.467s / iter
iter: 23 / 200000, total loss: 0.200998 cross_entropy: 0.048887 lr: 0.001000 speed: 1.458s / iter
iter: 24 / 200000, total loss: 0.124370 cross_entropy: 0.003205 lr: 0.001000 speed: 1.461s / iter
iter: 25 / 200000, total loss: 0.102922 cross_entropy: 0.059250 lr: 0.001000 speed: 1.467s / iter
iter: 26 / 200000, total loss: 0.175924 cross_entropy: 0.031119 lr: 0.001000 speed: 1.495s / iter
iter: 27 / 200000, total loss: 0.185290 cross_entropy: 0.002968 lr: 0.001000 speed: 1.493s / iter
iter: 28 / 200000, total loss: 0.163398 cross_entropy: 0.005777 lr: 0.001000 speed: 1.484s / iter
Traceback (most recent call last):
  File "./tools/trainval_net.py", line 149, in <module>
    max_iters=args.max_iters)
  File "/gruntdata/disk2/hm/CLWSOD/tools/../lib/model/train_val.py", line 380, in train_net
    sw.train_model(max_iters)
  File "/gruntdata/disk2/hm/CLWSOD/tools/../lib/model/train_val.py", line 294, in train_model
    self.net.train_step(blobs, self.optimizer)
  File "/gruntdata/disk2/hm/CLWSOD/tools/../lib/nets/network.py", line 573, in train_step
    self._losses['total_loss'].backward()
  File "/gruntdata/disk1/anaconda3/envs/hm3/lib/python3.6/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/gruntdata/disk1/anaconda3/envs/hm3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA error: out of memory
Command exited with non-zero status 1
339.25user 70.95system 1:04.60elapsed 634%CPU (0avgtext+0avgdata 3661156maxresident)k
0inputs+184outputs (0major+2307976minor)pagefaults 0swaps
```
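The `UserWarning` in the log comes from indexing a 0-dim (scalar) tensor with `.data[0]`, which newer PyTorch versions reject in favor of `.item()`. A minimal sketch of a version-portable fix (the `hasattr` fallback is an assumption for keeping compatibility with the repo's pinned PyTorch 0.2, where `.item()` does not exist):

```python
import torch

# A 0-dim (scalar) tensor, like the loss values in the log above
loss = torch.tensor(0.5)

# Old style (PyTorch 0.2/0.3):  value = loss.data[0]
# Portable across old and new PyTorch versions:
value = loss.item() if hasattr(loss, "item") else loss.data[0]
print(value)  # 0.5
```

On a modern PyTorch this takes the `.item()` branch and silences the warning; on 0.2 it falls back to the original `.data[0]` indexing.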

Zhang-HM · Jul 02 '19 03:07

With so little information, and no details about your GPU or its memory, it is hard to make any judgement about this error. Besides, as we have pointed out, our PyTorch version is 0.2. Judging from the warnings in your first few lines, you are running a newer version, and we cannot guarantee what will happen in that case.
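For a report like this, the environment details the maintainer asks for can be captured with a few standard PyTorch calls (a sketch; note `torch.cuda.get_device_properties` is available in PyTorch 0.4 and later, not in the repo's pinned 0.2):

```python
import torch

# Report the framework version and GPU memory budget alongside the error
print("PyTorch:", torch.__version__)
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print("GPU:", props.name, f"({props.total_memory / 1024**3:.1f} GiB)")
else:
    print("No CUDA device visible")
```

Including this output with the traceback lets the maintainers tell an undersized GPU apart from a genuine memory leak.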

Sunarker · Jul 03 '19 01:07

Thank you! I solved the problem by following your suggestion!

Zhang-HM · Jul 16 '19 03:07