
out of memory

Open xuzhang5788 opened this issue 6 years ago • 2 comments

@Hanjun-Dai Thank you so much for uploading your code to GitHub.

I followed steps 1 to 5 of your instructions, skipping step 4, and everything went well. However, when I ran ./run_sample_prior.sh and ./run_valid_prior.sh, I got error messages like these:

xuzhang@xuzhang1:/media/projects/sdvae/mol_vae/pytorch_eval$ ./run_sample_prior.sh
save_dir for use is ../../dropbox/results/zinc
using vae
a Conv1d inited
a Conv1d inited
a Conv1d inited
a Linear inited
a Linear inited
a Linear inited
a Linear inited
/media/projects/sdvae/mol_vae/pytorch_eval/../mol_common/pytorch_initializer.py:36: UserWarning: nn.init.orthogonal is now deprecated in favor of nn.init.orthogonal_.
  nn.init.orthogonal(x0)
/media/projects/sdvae/mol_vae/pytorch_eval/../mol_common/pytorch_initializer.py:37: UserWarning: nn.init.orthogonal is now deprecated in favor of nn.init.orthogonal_.
  nn.init.orthogonal(x1)
/media/projects/sdvae/mol_vae/pytorch_eval/../mol_common/pytorch_initializer.py:38: UserWarning: nn.init.orthogonal is now deprecated in favor of nn.init.orthogonal_.
  nn.init.orthogonal(x2)
a GRU inited
a Linear inited
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1524580978845/work/aten/src/THC/THCTensorRandom.cu line=25 error=2 : out of memory
Traceback (most recent call last):
  File "sample_prior.py", line 65, in <module>
    main()
  File "sample_prior.py", line 57, in main
    model = ProxyModel()
  File "/media/projects/sdvae/mol_vae/pytorch_eval/att_model_proxy.py", line 94, in __init__
    self.ae = self.ae.cuda()
  File "/home/xuzhang/anaconda3/lib/python3.5/site-packages/torch/nn/modules/module.py", line 249, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/home/xuzhang/anaconda3/lib/python3.5/site-packages/torch/nn/modules/module.py", line 176, in _apply
    module._apply(fn)
  File "/home/xuzhang/anaconda3/lib/python3.5/site-packages/torch/nn/modules/module.py", line 176, in _apply
    module._apply(fn)
  File "/home/xuzhang/anaconda3/lib/python3.5/site-packages/torch/nn/modules/module.py", line 182, in _apply
    param.data = fn(param.data)
  File "/home/xuzhang/anaconda3/lib/python3.5/site-packages/torch/nn/modules/module.py", line 249, in <lambda>
    return self._apply(lambda t: t.cuda(device))
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1524580978845/work/aten/src/THC/THCTensorRandom.cu:25
> /home/xuzhang/anaconda3/lib/python3.5/site-packages/torch/nn/modules/module.py(249)<lambda>()
-> return self._apply(lambda t: t.cuda(device))
(Pdb)
[7]+ Stopped ./run_sample_prior.sh

xuzhang@xuzhang1:/media/projects/sdvae/mol_vae/pytorch_eval$ ./run_valid_prior.sh
save_dir for use is ../../dropbox/results/zinc
using vae
a Conv1d inited
a Conv1d inited
a Conv1d inited
a Linear inited
a Linear inited
a Linear inited
a Linear inited
/media/projects/sdvae/mol_vae/pytorch_eval/../mol_common/pytorch_initializer.py:36: UserWarning: nn.init.orthogonal is now deprecated in favor of nn.init.orthogonal_.
  nn.init.orthogonal(x0)
/media/projects/sdvae/mol_vae/pytorch_eval/../mol_common/pytorch_initializer.py:37: UserWarning: nn.init.orthogonal is now deprecated in favor of nn.init.orthogonal_.
  nn.init.orthogonal(x1)
/media/projects/sdvae/mol_vae/pytorch_eval/../mol_common/pytorch_initializer.py:38: UserWarning: nn.init.orthogonal is now deprecated in favor of nn.init.orthogonal_.
  nn.init.orthogonal(x2)
a GRU inited
a Linear inited
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1524580978845/work/aten/src/THC/THCTensorRandom.cu line=25 error=2 : out of memory
Traceback (most recent call last):
  File "valid_prior.py", line 59, in <module>
    main()
  File "valid_prior.py", line 46, in main
    model = ProxyModel()
  File "/media/projects/sdvae/mol_vae/pytorch_eval/att_model_proxy.py", line 94, in __init__
    self.ae = self.ae.cuda()
  File "/home/xuzhang/anaconda3/lib/python3.5/site-packages/torch/nn/modules/module.py", line 249, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/home/xuzhang/anaconda3/lib/python3.5/site-packages/torch/nn/modules/module.py", line 176, in _apply
    module._apply(fn)
  File "/home/xuzhang/anaconda3/lib/python3.5/site-packages/torch/nn/modules/module.py", line 176, in _apply
    module._apply(fn)
  File "/home/xuzhang/anaconda3/lib/python3.5/site-packages/torch/nn/modules/module.py", line 182, in _apply
    param.data = fn(param.data)
  File "/home/xuzhang/anaconda3/lib/python3.5/site-packages/torch/nn/modules/module.py", line 249, in <lambda>
    return self._apply(lambda t: t.cuda(device))
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1524580978845/work/aten/src/THC/THCTensorRandom.cu:25
> /home/xuzhang/anaconda3/lib/python3.5/site-packages/torch/nn/modules/module.py(249)<lambda>()
-> return self._apply(lambda t: t.cuda(device))
(Pdb)
[8]+ Stopped ./run_valid_prior.sh

I am using Python 3.5, so I converted the .py files from Python 2 to Python 3 with the 2to3 command. PyTorch is 0.4.0 and CUDA is 8.0. Could the newer PyTorch version be causing the problem? Thanks in advance.

Update:

I downgraded PyTorch from 0.4.0 to 0.3.1, but the errors are still there.

xuzhang5788 · Jul 10 '18 18:07

Update: I found that after I canceled the jobs, the GPU memory they held was not released, so usage accumulated until CUDA ran out of memory. The real error is:

xuzhang@xuzhang1:/media/projects/sdvae/mol_vae/pytorch_eval$ ./run_sample_prior.sh
save_dir for use is ../../dropbox/results/zinc
using vae
a Conv1d inited
a Conv1d inited
a Conv1d inited
a Linear inited
a Linear inited
a Linear inited
a Linear inited
a GRU inited
a Linear inited
using mol_zinc.grammar
Traceback (most recent call last):
  File "sample_prior.py", line 65, in <module>
    main()
  File "sample_prior.py", line 59, in main
    cal_valid_prior(model, cmd_args.latent_dim)
  File "sample_prior.py", line 27, in cal_valid_prior
    decoded_array = batch_decode(raw_logits, True, decode_times=sample_times)
  File "/media/projects/sdvae/mol_vae/pytorch_eval/att_model_proxy.py", line 76, in batch_decode
    for i in range(0, raw_logits.shape[1], size):
  File "/home/xuzhang/anaconda3/lib/python3.5/site-packages/past/builtins/noniterators.py", line 252, in oldrange
    return list(builtins.range(*args, **kwargs))
TypeError: 'float' object cannot be interpreted as an integer
> /home/xuzhang/anaconda3/lib/python3.5/site-packages/past/builtins/noniterators.py(252)oldrange()
-> return list(builtins.range(*args, **kwargs))
(Pdb)
[1]+ Stopped

I think it is because of the difference between range() and xrange(), but I am not sure how to correct it.

xuzhang5788 · Jul 10 '18 19:07

I solved the above problem by using //8 instead of /8 at line 73 of /sdvae/mol_vae/pytorch_eval/att_model_proxy.py.
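For anyone hitting the same TypeError, here is a minimal sketch of why the change works. It does not reproduce the actual line 73; the helper name chunk_starts and the numbers are made up for illustration. In Python 3 the / operator always returns a float, 2to3 does not rewrite divisions, and range() rejects a float step, so the step has to come from floor division.

```python
def chunk_starts(total, parts=8):
    """Start offsets that split `total` items into at most `parts` chunks.

    Hypothetical helper for illustration; not the code in att_model_proxy.py.
    """
    # `//` keeps the chunk size an int under Python 3; adding parts - 1
    # rounds up so the last partial chunk is still covered.
    size = max(1, (total + parts - 1) // parts)
    return list(range(0, total, size))


# With true division the step is a float, which range() refuses.
try:
    range(0, 20, 20 / 8)
except TypeError as err:
    message = str(err)

starts = chunk_starts(20)
print(message)
print(starts)
```

The same applies to any other place in the converted scripts where the result of a division is used as an index, a size, or a range() argument.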

Thank you very much

xuzhang5788 · Jul 10 '18 21:07