HGL-pytorch

Problem restoring a checkpoint

Open jaeyun95 opened this issue 5 years ago • 2 comments

Hi! I have a problem restoring a checkpoint. Training stopped, so I tried to restore from the checkpoint but got an error. Help! T^T

restore is True
Found folder! restoring
Traceback (most recent call last):
  File "train.py", line 122, in <module>
    learning_rate_scheduler=scheduler)
  File "/home/ailab/HGL-pytorch/utils/pytorch_misc.py", line 226, in restore_checkpoint
    training_state = torch.load(training_state_path, map_location=device_mapping(-1))
  File "/home/ailab/anaconda3/envs/r2c/lib/python3.6/site-packages/torch/serialization.py", line 368, in load
    return _load(f, map_location, pickle_module)
  File "/home/ailab/anaconda3/envs/r2c/lib/python3.6/site-packages/torch/serialization.py", line 549, in _load
    deserialized_objects[key]._set_from_file(f, offset, f_should_read_directly)
RuntimeError: unexpected EOF, expected 4859355 more bytes. The file might be corrupted.
terminate called after throwing an instance of 'c10::Error'
  what():  owning_ptr == NullType::singleton() || owning_ptr->refcount_.load() > 0 ASSERT FAILED at /opt/conda/conda-bld/pytorch_1549628766161/work/c10/util/intrusive_ptr.h:350, please report a bug to PyTorch. intrusive_ptr: Can only intrusive_ptr::reclaim() owning pointers that were created using intrusive_ptr::release(). (reclaim at /opt/conda/conda-bld/pytorch_1549628766161/work/c10/util/intrusive_ptr.h:350)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7f3920592cf5 in /home/ailab/anaconda3/envs/r2c/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: THStorage_free + 0xca (0x7f38d72a68ea in /home/ailab/anaconda3/envs/r2c/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #2: <unknown function> + 0x12c11d (0x7f39208d011d in /home/ailab/anaconda3/envs/r2c/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #17: __libc_start_main + 0xf0 (0x7f39266a8830 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

jaeyun95 · Jan 22 '20 13:01

It seems that you loaded an incomplete file. Could you check the path where your checkpoint is saved and make sure the checkpoint was actually written in full?
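
If the file turns out to be truncated, one possible cause (an assumption, since the issue does not say how training stopped) is that the process was interrupted while the checkpoint was being written. A minimal sketch of a safer save pattern is to write to a temporary file and rename it into place only after the write succeeds. The names `atomic_write` and `write_fn` are hypothetical helpers for illustration; they are not part of HGL-pytorch or PyTorch.

```python
import os
import tempfile


def atomic_write(path, write_fn):
    """Write a file through a temporary file, then rename it into place.

    `write_fn` is a hypothetical callback that receives an open binary
    file object and writes the checkpoint bytes into it.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            write_fn(f)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the bytes to disk
        # Swap the finished file in; an existing checkpoint is replaced
        # only at this point.
        os.replace(tmp_path, path)
    except BaseException:
        # The write failed or was interrupted: drop the partial temp file
        # and leave any previous checkpoint untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With PyTorch this could be called as `atomic_write(training_state_path, lambda f: torch.save(training_state, f))`, since `torch.save` accepts a file-like object. If the job dies mid-save, the previous checkpoint stays readable instead of being left half-written.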

yuweijiang · Feb 24 '20 23:02

Hi, how much GPU memory did you need to run this code successfully? @jaeyun95

tuyunbin · Jun 16 '20 02:06