
MemoryError message during the validation phase of training

fschvart opened this issue 2 years ago · 0 comments

I'm running the UperNet-Swin model with a custom binary dataset.

The first training steps proceed well, but when the validation step kicks in I receive the following error:

```
  File "C:\ProgramData\Miniconda3\envs\mmsegment\lib\site-packages\torch\serialization.py", line 379, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol)
  File "C:\ProgramData\Miniconda3\envs\mmsegment\lib\site-packages\torch\serialization.py", line 604, in _save
    zip_file.write_record(name, storage.data_ptr(), num_bytes)
MemoryError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\ProgramData\Miniconda3\envs\mmsegment\lib\site-packages\torch\serialization.py", line 380, in save
    return
  File "C:\ProgramData\Miniconda3\envs\mmsegment\lib\site-packages\torch\serialization.py", line 259, in __exit__
    self.file_like.write_end_of_file()
RuntimeError: [enforce fail at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\caffe2\serialize\inline_container.cc:319] . unexpected pos 1725150400 vs 1725150288

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\mmsegmentation\custom_dataset.py", line 130, in <module>
    train_segmentor(model, datasets, cfg, distributed=False, validate=True, meta=dict())
  File "C:\mmsegmentation\mmseg\apis\train.py", line 194, in train_segmentor
    runner.run(data_loaders, cfg.workflow)
  File "C:\ProgramData\Miniconda3\envs\mmsegment\lib\site-packages\mmcv\runner\iter_based_runner.py", line 144, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "C:\ProgramData\Miniconda3\envs\mmsegment\lib\site-packages\mmcv\runner\iter_based_runner.py", line 70, in train
    self.call_hook('after_train_iter')
  File "C:\ProgramData\Miniconda3\envs\mmsegment\lib\site-packages\mmcv\runner\base_runner.py", line 317, in call_hook
    getattr(hook, fn_name)(self)
  File "C:\ProgramData\Miniconda3\envs\mmsegment\lib\site-packages\mmcv\runner\hooks\checkpoint.py", line 168, in after_train_iter
    self._save_checkpoint(runner)
  File "C:\ProgramData\Miniconda3\envs\mmsegment\lib\site-packages\mmcv\runner\dist_utils.py", line 135, in wrapper
    return func(*args, **kwargs)
  File "C:\ProgramData\Miniconda3\envs\mmsegment\lib\site-packages\mmcv\runner\hooks\checkpoint.py", line 122, in _save_checkpoint
    runner.save_checkpoint(
  File "C:\ProgramData\Miniconda3\envs\mmsegment\lib\site-packages\mmcv\runner\iter_based_runner.py", line 226, in save_checkpoint
    save_checkpoint(self.model, filepath, optimizer=optimizer, meta=meta)
  File "C:\ProgramData\Miniconda3\envs\mmsegment\lib\site-packages\mmcv\runner\checkpoint.py", line 809, in save_checkpoint
    torch.save(checkpoint, f)
  File "C:\ProgramData\Miniconda3\envs\mmsegment\lib\site-packages\torch\serialization.py", line 381, in save
    _legacy_save(obj, opened_file, pickle_module, pickle_protocol)
  File "C:\ProgramData\Miniconda3\envs\mmsegment\lib\site-packages\torch\serialization.py", line 225, in __exit__
    self.file_like.flush()
ValueError: I/O operation on closed file.
```
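The traceback shows the failure happens inside mmcv's `CheckpointHook` while `torch.save` serializes the checkpoint, i.e. the process runs out of memory (or Windows pagefile) while writing the full model plus optimizer state. One workaround that may shrink the serialized payload is to tune the checkpoint config. This is a sketch, not a confirmed fix; the `interval` value is a hypothetical example, while `save_optimizer` and `max_keep_ckpts` are options of mmcv 1.x's `CheckpointHook`:

```python
# Sketch of a config-level workaround (assumes mmcv 1.x CheckpointHook).
# Skipping the optimizer state roughly halves the checkpoint file, and
# capping kept checkpoints limits disk pressure; training itself is
# unchanged (but you cannot resume with optimizer state afterwards).
checkpoint_config = dict(
    by_epoch=False,        # IterBasedRunner saves by iteration
    interval=4000,         # save every 4000 iterations (hypothetical value)
    save_optimizer=False,  # omit optimizer state to shrink the file
    max_keep_ckpts=2,      # keep only the two most recent checkpoints
)
```
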

I used the standard setup, only changing the parameters and registering my binary dataset.
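For context, registering a two-class dataset in MMSegmentation 0.x typically looks like the sketch below (the class name, file suffixes, and palette here are illustrative assumptions, not the exact code from this setup):

```python
# Minimal sketch: registering a binary dataset in MMSegmentation 0.x.
# Class name, suffixes, and palette are illustrative assumptions.
from mmseg.datasets.builder import DATASETS
from mmseg.datasets.custom import CustomDataset


@DATASETS.register_module()
class BinaryDataset(CustomDataset):
    """Binary segmentation dataset: background vs. foreground."""

    CLASSES = ('background', 'foreground')
    PALETTE = [[0, 0, 0], [255, 255, 255]]

    def __init__(self, **kwargs):
        super().__init__(
            img_suffix='.png',
            seg_map_suffix='.png',
            reduce_zero_label=False,  # label 0 is a real class here
            **kwargs)
```

The config then refers to the dataset by name, e.g. `dataset_type = 'BinaryDataset'`.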


Environment

sys.platform: win32
Python: 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:51:29) [MSC v.1929 64 bit (AMD64)]
CUDA available: True
GPU 0,1: NVIDIA GeForce RTX 3090
CUDA_HOME: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.6
NVCC: Cuda compilation tools, release 11.6, V11.6.124
MSVC: Microsoft (R) C/C++ Optimizing Compiler Version 19.32.31332 for x64
GCC: n/a
PyTorch: 1.12.1+cu116
PyTorch compiling details: PyTorch built with:

  • C++ Version: 199711
  • MSVC 192829337
  • Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
  • OpenMP 2019
  • LAPACK is enabled (usually provided by MKL)
  • CPU capability usage: AVX2
  • CUDA Runtime 11.6
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  • CuDNN 8.3.2 (built against CUDA 11.5)
  • Magma 2.5.4
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.6, CUDNN_VERSION=8.3.2, CXX_COMPILER=C:/actions-runner/_work/pytorch/pytorch/builder/windows/tmp_bin/sccache-cl.exe, CXX_FLAGS=/DWIN32 /D_WINDOWS /GR /EHsc /w /bigobj -DUSE_PTHREADPOOL -openmp:experimental -IC:/actions-runner/_work/pytorch/pytorch/builder/windows/mkl/include -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DUSE_FBGEMM -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.12.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=OFF, USE_OPENMP=ON, USE_ROCM=OFF,

TorchVision: 0.13.1+cu116
OpenCV: 4.6.0
MMCV: 1.6.1
MMCV Compiler: MSVC 192930140
MMCV CUDA Compiler: 11.6
MMSegmentation: 0.26.0+13d4c39

fschvart · Aug 07 '22 22:08