
batch problem in training process

Open onionysy opened this issue 2 years ago • 0 comments

In the course of training, we encountered the problem below:

```
/home/buaa/anaconda3/envs/vit/bin/python3.6 /snap/pycharm-community/302/plugins/python-ce/helpers/pydev/pydevd.py --multiprocess --qt-support=auto --client 127.0.0.1 --port 45947 --file /home/buaa/songyue/lawin-master/tools/train.py
Connected to pydev debugger (build 222.4345.23)
fatal: not a git repository (or any of the parent directories): .git
2022-10-21 17:23:32,633 - mmseg - INFO - Environment info:

sys.platform: linux
Python: 3.6.2 |Continuum Analytics, Inc.| (default, Jul 20 2017, 13:51:32) [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
CUDA available: True
GPU 0,1: NVIDIA GeForce RTX 3090
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.3.r11.3/compiler.29920130_0
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.8.1+cu111
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.1
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  - CuDNN 8.0.5
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON

TorchVision: 0.9.1+cu111
OpenCV: 4.6.0
MMCV: 1.2.7
MMCV Compiler: GCC 7.5
MMCV CUDA Compiler: 11.3
MMSegmentation: 0.11.0+
```

```
2022-10-21 17:23:32,633 - mmseg - INFO - Distributed training: True
2022-10-21 17:23:33,165 - mmseg - INFO - Config:
norm_cfg = dict(type='SyncBN', requires_grad=True)
find_unused_parameters = True
................................................................
2022-10-21 17:23:34,101 - mmseg - INFO - Loaded 4750 images
fatal: not a git repository (or any of the parent directories): .git
2022-10-21 17:23:36,849 - mmseg - INFO - Loaded 1188 images
2022-10-21 17:23:36,850 - mmseg - INFO - Start running, host: buaa@buaa-System-Product-Name, work_dir: /home/buaa/songyue/lawin-master/workdir
2022-10-21 17:23:36,850 - mmseg - INFO - workflow: [('train', 1)], max: 160000 iters
Traceback (most recent call last):
  File "/snap/pycharm-community/302/plugins/python-ce/helpers/pydev/pydevd.py", line 1496, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/snap/pycharm-community/302/plugins/python-ce/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/home/buaa/songyue/lawin-master/tools/train.py", line 174, in <module>
    main()
  File "/home/buaa/songyue/lawin-master/tools/train.py", line 170, in main
    meta=meta)
  File "/home/buaa/songyue/lawin-master/mmseg/apis/train.py", line 115, in train_segmentor
    runner.run(data_loaders, cfg.workflow)
  File "/home/buaa/anaconda3/envs/vit/lib/python3.6/site-packages/mmcv/runner/iter_based_runner.py", line 131, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/home/buaa/anaconda3/envs/vit/lib/python3.6/site-packages/mmcv/runner/iter_based_runner.py", line 60, in train
    outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
  File "/home/buaa/anaconda3/envs/vit/lib/python3.6/site-packages/mmcv/parallel/distributed.py", line 46, in train_step
    output = self.module.train_step(*inputs[0], **kwargs[0])
  File "/home/buaa/songyue/lawin-master/mmseg/models/segmentors/base.py", line 152, in train_step
    losses = self(**data_batch)
  File "/home/buaa/anaconda3/envs/vit/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/buaa/anaconda3/envs/vit/lib/python3.6/site-packages/mmcv/runner/fp16_utils.py", line 84, in new_func
    return old_func(*args, **kwargs)
  File "/home/buaa/songyue/lawin-master/mmseg/models/segmentors/base.py", line 122, in forward
    return self.forward_train(img, img_metas, **kwargs)
  File "/home/buaa/songyue/lawin-master/mmseg/models/segmentors/encoder_decoder.py", line 158, in forward_train
    gt_semantic_seg)
  File "/home/buaa/songyue/lawin-master/mmseg/models/segmentors/encoder_decoder.py", line 102, in _decode_head_forward_train
    self.train_cfg)
  File "/home/buaa/songyue/lawin-master/mmseg/models/decode_heads/decode_head.py", line 188, in forward_train
    seg_logits = self.forward(inputs)
  File "/home/buaa/songyue/lawin-master/mmseg/models/decode_heads/lawin_head.py", line 328, in forward
    abc = self.image_pool(_c)
  File "/home/buaa/anaconda3/envs/vit/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/buaa/anaconda3/envs/vit/lib/python3.6/site-packages/torch/nn/modules/container.py", line 119, in forward
    input = module(input)
  File "/home/buaa/anaconda3/envs/vit/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/buaa/anaconda3/envs/vit/lib/python3.6/site-packages/mmcv/cnn/bricks/conv_module.py", line 195, in forward
    x = self.norm(x)
  File "/home/buaa/anaconda3/envs/vit/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/buaa/anaconda3/envs/vit/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 539, in forward
    bn_training, exponential_average_factor, self.eps)
  File "/home/buaa/anaconda3/envs/vit/lib/python3.6/site-packages/torch/nn/functional.py", line 2147, in batch_norm
    _verify_batch_size(input.size())
  File "/home/buaa/anaconda3/envs/vit/lib/python3.6/site-packages/torch/nn/functional.py", line 2114, in _verify_batch_size
    raise ValueError("Expected more than 1 value per channel when training, got input size {}".format(size))
ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 512, 1, 1])
python-BaseException
Backend TkAgg is interactive backend. Turning interactive mode on.
```

We believe this is caused by the batch size being 1, but we don't know where the change should be made. There may be something in the configuration that we have overlooked. Could you give us some suggestions? Looking forward to your reply!
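For reference, here is a minimal sketch (plain PyTorch, not the actual `lawin_head.py` code) of what we think is happening: the `image_pool` branch average-pools the features down to 1×1, so with a per-GPU batch size of 1 the BatchNorm layer receives a `[1, 512, 1, 1]` tensor and cannot compute batch statistics.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the image_pool branch: global average pooling
# followed by a 1x1 conv + BatchNorm, similar in spirit to what the
# traceback points at (ConvModule with a norm layer after AdaptiveAvgPool2d).
image_pool = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),             # spatial dims collapse to 1x1
    nn.Conv2d(512, 512, 1, bias=False),
    nn.BatchNorm2d(512),                 # BN needs >1 value per channel in train mode
    nn.ReLU(inplace=True),
)
image_pool.train()

x1 = torch.randn(1, 512, 32, 32)         # per-GPU batch size 1
try:
    image_pool(x1)
except ValueError as e:
    # Expected more than 1 value per channel when training,
    # got input size torch.Size([1, 512, 1, 1])
    print(e)

x2 = torch.randn(2, 512, 32, 32)          # batch size >= 2 works
print(image_pool(x2).shape)               # torch.Size([2, 512, 1, 1])
```

If we understand the mmseg-style config correctly, the per-GPU batch size is set by `samples_per_gpu` in the `data` dict (e.g. `data = dict(samples_per_gpu=2, workers_per_gpu=2, ...)`), but we are not sure whether raising it is the intended fix here, or whether the head is supposed to work with a batch size of 1.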

onionysy · Oct 21 '22 09:10