wandb logger gives a None error if the CheckpointHook interval is not greater than the EvalHook interval

Open ConvMech opened this issue 3 years ago • 4 comments

Dear team,

Thank you for providing the wandb logging hook; it has been very useful. However, I found a small but annoying bug that I would like to have fixed. I can create a PR if you think this is a valid error.

  1. What command or script did you run?

python tools/train.py configs/mobilenet_v3/lraspp_m-v3-d8_512x1024_320k_cityscapes.py

  2. Did you make any modifications on the code or config? Did you understand what you have modified?

Yes, I made the checkpoint interval and the eval interval the same:

checkpoint_config = dict(by_epoch=False, interval=100)
evaluation = dict(interval=50, metric='mIoU', pre_eval=True)

  3. What dataset did you use? cityscapes

Environment

Python: 3.8.8 (default, Apr 13 2021, 19:58:26) [GCC 7.3.0]
CUDA available: True
GPU 0: Tesla V100-SXM2-16GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.4, V11.4.48
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.8.1+cu102
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 10.2
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70
  - CuDNN 7.6.5
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=10.2, CUDNN_VERSION=7.6.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, 

TorchVision: 0.9.1+cu102
OpenCV: 4.5.2
MMCV: 1.6.0
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 10.2
MMSegmentation: 0.26.0+81a08e4

Error traceback

Traceback (most recent call last):                                                                                                                      
  File "tools/train.py", line 242, in <module>                                                                                                          
    main()                                                                                                                                              
  File "tools/train.py", line 231, in main                                                                                                              
    train_segmentor(                                                                                                                                    
  File "/home/tommy/mmsegmentation/mmseg/apis/train.py", line 194, in train_segmentor                                                                   
    runner.run(data_loaders, cfg.workflow)                                                                                                              
  File "/home/tommy/anaconda3/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 144, in run                                           
    iter_runner(iter_loaders[i], **kwargs)                                                                                                              
  File "/home/tommy/anaconda3/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 70, in train                                          
    self.call_hook('after_train_iter')                                                                                                                  
  File "/home/tommy/anaconda3/lib/python3.8/site-packages/mmcv/runner/base_runner.py", line 317, in call_hook                                           
    getattr(hook, fn_name)(self)                                                                                                                        
  File "/home/tommy/anaconda3/lib/python3.8/site-packages/mmcv/runner/hooks/evaluation.py", line 263, in after_train_iter                               
    hook.after_train_iter(runner)                                                                                                                       
  File "/home/tommy/anaconda3/lib/python3.8/site-packages/mmcv/runner/dist_utils.py", line 135, in wrapper                                              
    return func(*args, **kwargs)                                                                                                                        
  File "/home/tommy/mmsegmentation/mmseg/core/hook/wandblogger_hook.py", line 192, in after_train_iter                                                  
    **self._get_eval_results()                                                                                                                          
  File "/home/tommy/mmsegmentation/mmseg/core/hook/wandblogger_hook.py", line 234, in _get_eval_results                                                 
    eval_results = self.val_dataset.evaluate(                                                                                                           
  File "/home/tommy/mmsegmentation/mmseg/datasets/custom.py", line 433, in evaluate                                                                     
    ret_metrics = pre_eval_to_metrics(results, metric)                                                                                                  
  File "/home/tommy/mmsegmentation/mmseg/core/evaluation/metrics.py", line 317, in pre_eval_to_metrics
    pre_eval_results = tuple(zip(*pre_eval_results))
TypeError: type object argument after * must be an iterable, not NoneType

Bug fix

A simple fix can solve the issue: either make the ckpt_interval greater than the evaluation interval, or add another condition to the if statement.

The problem is here:

if (self.log_checkpoint
        and self.every_n_iters(runner, self.ckpt_interval)
        or (self.ckpt_hook.save_last and self.is_last_iter(runner))):
    if self.log_checkpoint_metadata and self.eval_hook:
        metadata = {
            'iter': runner.iter + 1,
            **self._get_eval_results()
        }
    else:
        metadata = None

At every ckpt_interval, the wandb logger calls _get_eval_results, which fetches the results via results = self.eval_hook.latest_results. However, even though the EvalHook has a higher priority, when the checkpoint hook has the same interval as the eval hook, the code above is somehow still called before any evaluation has been done. latest_results is then still None, which results in the None error shown in the traceback.
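The failure at the bottom of the traceback can be reproduced without mmseg. The following is only a minimal sketch: pre_eval_to_metrics_stub is a hypothetical stand-in that copies the single line from the traceback (tuple(zip(*pre_eval_results))), and latest_results = None is an assumption standing in for an EvalHook that has not evaluated yet.

```python
def pre_eval_to_metrics_stub(pre_eval_results):
    """Hypothetical stand-in: only the transpose step seen in the traceback."""
    return tuple(zip(*pre_eval_results))


# Normal case: one tuple of statistics per evaluated sample.
transposed = pre_eval_to_metrics_stub([(1, 2), (3, 4)])
print(transposed)  # ((1, 3), (2, 4))

# Assumed state before the first evaluation has run.
latest_results = None
try:
    pre_eval_to_metrics_stub(latest_results)
except TypeError as exc:
    # Unpacking None with * is what raises the TypeError.
    error_message = str(exc)
    print(error_message)
```

The exact prefix of the message may differ between Python versions, but the "must be an iterable, not NoneType" part is the same one seen in the traceback.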

My fix is to change the inner condition from

if self.log_checkpoint_metadata and self.eval_hook

to

if self.log_checkpoint_metadata and self.eval_hook and self.eval_hook.latest_results
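To illustrate the effect of the extra condition, here is a small standalone sketch. It does not use the real hook classes: StubEvalHook and build_metadata are hypothetical names introduced only for this example, and get_eval_results stands in for self._get_eval_results.

```python
class StubEvalHook:
    """Hypothetical stand-in for the EvalHook; only latest_results is modeled."""

    def __init__(self, latest_results=None):
        self.latest_results = latest_results


def build_metadata(iteration, log_checkpoint_metadata, eval_hook, get_eval_results):
    """Return checkpoint metadata, or None when no eval results exist yet."""
    # The extra `eval_hook.latest_results` check is the proposed guard.
    if log_checkpoint_metadata and eval_hook and eval_hook.latest_results:
        return {'iter': iteration + 1, **get_eval_results()}
    return None


def fake_eval_results():
    # Made-up metric value, for illustration only.
    return {'mIoU': 0.5}


# Before any evaluation: the guard skips the metadata instead of raising.
print(build_metadata(99, True, StubEvalHook(None), fake_eval_results))  # None

# After an evaluation: the metadata is built as before.
print(build_metadata(99, True, StubEvalHook([(1, 2, 3, 4)]), fake_eval_results))
# {'iter': 100, 'mIoU': 0.5}
```

With the guard, a checkpoint logged before the first evaluation simply carries no metadata. The truthiness check also skips an empty result list; an explicit `is not None` comparison would be an alternative if that distinction matters.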

looking forward to hearing from you!

ConvMech avatar Jul 29 '22 00:07 ConvMech

Hi @ConvMech, thanks for your bug report; we also reproduced this error. Feel free to create a PR to fix this problem if you want.

xiexinch avatar Aug 01 '22 10:08 xiexinch

@xiexinch thank you for testing. I will create a PR. Should I post the PR link here once I have finished it?

ConvMech avatar Aug 03 '22 22:08 ConvMech

Hey @ConvMech if you would like to make a PR, I would happily review it. :)

ayulockin avatar Aug 08 '22 15:08 ayulockin

> @xiexinch thank you for testing. I will create a PR. Should I post the PR link here once I have finished it?

Hi @ConvMech, sorry for the late reply. You just need to link this issue in your PR.

xiexinch avatar Aug 30 '22 02:08 xiexinch