mmsegmentation
wandb logger given None error if CheckpointHook interval is not greater than EvalHook interval
Dear team,
Thank you for providing the wandb logging hook; it has been very useful. However, there is a small bug that I found annoying and would like to see fixed. I can create a PR if you agree this is a valid error.
- What command or script did you run?
python tools/train.py configs/mobilenet_v3/lraspp_m-v3-d8_512x1024_320k_cityscapes.py
- Did you make any modifications on the code or config? Did you understand what you have modified?
Yes, I made the checkpoint interval and the eval interval the same:
checkpoint_config = dict(by_epoch=False, interval=100)
evaluation = dict(interval=50, metric='mIoU', pre_eval=True)
- What dataset did you use? cityscapes
Environment
Python: 3.8.8 (default, Apr 13 2021, 19:58:26) [GCC 7.3.0]
CUDA available: True
GPU 0: Tesla V100-SXM2-16GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.4, V11.4.48
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.8.1+cu102
PyTorch compiling details: PyTorch built with:
- GCC 7.3
- C++ Version: 201402
- Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 10.2
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70
- CuDNN 7.6.5
- Magma 2.5.2
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=10.2, CUDNN_VERSION=7.6.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,
TorchVision: 0.9.1+cu102
OpenCV: 4.5.2
MMCV: 1.6.0
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 10.2
MMSegmentation: 0.26.0+81a08e4
Error traceback
Traceback (most recent call last):
File "tools/train.py", line 242, in <module>
main()
File "tools/train.py", line 231, in main
train_segmentor(
File "/home/tommy/mmsegmentation/mmseg/apis/train.py", line 194, in train_segmentor
runner.run(data_loaders, cfg.workflow)
File "/home/tommy/anaconda3/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 144, in run
iter_runner(iter_loaders[i], **kwargs)
File "/home/tommy/anaconda3/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 70, in train
self.call_hook('after_train_iter')
File "/home/tommy/anaconda3/lib/python3.8/site-packages/mmcv/runner/base_runner.py", line 317, in call_hook
getattr(hook, fn_name)(self)
File "/home/tommy/anaconda3/lib/python3.8/site-packages/mmcv/runner/hooks/evaluation.py", line 263, in after_train_iter
hook.after_train_iter(runner)
File "/home/tommy/anaconda3/lib/python3.8/site-packages/mmcv/runner/dist_utils.py", line 135, in wrapper
return func(*args, **kwargs)
File "/home/tommy/mmsegmentation/mmseg/core/hook/wandblogger_hook.py", line 192, in after_train_iter
**self._get_eval_results()
File "/home/tommy/mmsegmentation/mmseg/core/hook/wandblogger_hook.py", line 234, in _get_eval_results
eval_results = self.val_dataset.evaluate(
File "/home/tommy/mmsegmentation/mmseg/datasets/custom.py", line 433, in evaluate
ret_metrics = pre_eval_to_metrics(results, metric)
File "/home/tommy/mmsegmentation/mmseg/core/evaluation/metrics.py", line 317, in pre_eval_to_metrics
pre_eval_results = tuple(zip(*pre_eval_results))
TypeError: type object argument after * must be an iterable, not NoneType
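The last traceback frame can be reproduced in isolation: if no evaluation has run yet, pre_eval_results is None, and unpacking None with * in zip(*pre_eval_results) raises exactly this TypeError. A minimal sketch (not mmseg code):

```python
# pre_eval_results stands in for the value the metrics code receives when
# no evaluation has happened yet.
pre_eval_results = None
try:
    # Same operation as mmseg/core/evaluation/metrics.py line 317.
    tuple(zip(*pre_eval_results))
except TypeError as exc:
    print(type(exc).__name__)  # TypeError
```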
Bug fix
A simple fix can solve the issue: either make the ckpt_interval greater than the evaluation interval, or add another condition to the if statement.
The problem is here:
if (self.log_checkpoint
        and self.every_n_iters(runner, self.ckpt_interval)
        or (self.ckpt_hook.save_last and self.is_last_iter(runner))):
    if self.log_checkpoint_metadata and self.eval_hook:
        metadata = {
            'iter': runner.iter + 1,
            **self._get_eval_results()
        }
    else:
        metadata = None
At each ckpt_interval, the wandb logger tries to fetch the evaluation results in _get_eval_results by calling results = self.eval_hook.latest_results. However, even when the EvalHook has a higher priority, if the checkpoint interval is the same as the eval interval, the code above is still executed before any evaluation has been done, so latest_results is still None. This results in the None error.
My fix is to change

if self.log_checkpoint_metadata and self.eval_hook:

to

if self.log_checkpoint_metadata and self.eval_hook and self.eval_hook.latest_results:
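As a sketch of why the extra check helps (with a hypothetical FakeEvalHook and build_metadata standing in for the real hook objects, not mmseg code), the added truthiness test short-circuits before the metadata lookup when no evaluation has run yet:

```python
class FakeEvalHook:
    """Stand-in for mmcv's EvalHook; only latest_results matters here."""

    def __init__(self, latest_results=None):
        self.latest_results = latest_results


def build_metadata(log_checkpoint_metadata, eval_hook, iteration):
    # Patched condition: also require that eval results actually exist.
    if log_checkpoint_metadata and eval_hook and eval_hook.latest_results:
        return {'iter': iteration, 'results': eval_hook.latest_results}
    return None


# Before any evaluation: latest_results is None, metadata is skipped safely
# instead of crashing inside _get_eval_results.
print(build_metadata(True, FakeEvalHook(), 100))  # None
# After an evaluation has populated results: metadata is logged as before.
print(build_metadata(True, FakeEvalHook([0.7]), 100))
```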
looking forward to hearing from you!
Hi @ConvMech, thanks for your bug report; we reproduced this error as well. Feel free to create a PR to fix this problem if you want.
@xiexinch thank you for your testing, I will create a PR. Should I post the PR link here once I have finished it?
Hey @ConvMech if you would like to make a PR, I would happily review it. :)
Hi @ConvMech, sorry for the late reply; you just need to link this issue in your PR.