mmpose
[Bug] Reproducing RTMPose on the COCO dataset: acc_pose keeps hovering between 0.4 and 0.5 after 120 epochs of training. How should I solve this?
Prerequisite
- [X] I have searched Issues and Discussions but cannot get the expected help.
- [X] The bug has not been fixed in the latest version(https://github.com/open-mmlab/mmpose).
Environment
OrderedDict([('sys.platform', 'linux'), ('Python', '3.8.20 (default, Oct 3 2024, 15:24:27) [GCC 11.2.0]'), ('CUDA available', True), ('MUSA available', False), ('numpy_random_seed', 2147483648), ('GPU 0', 'NVIDIA GeForce RTX 3090'), ('CUDA_HOME', '/usr/local/cuda-11.8'), ('NVCC', 'Cuda compilation tools, release 11.8, V11.8.89'), ('GCC', 'gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0'), ('PyTorch', '2.0.1+cu117'), ('PyTorch compiling details', 'PyTorch built with:\n - GCC 9.3\n - C++ Version: 201703\n - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications\n - Intel(R) MKL-DNN v2.7.3 (Git Hash 6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)\n - OpenMP 201511 (a.k.a. OpenMP 4.5)\n - LAPACK is enabled (usually provided by MKL)\n - NNPACK is enabled\n - CPU capability usage: AVX2\n - CUDA Runtime 11.7\n - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86\n - CuDNN 8.5\n - Magma 2.6.1\n - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.7, CUDNN_VERSION=8.5.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic 
-Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.0.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, \n'), ('TorchVision', '0.15.2+cu117'), ('OpenCV', '4.10.0'), ('MMEngine', '0.10.5'), ('MMPose', '1.1.0+')])
Reproduces the problem - code sample
0.210927 loss_kpt: 0.210927 acc_pose: 0.470607
10/13 10:00:56 - mmengine - INFO - Epoch(train) [105][4200/4853] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 2 days, 3:31:06 time: 0.126481 data_time: 0.023456 memory: 3826 loss: 0.207607 loss_kpt: 0.207607 acc_pose: 0.455742
10/13 10:01:03 - mmengine - INFO - Epoch(train) [105][4250/4853] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 2 days, 3:31:00 time: 0.122026 data_time: 0.019312 memory: 3826 loss: 0.207197 loss_kpt: 0.207197 acc_pose: 0.522144
10/13 10:01:07 - mmengine - INFO - Exp name: rtmpose-l_8xb256-420e_coco-256x192_20241012_164653
10/13 10:01:09 - mmengine - INFO - Epoch(train) [105][4300/4853] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 2 days, 3:30:54 time: 0.121489 data_time: 0.018492 memory: 3826 loss: 0.210692 loss_kpt: 0.210692 acc_pose: 0.520829
10/13 10:01:15 - mmengine - INFO - Epoch(train) [105][4350/4853] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 2 days, 3:30:49 time: 0.130391 data_time: 0.027746 memory: 3826 loss: 0.207093 loss_kpt: 0.207093 acc_pose: 0.510383
10/13 10:01:21 - mmengine - INFO - Epoch(train) [105][4400/4853] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 2 days, 3:30:43 time: 0.121266 data_time: 0.018557 memory: 3826 loss: 0.208687 loss_kpt: 0.208687 acc_pose: 0.571073
10/13 10:01:27 - mmengine - INFO - Epoch(train) [105][4450/4853] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 2 days, 3:30:37 time: 0.120966 data_time: 0.018265 memory: 3826 loss: 0.207345 loss_kpt: 0.207345 acc_pose: 0.523733
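To judge whether acc_pose is really flat rather than just noisy between iterations, the values can be pulled out of the training log and averaged. The following is only a minimal sketch using the Python standard library: the regular expression is an assumption based on the log lines quoted above, and `parse_acc_pose` is a hypothetical helper, not part of MMPose or MMEngine.

```python
import re
from statistics import mean

# Assumed line shape, taken from the log excerpt in this issue:
# "... Epoch(train) [105][4200/4853] ... acc_pose: 0.455742"
LINE_RE = re.compile(
    r"Epoch\(train\)\s+\[(\d+)\]\[(\d+)/(\d+)\].*?acc_pose:\s*([0-9.]+)"
)


def parse_acc_pose(log_text):
    """Return (epoch, iteration, acc_pose) tuples found in the log text."""
    return [
        (int(epoch), int(iteration), float(acc))
        for epoch, iteration, _total, acc in LINE_RE.findall(log_text)
    ]


# Three lines copied from the log above.
sample_log = """\
10/13 10:00:56 - mmengine - INFO - Epoch(train) [105][4200/4853] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 2 days, 3:31:06 time: 0.126481 data_time: 0.023456 memory: 3826 loss: 0.207607 loss_kpt: 0.207607 acc_pose: 0.455742
10/13 10:01:03 - mmengine - INFO - Epoch(train) [105][4250/4853] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 2 days, 3:31:00 time: 0.122026 data_time: 0.019312 memory: 3826 loss: 0.207197 loss_kpt: 0.207197 acc_pose: 0.522144
10/13 10:01:09 - mmengine - INFO - Epoch(train) [105][4300/4853] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 2 days, 3:30:54 time: 0.121489 data_time: 0.018492 memory: 3826 loss: 0.210692 loss_kpt: 0.210692 acc_pose: 0.520829
"""

records = parse_acc_pose(sample_log)
mean_acc = mean(acc for _epoch, _iteration, acc in records)
print(f"{len(records)} records, mean acc_pose = {mean_acc:.4f}")
```

Running the same function over the full log and grouping the tuples by epoch would show whether the per-epoch mean is still moving or has stopped improving.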
Reproduces the problem - command or script
python train.py config configs/body_2d_keypoint/rtmpose/coco/rtmpose-l_8xb256-420e_coco-256x192.py
--resume work_dirs/cspnext-l_udp-aic-coco_210e-256x192-273b7631_20230130.pth
Reproduces the problem - error message
[4250/4853] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 2 days, 3:31:00 time: 0.122026 data_time: 0.019312 memory: 3826 loss: 0.207197 loss_kpt: 0.207197 acc_pose: 0.522144
Additional information
1. The dataset is the original COCO dataset plus 2,000 additional images.
2. I expected the performance after adding the data to be close to that of the originally released model.
3. I cannot work out where the problem is. I have verified the data and found no problems.
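For reference, the kind of check meant by "verified" can be sketched as follows. This is an illustrative sketch only: `find_annotation_problems` is a hypothetical helper, the tiny dictionary is made-up data rather than the real annotation file, and it assumes the standard COCO keypoint layout (17 keypoints stored as x, y, v triplets with v in {0, 1, 2}, and a `bbox` of `[x, y, width, height]`).

```python
def find_annotation_problems(coco, num_keypoints=17):
    """Return (annotation_id, message) pairs for annotations that look malformed."""
    problems = []
    image_ids = {img["id"] for img in coco.get("images", [])}
    for ann in coco.get("annotations", []):
        ann_id = ann.get("id")
        if ann.get("image_id") not in image_ids:
            problems.append((ann_id, "image_id not listed in images"))
        kpts = ann.get("keypoints", [])
        if len(kpts) != 3 * num_keypoints:
            problems.append(
                (ann_id, f"expected {3 * num_keypoints} keypoint values, got {len(kpts)}")
            )
            continue  # the remaining keypoint checks need a well-formed list
        visibility = kpts[2::3]  # every third value is the visibility flag
        if any(v not in (0, 1, 2) for v in visibility):
            problems.append((ann_id, "visibility flag outside {0, 1, 2}"))
        labeled = sum(1 for v in visibility if v > 0)
        if ann.get("num_keypoints") != labeled:
            problems.append((ann_id, "num_keypoints does not match labeled keypoints"))
        bbox = ann.get("bbox")
        if not bbox or len(bbox) != 4 or bbox[2] <= 0 or bbox[3] <= 0:
            problems.append((ann_id, "bbox missing or has non-positive size"))
    return problems


# Made-up example data: annotation 1 is valid, 2 and 3 are deliberately broken.
example = {
    "images": [{"id": 1}],
    "annotations": [
        {"id": 1, "image_id": 1, "keypoints": [10, 20, 2] * 17,
         "num_keypoints": 17, "bbox": [0, 0, 50, 100]},
        {"id": 2, "image_id": 1, "keypoints": [10, 20, 2] * 16,
         "num_keypoints": 16, "bbox": [0, 0, 50, 100]},
        {"id": 3, "image_id": 1, "keypoints": [10, 20, 2] * 17,
         "num_keypoints": 5, "bbox": [0, 0, 0, 10]},
    ],
}

problems = find_annotation_problems(example)
for ann_id, message in problems:
    print(ann_id, message)
```

In practice the dictionary would come from `json.load` on the merged annotation file. A check like this only covers the file format; it does not confirm that the keypoint order of the added images matches COCO's.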