mmpose [Bug] I used the COCO dataset to reproduce RTMPose. After 120 epochs of training, acc_pose is still hovering between 0.4 and 0.5. How should I solve this?

Open goalinshi opened this issue 1 year ago • 0 comments

Prerequisite

[X] I have searched Issues and Discussions but cannot get the expected help.
[X] The bug has not been fixed in the latest version(https://github.com/open-mmlab/mmpose).

Environment

OrderedDict([('sys.platform', 'linux'), ('Python', '3.8.20 (default, Oct 3 2024, 15:24:27) [GCC 11.2.0]'), ('CUDA available', True), ('MUSA available', False), ('numpy_random_seed', 2147483648), ('GPU 0', 'NVIDIA GeForce RTX 3090'), ('CUDA_HOME', '/usr/local/cuda-11.8'), ('NVCC', 'Cuda compilation tools, release 11.8, V11.8.89'), ('GCC', 'gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0'), ('PyTorch', '2.0.1+cu117'), ('PyTorch compiling details', 'PyTorch built with:\n - GCC 9.3\n - C++ Version: 201703\n - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications\n - Intel(R) MKL-DNN v2.7.3 (Git Hash 6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)\n - OpenMP 201511 (a.k.a. OpenMP 4.5)\n - LAPACK is enabled (usually provided by MKL)\n - NNPACK is enabled\n - CPU capability usage: AVX2\n - CUDA Runtime 11.7\n - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86\n - CuDNN 8.5\n - Magma 2.6.1\n - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.7, CUDNN_VERSION=8.5.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic 
-Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.0.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, \n'), ('TorchVision', '0.15.2+cu117'), ('OpenCV', '4.10.0'), ('MMEngine', '0.10.5'), ('MMPose', '1.1.0+')])

Reproduces the problem - code sample

0.210927 loss_kpt: 0.210927 acc_pose: 0.470607
10/13 10:00:56 - mmengine - INFO - Epoch(train) [105][4200/4853] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 2 days, 3:31:06 time: 0.126481 data_time: 0.023456 memory: 3826 loss: 0.207607 loss_kpt: 0.207607 acc_pose: 0.455742
10/13 10:01:03 - mmengine - INFO - Epoch(train) [105][4250/4853] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 2 days, 3:31:00 time: 0.122026 data_time: 0.019312 memory: 3826 loss: 0.207197 loss_kpt: 0.207197 acc_pose: 0.522144
10/13 10:01:07 - mmengine - INFO - Exp name: rtmpose-l_8xb256-420e_coco-256x192_20241012_164653
10/13 10:01:09 - mmengine - INFO - Epoch(train) [105][4300/4853] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 2 days, 3:30:54 time: 0.121489 data_time: 0.018492 memory: 3826 loss: 0.210692 loss_kpt: 0.210692 acc_pose: 0.520829
10/13 10:01:15 - mmengine - INFO - Epoch(train) [105][4350/4853] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 2 days, 3:30:49 time: 0.130391 data_time: 0.027746 memory: 3826 loss: 0.207093 loss_kpt: 0.207093 acc_pose: 0.510383
10/13 10:01:21 - mmengine - INFO - Epoch(train) [105][4400/4853] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 2 days, 3:30:43 time: 0.121266 data_time: 0.018557 memory: 3826 loss: 0.208687 loss_kpt: 0.208687 acc_pose: 0.571073
10/13 10:01:27 - mmengine - INFO - Epoch(train) [105][4450/4853] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 2 days, 3:30:37 time: 0.120966 data_time: 0.018265 memory: 3826 loss: 0.207345 loss_kpt: 0.207345 acc_pose: 0.523733
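To put a number on how flat the metric is across such a log, the entries could be summarized with a short script. This is only a minimal sketch and not part of MMPose or MMEngine: the function name `summarize_log` and the regular expressions are assumptions based on the field layout visible in the excerpt above, and the sample text reuses two entries from it.

```python
import re
from statistics import mean

# Patterns assumed from the field layout in the log excerpt above.
ACC_RE = re.compile(r"acc_pose:\s*([0-9]*\.?[0-9]+)")
LOSS_RE = re.compile(r"loss_kpt:\s*([0-9]*\.?[0-9]+)")


def summarize_log(text):
    """Collect acc_pose and loss_kpt values from log text and summarize them."""
    accs = [float(v) for v in ACC_RE.findall(text)]
    losses = [float(v) for v in LOSS_RE.findall(text)]
    if not accs:
        raise ValueError("no acc_pose entries found in the log text")
    return {
        "n": len(accs),
        "acc_min": min(accs),
        "acc_max": max(accs),
        "acc_mean": mean(accs),
        "loss_kpt_mean": mean(losses) if losses else None,
    }


# Two entries copied from the excerpt above (leading fields shortened).
sample = (
    "Epoch(train) [105][4200/4853] loss: 0.207607 loss_kpt: 0.207607 acc_pose: 0.455742\n"
    "Epoch(train) [105][4250/4853] loss: 0.207197 loss_kpt: 0.207197 acc_pose: 0.522144\n"
)

summary = summarize_log(sample)
print(summary)
```

Running the same function over the full log file of each epoch would show whether acc_pose and loss_kpt are changing at all between early and late epochs.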

Reproduces the problem - command or script

python train.py config configs/body_2d_keypoint/rtmpose/coco/rtmpose-l_8xb256-420e_coco-256x192.py \
    --resume work_dirs/cspnext-l_udp-aic-coco_210e-256x192-273b7631_20230130.pth

Reproduces the problem - error message

[4250/4853] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 2 days, 3:31:00 time: 0.122026 data_time: 0.019312 memory: 3826 loss: 0.207197 loss_kpt: 0.207197 acc_pose: 0.522144

Additional information

1. The dataset is the original COCO dataset plus 2000 additional images.
2. I expected the performance after adding the data to be close to that of the originally released model.
3. I cannot work out where the problem is. I have verified the data and found no problems.
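For readers who want to repeat the data verification on the added images, one possible check of a single COCO-style keypoint annotation is sketched below. It is not the check the reporter ran, and the helper name `check_keypoint_annotation` is hypothetical. It assumes the standard COCO person layout: 17 keypoints stored as (x, y, v) triplets with v in {0, 1, 2}, `num_keypoints` equal to the number of keypoints with v > 0, and `bbox` given as [x, y, width, height].

```python
def check_keypoint_annotation(ann, num_keypoints=17):
    """Return a list of problems found in one COCO-style keypoint annotation."""
    problems = []

    kpts = ann.get("keypoints", [])
    if len(kpts) != 3 * num_keypoints:
        problems.append(
            f"expected {3 * num_keypoints} keypoint values, got {len(kpts)}"
        )
        return problems  # the remaining checks need a well-formed list

    # Every third value is the visibility flag v.
    vis = kpts[2::3]
    if any(v not in (0, 1, 2) for v in vis):
        problems.append("visibility flag outside {0, 1, 2}")

    labeled = sum(1 for v in vis if v > 0)
    if ann.get("num_keypoints") != labeled:
        problems.append(
            f"num_keypoints={ann.get('num_keypoints')} but {labeled} are labeled"
        )

    bbox = ann.get("bbox", [])
    if len(bbox) != 4 or bbox[2] <= 0 or bbox[3] <= 0:
        problems.append("bbox missing or has non-positive width/height")

    return problems


# Illustrative annotations (made up for this sketch, not taken from the dataset).
good = {
    "keypoints": [10.0, 20.0, 2] * 17,
    "num_keypoints": 17,
    "bbox": [5.0, 5.0, 50.0, 80.0],
}
bad = {
    "keypoints": [10.0, 20.0, 2] * 16 + [0, 0, 0],
    "num_keypoints": 17,  # inconsistent: only 16 keypoints are labeled
    "bbox": [5.0, 5.0, 0.0, 80.0],
}

print(check_keypoint_annotation(good))
print(check_keypoint_annotation(bad))
```

Applying a check like this to every annotation of the 2000 added images would surface keypoint-order or count mismatches that a visual inspection can miss.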

Oct 13 '24 02:10 goalinshi
