MobileHumanPose
Difference between results from inference and the paper
First, thanks for your great work.
I trained the model using script 'python --gpu 0-1 --backbone LPSKI' with Human3.6M and MPII datasets. The protocol is 1 and train epoch was 25.
And I tested the model with and the result is like below : Protocol 1 error (PA MPJPE) >> tot: 42.72 Directions: 37.63 Discussion: 39.01 Eating: 45.51 Greeting: 43.06 Phoning: 41.33 Posing: 41.10 Purchases: 35.78 Sitting: 43.50 SittingDown: 57.36 Smoking: 47.08 Photo: 51.04 Waiting: 38.32 Walking: 30.94 WalkDog: 46.21 WalkTogether: 38.39
I found that the average MPJPE for Protocol 1 in the paper is 35.2, which differs from my result. Did I miss something needed to get the right result? Like other settings in
Also, my training time was 16 hours with an RTX 2080, while the training time in the paper is 3 days with 2 RTX Titans. So I also wonder what causes the time difference between my run and the paper.
Did you check that the is initially set to the extra-small model (I reported three model types: small, large, and extra small)? A training time of 16 hours also suggests that you used the extra-small model. Please let me know if you have further questions.
Did you mean embedding_size in to change the type? It is 2048 and I checked the large model uses 2048 embedding channels on paper.
The other settings in are as below:
class Config:
    ## dataset
    # training set
    # 3D: Human36M, MuCo
    trainset_3d = ['Human36M']
    trainset_2d = ['MPII']
    # testing set
    # Human36M, MuPoTS, MSCOCO
    testset = 'Human36M'
    ## directory
    cur_dir = osp.dirname(os.path.abspath(__file__))
    root_dir = osp.join(cur_dir, '..')
    data_dir = osp.join(root_dir, 'data')
    output_dir = osp.join(root_dir, 'output')
    model_dir = osp.join(output_dir, 'model_dump')
    pretrain_dir = osp.join(output_dir, 'pre_train')
    vis_dir = osp.join(output_dir, 'vis')
    log_dir = osp.join(output_dir, 'log')
    result_dir = osp.join(output_dir, 'result')
    ## input, output
    input_shape = (256, 256)
    output_shape = (input_shape[0]//8, input_shape[1]//8)
    width_multiplier = 1.0
    depth_dim = 32
    bbox_3d_shape = (2000, 2000, 2000) # depth, height, width
    pixel_mean = (0.485, 0.456, 0.406)
    pixel_std = (0.229, 0.224, 0.225)
    ## training config
    embedding_size = 2048
    lr_dec_epoch = [17, 21]
    end_epoch = 25
    lr = 1e-3
    lr_dec_factor = 10
    batch_size = 64
    ## testing config
    test_batch_size = 1
    flip_test = True
    use_gt_info = True
    ## others
    num_thread = 20
    gpu_ids = '0'
    num_gpus = 1
    continue_train = False
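For readers skimming the config: lr, lr_dec_epoch and lr_dec_factor together read like a step-decay schedule. A minimal sketch, assuming the rate is divided by lr_dec_factor once training reaches each epoch listed in lr_dec_epoch (the helper name lr_at_epoch and the exact boundary convention are assumptions, not taken from the repository):

```python
def lr_at_epoch(epoch, base_lr=1e-3, dec_epochs=(17, 21), dec_factor=10):
    """Hypothetical helper: step decay driven by the three config values."""
    # Count how many decay boundaries have been reached so far.
    n_decays = sum(1 for e in dec_epochs if epoch >= e)
    return base_lr / (dec_factor ** n_decays)


# Print the rate at a few representative epochs of a 25-epoch run.
for epoch in (0, 16, 17, 20, 21, 24):
    print(epoch, lr_at_epoch(epoch))
```

Under these assumptions the last four epochs of a 25-epoch run train at one hundredth of the initial rate.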
No, you should try changing depth_dim from 32 to 64. If an error shows up, try adjusting output_shape as well. The correct settings should probably be:
input_shape = (256, 256)
output_shape = (input_shape[0]//4, input_shape[1]//4)
width_multiplier = 1.0
depth_dim = 64
bbox_3d_shape = (2000, 2000, 2000) # depth, height, width
pixel_mean = (0.485, 0.456, 0.406)
pixel_std = (0.229, 0.224, 0.225)
It actually made an error like this :
File "/home/unolab/Yoonho/Pose/MobileHumanPose/main/", line 67, in forward
    loss_coord = torch.abs(coord - target_coord) * target_vis
RuntimeError: The size of tensor a (8) must match the size of tensor b (64) at non-singleton dimension 0
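One possible reading of this error, sketched with plain arithmetic: a reshape that uses -1 for the batch dimension absorbs any mismatch between the configured heatmap volume and the volume the network emits. Assuming the network head still produced a 32x32x32 volume per joint while the config asked for 64x64x64 (an assumption, as are joint_num=18 and the helper name inferred_batch), a batch of 64 would be silently reinterpreted as a batch of 8:

```python
from math import prod


def inferred_batch(batch, joint_num, emitted_dhw, configured_dhw):
    """Batch size that reshape(-1, joint_num, D*H*W) would infer."""
    # Total number of elements the network actually produced.
    total = batch * joint_num * prod(emitted_dhw)
    # Elements per sample according to the config values.
    per_sample = joint_num * prod(configured_dhw)
    if total % per_sample != 0:
        raise ValueError("element count is not divisible; reshape would fail")
    return total // per_sample


print(inferred_batch(64, 18, (32, 32, 32), (32, 32, 32)))  # 64, config matches
print(inferred_batch(64, 18, (32, 32, 32), (64, 64, 64)))  # 8, config too large
```

The 8-versus-64 sizes in the traceback are consistent with this reading, which would point at the model definition rather than the config as the part that needs to change.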
I changed the output shape to solve the error: output_shape = (input_shape[0]//(8*math.sqrt(2)), input_shape[1]//(8*math.sqrt(2)))
And it made another error:
File "/home/unolab/Yoonho/Pose/MobileHumanPose/main/", line 29, in soft_argmax
    heatmaps = heatmaps.reshape((-1, joint_num, cfg.depth_dim*cfg.output_shape[0]*cfg.output_shape[1]))
TypeError: reshape(): argument 'shape' must be tuple of ints, but found element of type float at pos 3
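The TypeError comes from Python itself rather than from the model: floor division by a float returns a float, so output_shape ends up holding floats. A minimal sketch of the behavior and of an explicit int cast (the variable names stride, float_shape and int_shape are illustrative):

```python
import math

input_shape = (256, 256)
stride = 8 * math.sqrt(2)  # a float, roughly 11.31

# Floor division by a float keeps the float type.
float_shape = (input_shape[0] // stride, input_shape[1] // stride)

# Casting each entry gives the ints that reshape() expects.
int_shape = tuple(int(s // stride) for s in input_shape)

print(float_shape)  # (22.0, 22.0)
print(int_shape)    # (22, 22)
```

Casting only removes the TypeError. The product depth_dim * output_shape[0] * output_shape[1] still has to match the volume the network actually emits.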
@unoShin I found a slight mismatch in the large model, so I uploaded a fix to the large branch for the skip-concat case. Please let me know if this doesn't work.
@SangbumChoi Thank you :)
I checked it works for training and it will take about 50 hours with RTX 2080. Thank you for your support!
@unoShin Sounds great. Please close this issue (otherwise I will shortly) when the score is similar to the paper :)
Protocol 1 error (PA MPJPE) >> tot: 40.13 Directions: 34.42 Discussion: 35.80 Eating: 44.37 Greeting: 41.69 Phoning: 38.69 Posing: 37.26 Purchases: 34.61 Sitting: 42.46 SittingDown: 54.30 Smoking: 42.58 Photo: 48.37 Waiting: 35.72 Walking: 29.49 WalkDog: 43.04 WalkTogether: 35.15
I trained the model with the new branch (large, 25 epochs) and got the result above. There is still a difference between 40.13 and 35.2 (in the paper) in MPJPE. The config is:
trainset_3d = ['Human36M']
trainset_2d = ['MPII']
# testing set
# Human36M, MuPoTS, MSCOCO
testset = 'Human36M'
## directory
cur_dir = osp.dirname(os.path.abspath(__file__))
root_dir = osp.join(cur_dir, '..')
data_dir = osp.join(root_dir, 'data')
output_dir = osp.join(root_dir, 'output')
model_dir = osp.join(output_dir, 'model_dump')
pretrain_dir = osp.join(output_dir, 'pre_train')
vis_dir = osp.join(output_dir, 'vis')
log_dir = osp.join(output_dir, 'log')
result_dir = osp.join(output_dir, 'result')
## input, output
input_shape = (256, 256)
output_shape = (input_shape[0]//4, input_shape[1]//4)
width_multiplier = 1.0
depth_dim = 64
bbox_3d_shape = (2000, 2000, 2000) # depth, height, width
pixel_mean = (0.485, 0.456, 0.406)
pixel_std = (0.229, 0.224, 0.225)
## training config
embedding_size = 2048
lr_dec_epoch = [17, 21]
end_epoch = 25
lr = 1e-3
lr_dec_factor = 10
batch_size = 16
## testing config
test_batch_size = 16
flip_test = True
use_gt_info = True
## others
num_thread = 20
gpu_ids = '0'
num_gpus = 1
continue_train = False
The protocol is 1 and the bbox root file is from Subject 11 (trained on subjects 1, 5, 6, 7, 8, 9). Did I do something wrong to get this result?
Sorry for the inconvenience. I committed every intermediate step to GitHub, so I just need to find the right past commit. I will find the appropriate large-model code for everyone. One thing that might matter is batch_size, which depends on each person's GPU setup.
One thing you can check right now is whether the extra-small model scores the same as in the paper.
I will let you know if I find it.
@unoShin Can you try commit 70baeafff0d57ab74a72abedb30c12e739da18ec
@SangbumChoi I will try and let you know. Thank you.
@SangbumChoi Is there only difference in line 130 of
09-02 09:52:46 Protocol 1 error (PA MPJPE) >> tot: 40.21 Directions: 36.56 Discussion: 37.00 Eating: 42.51 Greeting: 41.39 Phoning: 38.17 Posing: 36.55 Purchases: 36.60 Sitting: 42.26 SittingDown: 55.09 Smoking: 41.85 Photo: 48.03 Waiting: 36.51 Walking: 29.55 WalkDog: 43.62 WalkTogether: 35.50
Using that commit, the result still differs from the result in the paper.
@unoShin Hi, I have two questions for you:
- Did you use exactly the same commit that I told you?
- What was your batch size? And which 2D dataset did you use with Human3.6M?
If both answers seem reasonable, then I will re-train my code and announce the result. It might take more than one week.
@SangbumChoi Hi,
Yes, I used this version : So I asked you whether the difference is only line 130 of between large branch and 70baeaf.
Batch size is 8 and the 2D dataset is MPII. My training time is 2.21 hours/epoch with an RTX 2080, and the total number of epochs is 25.
@unoShin I'm a little concerned that your batch size is different from the original paper and code, but let me re-check and share with you. Again, this might take some time.
@SangbumChoi Thanks for your support :)
Hi bro, I trained with Human3.6M and MPII but get an error of 400+, while the 2D visualization output does not look bad. I did not build a bbox root file and used the GT bbox. I want to figure out why I get such a wrong error. How do you generate the bbox root file?
@ggfresh An error above 400 might be caused by the old branch (see this issue). Also, if the image files are already cropped, then you don't actually have to build GT and root bboxes. However, you can generate the bbox root file with an object detector or RootNet (
thanks for the reply, which issue?
@SangbumChoi When I test epoch 24, the error is large.
I have the same problem, although I think my training loss is normal.
I used epochs 24-24.
@ggfresh That is strange, since this open issue already reports results of around 40 MPJPE. Your description lacks the information needed to debug and find the error. Since you said training seems normal, you might want to actually display the output jpg files.
Sorry, after checking, it is found that my own data is inconsistent.
@unoShin I found a slight mismatch in the large model, so I uploaded a fix to the large branch for the skip-concat case. Please let me know if this doesn't work.
What was uploaded to the large branch for the skip-concat case? I couldn't find it in the code.