deep-head-pose How yaw, roll and pitch are calculated for the landmark-based networks?

Hi and thanks for the great work :)

I have a question about the the table 1 and 2 in your paper. Some networks like 3DDFA and FAN can only detect 68 landmarks on the image. Could you please explain how you used that information to calculate yaw, roll, pitch? Is there a paper or mathematical way as a source to read?

Many thanks!

Apr 24 '19 13:04 cnaaq

landmarks_68_3D = np.array( [
[-73.393523  , -29.801432   , 47.667532   ],
[-72.775014  , -10.949766   , 45.909403   ],
[-70.533638  , 7.929818     , 44.842580   ],
[-66.850058  , 26.074280    , 43.141114   ],
[-59.790187  , 42.564390    , 38.635298   ],
[-48.368973  , 56.481080    , 30.750622   ],
[-34.121101  , 67.246992    , 18.456453   ],
[-17.875411  , 75.056892    , 3.609035    ],
[0.098749    , 77.061286    , -0.881698   ],
[17.477031   , 74.758448    , 5.181201    ],
[32.648966   , 66.929021    , 19.176563   ],
[46.372358   , 56.311389    , 30.770570   ],
[57.343480   , 42.419126    , 37.628629   ],
[64.388482   , 25.455880    , 40.886309   ],
[68.212038   , 6.990805     , 42.281449   ],
[70.486405   , -11.666193   , 44.142567   ],
[71.375822   , -30.365191   , 47.140426   ],
[-61.119406  , -49.361602   , 14.254422   ],
[-51.287588  , -58.769795   , 7.268147    ],
[-37.804800  , -61.996155   , 0.442051    ],
[-24.022754  , -61.033399   , -6.606501   ],
[-11.635713  , -56.686759   , -11.967398  ],
[12.056636   , -57.391033   , -12.051204  ],
[25.106256   , -61.902186   , -7.315098   ],
[38.338588   , -62.777713   , -1.022953   ],
[51.191007   , -59.302347   , 5.349435    ],
[60.053851   , -50.190255   , 11.615746   ],
[0.653940    , -42.193790   , -13.380835  ],
[0.804809    , -30.993721   , -21.150853  ],
[0.992204    , -19.944596   , -29.284036  ],
[1.226783    , -8.414541    , -36.948060  ],
[-14.772472  , 2.598255     , -20.132003  ],
[-7.180239   , 4.751589     , -23.536684  ],
[0.555920    , 6.562900     , -25.944448  ],
[8.272499    , 4.661005     , -23.695741  ],
[15.214351   , 2.643046     , -20.858157  ],
[-46.047290  , -37.471411   , 7.037989    ],
[-37.674688  , -42.730510   , 3.021217    ],
[-27.883856  , -42.711517   , 1.353629    ],
[-19.648268  , -36.754742   , -0.111088   ],
[-28.272965  , -35.134493   , -0.147273   ],
[-38.082418  , -34.919043   , 1.476612    ],
[19.265868   , -37.032306   , -0.665746   ],
[27.894191   , -43.342445   , 0.247660    ],
[37.437529   , -43.110822   , 1.696435    ],
[45.170805   , -38.086515   , 4.894163    ],
[38.196454   , -35.532024   , 0.282961    ],
[28.764989   , -35.484289   , -1.172675   ],
[-28.916267  , 28.612716    , -2.240310   ],
[-17.533194  , 22.172187    , -15.934335  ],
[-6.684590   , 19.029051    , -22.611355  ],
[0.381001    , 20.721118    , -23.748437  ],
[8.375443    , 19.035460    , -22.721995  ],
[18.876618   , 22.394109    , -15.610679  ],
[28.794412   , 28.079924    , -3.217393   ],
[19.057574   , 36.298248    , -14.987997  ],
[8.956375    , 39.634575    , -22.554245  ],
[0.381549    , 40.395647    , -23.591626  ],
[-7.428895   , 39.836405    , -22.406106  ],
[-18.160634  , 36.677899    , -15.121907  ],
[-24.377490  , 28.677771    , -4.785684   ],
[-6.897633   , 25.475976    , -20.893742  ],
[0.340663    , 26.014269    , -22.220479  ],
[8.444722    , 25.326198    , -21.025520  ],
[24.474473   , 28.323008    , -5.712776   ],
[8.449166    , 30.596216    , -20.671489  ],
[0.205322    , 31.408738    , -21.903670  ],
[-7.198266   , 30.844876    , -20.328022  ] ], dtype=np.float32)

def rotationMatrixToEulerAngles(R) :
    sy = math.sqrt(R[0,0] * R[0,0] +  R[1,0] * R[1,0])
    singular = sy < 1e-6
    if  not singular :
        x = math.atan2(R[2,1] , R[2,2])
        y = math.atan2(-R[2,0], sy)
        z = math.atan2(R[1,0], R[0,0])
    else :
        x = math.atan2(-R[1,2], R[1,1])
        y = math.atan2(-R[2,0], sy)
        z = 0
    return np.array([x, y, z])

#returns pitch,yaw,roll [-1...+1]
def estimate_pitch_yaw_roll(aligned_256px_landmarks):
    shape = (256,256)
    focal_length = shape[1]
    camera_center = (shape[1] / 2, shape[0] / 2)
    camera_matrix = np.array(
        [[focal_length, 0, camera_center[0]],
         [0, focal_length, camera_center[1]],
         [0, 0, 1]], dtype=np.float32)

    (_, rotation_vector, translation_vector) = cv2.solvePnP(
        landmarks_68_3D,
        aligned_256px_landmarks.astype(np.float32),
        camera_matrix,
        np.zeros((4, 1)) )

    pitch, yaw, roll = rotationMatrixToEulerAngles( cv2.Rodrigues(rotation_vector)[0] )
    pitch = np.clip ( pitch*1.25, -1.0, 1.0 )
    yaw = np.clip ( yaw*1.25, -1.0, 1.0 )
    roll = np.clip ( roll*1.25, -1.0, 1.0 )
    return pitch, yaw, roll

Apr 25 '19 04:04 iperov

Hi @iperov and thank you very much for your reply. I have two questions:

how did you make the camera_matrix? It should be different from case to case, right? or am I mistaken?
what is aligned_256px_landmarks argument?

Thanks in advance!

Apr 25 '19 15:04 cnaaq

its my function in Deepfacelab project. aligned_256px_landmarks is landmarks from aligned face in 256x256 image.

of course you may not to align.

Apr 25 '19 15:04 iperov

can we say that aligned_256px_landmarks == landmarks_68_3D[ 0:1, : ] ? just omitting z information

sorry but I didn't get the answer to camera_matrix. Is it only true for your camera?

Apr 25 '19 16:04 cnaaq

nevermind, use any landmarks you want

i have no camera, I am using faces from CelebA dataset.

Apr 25 '19 16:04 iperov

OK, I got the theory behind the landmarks 3D and 2D Just one more question: what is the idea behind this line? pitch = np.clip ( pitch*1.25, -1.0, 1.0 ) yaw = np.clip ( yaw*1.25, -1.0, 1.0 ) roll = np.clip ( roll*1.25, -1.0, 1.0 ) why angles between -1 & 1?

May 06 '19 10:05 cnaaq

fix it for yourself

May 06 '19 10:05 iperov

deep-head-pose deep-head-pose copied to clipboard

How yaw, roll and pitch are calculated for the landmark-based networks?

deep-head-pose
deep-head-pose copied to clipboard