Azure-Kinect-Sensor-SDK

Coordinate transformation issue

Open jean1zhou opened this issue 4 years ago • 8 comments

Hello there, I'm working on some projects with the Azure Kinect, and these days I'm testing the accuracy of body tracking. Here is the method I use:

  1. We record videos of fake body models, then extract frames from the videos and label the joints manually.
  2. We run the Body Tracking SDK on the recordings to get each joint's XYZ coordinates.
  3. We take the labeled RGB pixel coordinates as input and use k4a_transformation_depth_image_to_color_camera and k4a_calibration_2d_to_3d to transform each pixel coordinate into a 3D coordinate (see the sketch below). Finally we compare the two sets of values.
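
A minimal editorial sketch of step 3 (not our exact code; it assumes the calibration was read from the device or recording, depth_image is the raw DEPTH16 frame, and error handling is omitted):

    #include <k4a/k4a.h>
    #include <stdint.h>
    #include <stdio.h>

    // Sketch: unproject one labeled color pixel (px, py) to a 3D point in the
    // depth camera coordinate system (the space body tracking joints live in).
    void labeled_pixel_to_3d(const k4a_calibration_t *calibration,
                             k4a_image_t depth_image, float px, float py)
    {
        int w = calibration->color_camera_calibration.resolution_width;
        int h = calibration->color_camera_calibration.resolution_height;

        // 1. Re-project the depth map into the color camera geometry, so every
        //    color pixel gets a depth value.
        k4a_transformation_t t = k4a_transformation_create(calibration);
        k4a_image_t depth_in_color = NULL;
        k4a_image_create(K4A_IMAGE_FORMAT_DEPTH16, w, h,
                         w * (int)sizeof(uint16_t), &depth_in_color);
        k4a_transformation_depth_image_to_color_camera(t, depth_image, depth_in_color);

        // 2. Read the depth (mm) under the labeled pixel (stride is w * 2 here
        //    because we created the image with that stride).
        const uint16_t *buf =
            (const uint16_t *)(const void *)k4a_image_get_buffer(depth_in_color);
        uint16_t depth_mm = buf[(int)py * w + (int)px];

        // 3. Unproject: color pixel + depth -> XYZ in the depth camera space.
        k4a_float2_t p2d = { { px, py } };
        k4a_float3_t p3d;
        int valid = 0;
        k4a_calibration_2d_to_3d(calibration, &p2d, (float)depth_mm,
                                 K4A_CALIBRATION_TYPE_COLOR,
                                 K4A_CALIBRATION_TYPE_DEPTH, &p3d, &valid);
        if (valid && depth_mm != 0)
            printf("XYZ[mm]: %.1f %.1f %.1f\n", p3d.xyz.x, p3d.xyz.y, p3d.xyz.z);

        k4a_image_release(depth_in_color);
        k4a_transformation_destroy(t);
    }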

But the result we get is not that good. Here is the joint information from the Body Tracking SDK:

Joint[0]: Position[mm]( -309.316, -55.1283, 1671.32 ), Confidence Level: 2
Joint[1]: Position[mm]( -118.537, -38.9546, 1654.44 ), Confidence Level: 2
Joint[2]: Position[mm]( 34.537, -27.0045, 1648.4 ), Confidence Level: 2
Joint[3]: Position[mm]( 265.414, -4.12863, 1618.72 ), Confidence Level: 2
Joint[4]: Position[mm]( 224.833, 27.8237, 1629.71 ), Confidence Level: 2
Joint[5]: Position[mm]( 179.943, 174.777, 1640.14 ), Confidence Level: 2
Joint[6]: Position[mm]( -80.5916, 316.022, 1668.55 ), Confidence Level: 2
Joint[7]: Position[mm]( -281.283, 300.754, 1520.47 ), Confidence Level: 2
Joint[8]: Position[mm]( -329.715, 298.246, 1428.76 ), Confidence Level: 2
Joint[9]: Position[mm]( -391.972, 278.401, 1332.42 ), Confidence Level: 2
Joint[10]: Position[mm]( -391.598, 295.947, 1460.87 ), Confidence Level: 2
Joint[11]: Position[mm]( 230.003, -43.8087, 1619.77 ), Confidence Level: 2
Joint[12]: Position[mm]( 221.758, -180.513, 1583.47 ), Confidence Level: 2
Joint[13]: Position[mm]( 43.188, -424.398, 1557.9 ), Confidence Level: 2
Joint[14]: Position[mm]( -171.535, -405.077, 1425.42 ), Confidence Level: 2
Joint[15]: Position[mm]( -264.369, -408.492, 1363.39 ), Confidence Level: 2
Joint[16]: Position[mm]( -318.712, -320.925, 1313.67 ), Confidence Level: 2
Joint[17]: Position[mm]( -269.369, -333.964, 1391.64 ), Confidence Level: 2
Joint[18]: Position[mm]( -316.369, 41.9458, 1684.62 ), Confidence Level: 2
Joint[19]: Position[mm]( -708.604, 58.7787, 1506.05 ), Confidence Level: 2
Joint[20]: Position[mm]( -1107.46, 11.7344, 1598.79 ), Confidence Level: 2
Joint[21]: Position[mm]( -1284.57, 36.9256, 1505.34 ), Confidence Level: 2
Joint[22]: Position[mm]( -302.956, -142.664, 1659.32 ), Confidence Level: 2
Joint[23]: Position[mm]( -680.931, -194.054, 1459.04 ), Confidence Level: 2
Joint[24]: Position[mm]( -1083.1, -236.503, 1558.78 ), Confidence Level: 2
Joint[25]: Position[mm]( -1246.95, -252.316, 1472.83 ), Confidence Level: 2
Joint[26]: Position[mm]( 349.1, 6.18479, 1593.51 ), Confidence Level: 2
Joint[27]: Position[mm]( 378.135, 48.3661, 1428.14 ), Confidence Level: 2
Joint[28]: Position[mm]( 413.125, 74.4936, 1469.73 ), Confidence Level: 2
Joint[29]: Position[mm]( 399.255, 106.994, 1601.65 ), Confidence Level: 2
Joint[30]: Position[mm]( 419.419, 20.894, 1453.59 ), Confidence Level: 2
Joint[31]: Position[mm]( 433.528, -64.5769, 1565.04 ), Confidence Level: 2

And here are the transformed labeled coordinates:

Joint[0]: Position[mm]( -274.596, -12.1681, 1451.54 ) org_depth: 1449
Joint[1]: Position[mm]( -161.392, -17.9969, 1546.41 ) org_depth: 1544
Joint[2]: Position[mm]( 47.4067, -1.64799, 2177.4 ) org_depth: 2170
Joint[3]: Position[mm]( 31.8102, 2.37145, -3.81127 ) org_depth: 0
Joint[4]: Position[mm]( 303.621, 171.127, 2752.21 ) org_depth: 2724
Joint[5]: Position[mm]( 157.824, 213.137, 1800.51 ) org_depth: 1773
Joint[6]: Position[mm]( -121.214, 578.309, 2785.5 ) org_depth: 2715
Joint[7]: Position[mm]( -584.836, 571.443, 2781.91 ) org_depth: 2712
Joint[8]: Position[mm]( 31.8102, 2.37145, -3.81127 ) org_depth: 0
Joint[9]: Position[mm]( 31.8102, 2.37145, -3.81127 ) org_depth: 0
Joint[10]: Position[mm]( -797.593, 509.78, 2810.77 ) org_depth: 2747
Joint[11]: Position[mm]( 210.646, -66.7472, 1744.33 ) org_depth: 1746
Joint[12]: Position[mm]( 200.744, -177.343, 1553.91 ) org_depth: 1568
Joint[13]: Position[mm]( 99.0491, -401.744, 1433.16 ) org_depth: 1471
Joint[14]: Position[mm]( -333.32, -790.917, 2644.61 ) org_depth: 2716
Joint[15]: Position[mm]( -450.385, -751.596, 2587.39 ) org_depth: 2655
Joint[16]: Position[mm]( -647.554, -758.225, 2648.09 ) org_depth: 2716
Joint[17]: Position[mm]( 31.8102, 2.37145, -3.81127 ) org_depth: 0
Joint[18]: Position[mm]( -263.283, 95.6607, 1387.31 ) org_depth: 1374
Joint[19]: Position[mm]( -720.545, 38.9658, 1416.76 ) org_depth: 1409
Joint[20]: Position[mm]( -1269.49, 50.1684, 1697.57 ) org_depth: 1687
Joint[21]: Position[mm]( -1483.72, 63.2647, 1681.9 ) org_depth: 1670
Joint[22]: Position[mm]( -272.131, -127.709, 1427.49 ) org_depth: 1437
Joint[23]: Position[mm]( -743.452, -177.219, 1438.59 ) org_depth: 1453
Joint[24]: Position[mm]( 31.8102, 2.37145, -3.81127 ) org_depth: 0
Joint[25]: Position[mm]( -1539.91, -317.924, 1716.79 ) org_depth: 1744
Joint[26]: Position[mm]( 331.147, 33.2645, 1467.13 ) org_depth: 1460
Joint[27]: Position[mm]( 441.111, 47.377, 1552 ) org_depth: 1543
Joint[28]: Position[mm]( 433.547, 78.0642, 1456.66 ) org_depth: 1445
Joint[29]: Position[mm]( 31.8102, 2.37145, -3.81127 ) org_depth: 0
Joint[30]: Position[mm]( 452.54, 15.1049, 1498.39 ) org_depth: 1493
Joint[31]: Position[mm]( 475.026, -62.5779, 1701.45 ) org_depth: 1703

We can understand the zero values: at those positions the depth value is probably invalid. But the values over 2000 mm really confuse us, so could you please give us some information about that? Our code is based on the fastpointcloud example. Our last question is about the confidence level: we would like to know the details, for instance the relationship between mAP and confidence level. Thanks!

jean1zhou avatar May 15 '20 03:05 jean1zhou

I would need to see photos of your setup to comment on the values >2000 mm.

Confidence levels are:

NONE - Joint is out of range. Only applies to hand joints.
LOW - Joint was not returned by the DNN. Human model fitting predicted the joint location.
MEDIUM - Joint was returned by the DNN.
HIGH - Reserved.

We are investigating algorithms to differentiate the accuracy numbers across MEDIUM and HIGH, i.e. HIGH = error < X cm and MEDIUM = error > X cm.

qm13 avatar May 21 '20 16:05 qm13


Thanks for replying. We used k4arecorder to record the videos, and the setup is 720P, 15 FPS, WFOV unbinned. We used MKVToolNix and FFmpeg to extract the RGB video and RGB frames for labeling. I reviewed the documentation on the coordinate systems, and we got confused about the 2D coordinate system. The coordinates we get are pixel coordinates, but the camera's 2D coordinates seem to be different from pixel coordinates, because pixel coordinates are integers. If they are different, is there any way to convert between them?
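
For reference, an editorial sketch (untested) of the device configuration equivalent to that capture mode; the MJPG color format is an assumption about k4arecorder's default:

    #include <k4a/k4a.h>

    // Sketch: a device configuration matching the recording settings described
    // above: 720p color, WFOV unbinned depth, 15 FPS. Note that WFOV unbinned
    // supports at most 15 FPS, which matches this setup.
    static k4a_device_configuration_t make_recording_config(void)
    {
        k4a_device_configuration_t config = K4A_DEVICE_CONFIG_INIT_DISABLE_ALL;
        config.color_format = K4A_IMAGE_FORMAT_COLOR_MJPG; // assumed k4arecorder default
        config.color_resolution = K4A_COLOR_RESOLUTION_720P;
        config.depth_mode = K4A_DEPTH_MODE_WFOV_UNBINNED;
        config.camera_fps = K4A_FRAMES_PER_SECOND_15;
        config.synchronized_images_only = true;
        return config;
    }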

jean1zhou avatar May 25 '20 02:05 jean1zhou

@jean1zhou I am not fully understanding your issue and have some additional clarifying questions.

Are you recording both RGB and depth in the videos?

What do you mean by joint labels? Is that the skeleton joints?

By xyz coordinates are you referring to the skeleton joint coordinates as returned by body tracking SDK?

What do you mean by the labeled pixel coordinates from RGB?

k4a_transformation_depth_image_to_color_camera takes the depth map from the depth sensor and computes a depth map for the RGB sensor. That means for each RGB pixel, now you have a corresponding depth value. How does this RGB depth map compare to the depth map from the depth sensor?

k4a_calibration_2d_to_3d takes one RGB pixel and one RGB depth value from the depth map obtained above to map that pixel back to the XYZ in the depth camera space. The z value obtained would be within a few mm from the input depth.

Which RGB 2d pixels are transformed back to 3D? You do not mention using the reverse API: k4a_calibration_3d_to_2d to project the XYZ joints into RGB pixels. Are you choosing different pixels?
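
For concreteness, a minimal editorial sketch of that reverse projection, assuming the joints are expressed in the depth camera coordinate system (where the Body Tracking SDK reports them):

    #include <k4a/k4a.h>
    #include <stdio.h>

    // Sketch: project one body-tracking joint (depth camera space, millimeters)
    // into a color image pixel. Error handling omitted.
    void project_joint_to_rgb(const k4a_calibration_t *calibration, k4a_float3_t joint_mm)
    {
        k4a_float2_t pixel;
        int valid = 0;
        k4a_calibration_3d_to_2d(calibration, &joint_mm,
                                 K4A_CALIBRATION_TYPE_DEPTH,  // joints live in depth camera space
                                 K4A_CALIBRATION_TYPE_COLOR,  // project into the RGB image
                                 &pixel, &valid);
        if (valid)
            printf("RGB pixel: (%.3f, %.3f)\n", pixel.xy.x, pixel.xy.y);
    }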

2d coordinates are in pixels as described here: https://docs.microsoft.com/en-us/azure/kinect-dk/coordinate-systems.

qm13 avatar Jun 11 '20 01:06 qm13


I'm sorry I didn't clarify the questions.

  1. We recorded both RGB and depth in the videos using k4arecorder, then used the playback function and the Body Tracking SDK to track those videos. The output is every skeleton joint's XYZ coordinates.

  2. Then I used MKVToolNix to split the RGB video out of the original recordings, because the original files contain both an RGB track and a depth track. After that I used FFmpeg to extract RGB frames from the split RGB video, frame by frame.

  3. The purpose of extracting RGB frames is to label the 32 skeleton joints manually: we want to compare the joint coordinates from the Body Tracking SDK with the joint coordinates we labeled manually, to check the precision.

  4. To compare those coordinates we have to transform the manually labeled coordinates to 3D, because they are pixel coordinates while the coordinates from the Body Tracking SDK are 3D.

  5. We use the manually labeled coordinates as input, then apply k4a_transformation_depth_image_to_color_camera and k4a_calibration_2d_to_3d to transform the 2D coordinates to 3D.

  6. We successfully transformed the 2D coordinates into 3D, but the result is not that good, and we have no idea why. You can refer to the coordinate information above.

  7. Finally we tried k4a_calibration_3d_to_2d to project the XYZ joints into RGB pixels. The result is much better: the gap between the joint coordinates we labeled and the projected joint coordinates from the Body Tracking SDK is not that big. I copy the numbers below.

The coordinates (x,y) we labeled manually from RGB Photo:

510.323915, 441.259191
553.108201, 439.053309
614.417643, 439.494485
677.415946, 432.958884
656.760855, 467.288603
648.821502, 503.465074
595.892488, 559.494485
576.485183, 598.759191
547.815300, 613.759191
515.616816, 600.965074
546.933150, 583.759191
657.201930, 398.465074
662.053756, 360.965074
681.461061, 275.376838
681.019986, 303.612132
663.376981, 314.200368
644.410751, 296.112132
644.410751, 318.612132
512.529290, 490.670956
307.870434, 470.376838
165.403170, 458.023897
104.975879, 469.494485
511.206065, 392.288603
302.136458, 406.847426
158.345968, 407.288603
88.912189, 388.688811
690.747051, 435.262238
723.460026, 432.325175
741.494102, 440.716783
729.750983, 458.758741
736.041940, 415.541958
723.879423, 392.465035

The coordinates (x,y) from Body Tracking SDK transformed to 2D:

477.577, 422.961
541.43, 423.25
593.115, 423.923
665.82, 426.89
653.569, 438.127
640.078, 486.232
577.384, 564.553
542.743, 605.35
519.904, 604.048
491.604, 604.738
495.665, 603.196
654.563, 414.718
653.344, 369.738
677.487, 275.428
727.215, 269.173
762.261, 262.714
773.116, 255.782
754.914, 292.675
475.657, 456.271
326.204, 459.947
194.823, 459.723
101.325, 466.773
479.209, 393.465
323.128, 391.36
194.864, 397.834
110.253, 395.488
693.244, 428.248
712.914, 427.319
724.93, 436.414
717, 456.322
722.744, 417.191
714.174, 398.882

The first problem: we used k4a_transformation_depth_image_to_color_camera, which means each color pixel now has a corresponding depth value. We want to extract the depth of the pixels we are interested in, for example pixel (100, 100), but so far we have not found a function that directly extracts or outputs the depth value of a given pixel (see the sketch below).
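
There is no dedicated per-pixel accessor in the SDK; the usual approach is to read the DEPTH16 buffer directly. A minimal editorial sketch, assuming transformed_depth is the image produced by k4a_transformation_depth_image_to_color_camera:

    #include <k4a/k4a.h>
    #include <stddef.h>
    #include <stdint.h>

    // Read the depth (millimeters) at color pixel (x, y) from the transformed
    // depth image. DEPTH16 stores one little-endian uint16 per pixel.
    // Returns 0 if the pixel is out of bounds or has no valid depth.
    uint16_t depth_at_pixel(k4a_image_t transformed_depth, int x, int y)
    {
        int width  = k4a_image_get_width_pixels(transformed_depth);
        int height = k4a_image_get_height_pixels(transformed_depth);
        if (x < 0 || x >= width || y < 0 || y >= height)
            return 0;

        int stride = k4a_image_get_stride_bytes(transformed_depth);
        const uint8_t *buffer = k4a_image_get_buffer(transformed_depth);
        const uint16_t *row = (const uint16_t *)(const void *)(buffer + (size_t)y * stride);
        return row[x]; // 0 means no valid depth at this pixel
    }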

The second problem: we used k4a_calibration_2d_to_3d to transform the manually labeled skeleton joint coordinates into 3D, but the results are very different from the joint coordinates detected by the Body Tracking SDK, especially the depth. Yet when we use k4a_calibration_3d_to_2d, the difference is not big. So we want to figure out where the problem is.
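
An editorial aside, not from this thread: if a labeled pixel lands on or just off the silhouette, the re-projected depth can come from the background behind the body, which could explain readings far beyond the body's roughly 1.6 m distance. One common mitigation is to take the median of the valid depths in a small window around the label; a sketch with a hypothetical helper (untested):

    #include <k4a/k4a.h>
    #include <stdint.h>
    #include <stdlib.h>

    static int cmp_u16(const void *a, const void *b)
    {
        return (int)(*(const uint16_t *)a) - (int)(*(const uint16_t *)b);
    }

    // Median of the valid (non-zero) depths in a (2r+1)x(2r+1) window around
    // (x, y). Returns 0 if no valid depth was found in the window.
    uint16_t robust_depth_at(k4a_image_t transformed_depth, int x, int y, int r)
    {
        int w = k4a_image_get_width_pixels(transformed_depth);
        int h = k4a_image_get_height_pixels(transformed_depth);
        int stride = k4a_image_get_stride_bytes(transformed_depth);
        const uint8_t *buf = k4a_image_get_buffer(transformed_depth);

        uint16_t samples[121]; // supports r <= 5
        int n = 0;
        for (int dy = -r; dy <= r; dy++)
            for (int dx = -r; dx <= r; dx++)
            {
                int px = x + dx, py = y + dy;
                if (px < 0 || px >= w || py < 0 || py >= h)
                    continue;
                uint16_t d = ((const uint16_t *)(const void *)(buf + (size_t)py * stride))[px];
                if (d != 0 && n < 121)
                    samples[n++] = d;
            }
        if (n == 0)
            return 0;
        qsort(samples, (size_t)n, sizeof(uint16_t), cmp_u16);
        return samples[n / 2];
    }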

Thanks.

jean1zhou avatar Jun 17 '20 01:06 jean1zhou

Hi @jean1zhou, I am having trouble transforming the images with k4a_transformation_depth_image_to_color_camera. I tried to convert the video in playback mode but couldn't do it. My only goal is to get the transformed image. How exactly did you do it? Can you contact me here or via mail: [email protected]? I appreciate your help!
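
An editorial sketch of one possible approach with the playback API (untested; error handling omitted):

    #include <k4a/k4a.h>
    #include <k4arecord/playback.h>
    #include <stdint.h>

    // Sketch: open a recording, grab one capture, and build the depth image
    // re-projected into the color camera geometry.
    void transformed_depth_from_recording(const char *path)
    {
        k4a_playback_t playback = NULL;
        k4a_playback_open(path, &playback);

        k4a_calibration_t calibration;
        k4a_playback_get_calibration(playback, &calibration);
        k4a_transformation_t transformation = k4a_transformation_create(&calibration);

        k4a_capture_t capture = NULL;
        if (k4a_playback_get_next_capture(playback, &capture) == K4A_STREAM_RESULT_SUCCEEDED)
        {
            k4a_image_t depth = k4a_capture_get_depth_image(capture);
            if (depth != NULL)
            {
                int w = calibration.color_camera_calibration.resolution_width;
                int h = calibration.color_camera_calibration.resolution_height;
                k4a_image_t transformed = NULL;
                k4a_image_create(K4A_IMAGE_FORMAT_DEPTH16, w, h,
                                 w * (int)sizeof(uint16_t), &transformed);
                k4a_transformation_depth_image_to_color_camera(transformation, depth, transformed);
                // ... use the transformed image here ...
                k4a_image_release(transformed);
                k4a_image_release(depth);
            }
            k4a_capture_release(capture);
        }

        k4a_transformation_destroy(transformation);
        k4a_playback_close(playback);
    }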

fatihoezdemir avatar Sep 11 '20 22:09 fatihoezdemir

I'm sorry to disturb you. I use C# to develop the program. Do k4a_calibration_2d_to_3d and k4a_calibration_3d_to_2d have versions that can be used from C# or VB.NET?

TomitaTseng avatar Nov 12 '20 05:11 TomitaTseng

Hi, I'm facing the same questions. Have you made any progress recently?

wangwwwwwwv avatar Feb 25 '21 08:02 wangwwwwwwv


Hello, I ran into the same problem recently. Could you tell me how you finally solved it?

amazing-cc avatar Mar 06 '23 02:03 amazing-cc