
feat(lidar_transfusion): add lidar_transfusion 3D detection package

amadeuszsz opened this issue 4 months ago • 14 comments

Description

The PR adds a new lidar_transfusion package which brings a new 3D detection component based on TransFusion[1].

https://github.com/autowarefoundation/autoware.universe/assets/37396312/0c2e3f64-b72f-4cd6-b72b-fa50525ea9b8

Related links

[1] Xuyang Bai, Zeyu Hu, Xinge Zhu, Qingqiu Huang, Yilun Chen, Hongbo Fu and Chiew-Lan Tai. "TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers." arXiv preprint arXiv:2203.11496 (2022)
[2] rosbag (TIER IV internal link)
[3] helper files (TIER IV internal link)
[4] TransFusion onnx model (TIER IV internal link)

Tests performed

  • Model metrics:
------------- T4Metric results -------------
04/25 09:28:12 - mmengine - INFO - | class_name | mAP   | mAP@0.5 | mAP@1.0 | mAP@2.0 | mAP@4.0 | error@trans_err | error@scale_err | error@orient_err | error@vel_err | error@attr_err |
04/25 09:28:12 - mmengine - INFO - |----------------------------------------------------------------------------------------------------------------------------------------------------|
04/25 09:28:12 - mmengine - INFO - | car        | 0.788 | 0.609   | 0.798   | 0.859   | 0.884   | 0.26            | 0.159           | 0.103            | 0.779         | 1              |
04/25 09:28:12 - mmengine - INFO - | truck      | 0.59  | 0.315   | 0.577   | 0.7     | 0.766   | 0.379           | 0.149           | 0.0516           | 0.77          | 1              |
04/25 09:28:12 - mmengine - INFO - | bus        | 0.554 | 0.347   | 0.532   | 0.628   | 0.709   | 0.363           | 0.137           | 0.0762           | 1.69          | 1              |
04/25 09:28:12 - mmengine - INFO - | bicycle    | 0.236 | 0.192   | 0.241   | 0.251   | 0.261   | 0.283           | 0.248           | 0.27             | 0.839         | 1              |
04/25 09:28:12 - mmengine - INFO - | pedestrian | 0.628 | 0.57    | 0.613   | 0.642   | 0.688   | 0.253           | 0.281           | 0.424            | 0.626         | 1              |
04/25 09:28:12 - mmengine - INFO - | Total mAP: 0.559
  • ROS 2 node performance with RTX 4090:
                   | pre-process [ms] | inference [ms] | post-process [ms] | total [ms]  |
------------------------------------------------------------------------------------------
lidar_transfusion  |   1.52 ± 0.14    |  2.29 ± 0.37   |    0.20 ± 0.15    | 4.07 ± 0.44 |
lidar_centerpoint  |        -         |       -        |         -         | 5.53 ± 0.55 |

Notes for reviewers

The package is best tested with a rosbag file. If you need data, you can use this rosbag[2] and copy the helper files[3] to the launch directory of the lidar_transfusion package before building. The default path for the ONNX model is ~/autoware_data/lidar_transfusion/transfusion.onnx. The model awaits deployment; for now, please use the attached link[4].

To start, run the following commands:

# terminal 1
ros2 launch lidar_transfusion data_pipeline.launch.py 

# terminal 2
ros2 bag play trailer_yaw_ticket_data_jpntaxi7/01d060d9-8d25-45e0-b45e-9fd7201ac27b_2023-05-26-11-30-03_p0900_3.db3 --loop --clock

# terminal 3
rviz2 --ros-args -p use_sim_time:=True 

# terminal 4 (for convenience you can add <param name="use_sim_time" value="true"/> to the node in xml file)
ros2 launch lidar_transfusion lidar_transfusion.launch.xml

Interface changes

Effects on system behavior

Pre-review checklist for the PR author

The PR author must check the checkboxes below when creating the PR.

In-review checklist for the PR reviewers

The PR reviewers must check the checkboxes below before approval.

  • [ ] The PR follows the pull request guidelines.
  • [ ] The PR has been properly tested.
  • [ ] The PR has been reviewed by the code owners.

Post-review checklist for the PR author

The PR author must check the checkboxes below before merging.

  • [ ] There are no open discussions or they are tracked via tickets.
  • [ ] The PR is ready for merge.

After all checkboxes are checked, anyone who has write access can merge the PR.

amadeuszsz avatar Apr 25 '24 10:04 amadeuszsz

Comments:

  • When the number of voxels falls outside the expected range, the output is 200 NaN objects (200 being the maximum number of objects). This can happen, for example, when the top lidar is missing in the taxi. We should consider lowering the value, and/or aborting the inference and adding an error message.
  • We need to add transfusion to the perception-related launchers. I did it in my local environment, so I will either pass the changes to the PR's author, commit them myself, or explain the procedure to him.
  • In my environment, I sporadically get CUDA memory errors. They happen inside the first kernel of the preprocessing step.
  • Not confirmed, but the stream used in the preprocessing may not be the one we expect (adding device synchronizations instead of streams changed the behavior).
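A guard along the lines of the first bullet could look roughly like this. This is a minimal sketch, not the package's actual API; `VoxelGuard` and its field names are hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical sketch: if the voxel count is outside the range the engine
// was built for, skip inference and report an error instead of emitting
// 200 NaN objects. All names here are illustrative.
struct VoxelGuard
{
  std::size_t min_voxels;
  std::size_t max_voxels;

  // Returns an error message when inference must be aborted, nothing otherwise.
  std::optional<std::string> check(std::size_t num_voxels) const
  {
    if (num_voxels < min_voxels) {
      return "Too few voxels (" + std::to_string(num_voxels) + " < " +
             std::to_string(min_voxels) + "); skipping inference.";
    }
    if (num_voxels > max_voxels) {
      return "Too many voxels (" + std::to_string(num_voxels) + " > " +
             std::to_string(max_voxels) + "); skipping inference.";
    }
    return std::nullopt;
  }
};
```

The caller would log the returned message and publish an empty detection array instead of running the engine.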

knzo25 avatar May 02 '24 01:05 knzo25

@scepter914 The PR can still be reviewed to some degree, but I think it is better to wait until the errors disappear :pray:

knzo25 avatar May 02 '24 01:05 knzo25

@knzo25 I think you wanted to mention @scepter914

taikitanaka3 avatar May 02 '24 02:05 taikitanaka3

Apologies :bow:

knzo25 avatar May 02 '24 02:05 knzo25

Thank you @knzo25 for your review. Let me investigate the kernels first to find the mentioned leak. Regarding the NaN output, as you suggested we will:

  • change the optimization profile,
  • handle the cases where the number of voxels after preprocessing is insufficient for the optimization profile.
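One way to handle an insufficient voxel count is to zero-pad the voxel buffer up to the profile's minimum before enqueueing. The sketch below is illustrative only; `features_per_voxel` and the zero-padding policy are assumptions, not the package's actual behavior:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative sketch: pad the flattened voxel feature buffer with zero
// voxels so the input always satisfies the optimization profile's minimum
// shape. Assumes downstream steps ignore empty (all-zero) voxels.
std::size_t pad_to_profile_min(
  std::vector<float> & voxel_features, std::size_t features_per_voxel,
  std::size_t num_voxels, std::size_t profile_min_voxels)
{
  if (num_voxels >= profile_min_voxels) {
    return num_voxels;
  }
  voxel_features.resize(profile_min_voxels * features_per_voxel, 0.0f);
  return profile_min_voxels;
}
```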

amadeuszsz avatar May 02 '24 02:05 amadeuszsz

@knzo25 The recent fix solves the issues with CUDA memory and NaNs. Please check whether they have disappeared on your machine as well.

amadeuszsz avatar May 02 '24 09:05 amadeuszsz

@amadeuszsz

I apologize for the inconvenience, but could you fix the DCO? CI, including DCO, needs to pass before merging (this requirement is so inconvenient...).

scepter914 avatar May 08 '24 06:05 scepter914

@amadeuszsz I added transfusion to the other universe launchers. Just for the duration of the review, I set transfusion as the default model. However, before merging we should switch it back to centerpoint: although transfusion works, it has not been properly evaluated in all the respective places.

knzo25 avatar May 10 '24 01:05 knzo25

@amadeuszsz The web team has now uploaded the model and the param files! :rocket:

https://awf.ml.dev.web.auto/perception/models/transfusion/v1/transfusion.onnx
https://awf.ml.dev.web.auto/perception/models/transfusion/v1/lidar_transfusion.param.yaml
https://awf.ml.dev.web.auto/perception/models/transfusion/v1/detection_class_remapper.param.yaml

scepter914 avatar May 10 '24 05:05 scepter914

@scepter914 To keep things consistent, we will rename the config v1/lidar_transfusion.param.yaml -> v1/transfusion.param.yaml. Please wait for the deployment.

amadeuszsz avatar May 10 '24 06:05 amadeuszsz

@amadeuszsz While doing another task that was asked of me, I was looking at the whole perception stack and saw that the preprocessing in centerpoint is done on the CPU. I had forgotten that you told me you were already doing it on the GPU for transfusion, so I ended up doing double work and implemented it there as well :bow:

When I sent that PR I remembered, and after comparing the implementations: while quite similar, they have different overall costs, so I wonder if you could change the implementation here so it has around the same number of operations.

Let K be the total number of frames used for inference and N the number of points in a pointcloud. Assuming that the queue is full, the preprocessing cost before proper voxelization would be as follows:

transfusion:

  • enqueue: N * copy(host->host)
  • transform and concat:
    • move to GPU: K*N * copy(host->device)
    • kernel: K*N * transform(device->device)

(proposed) centerpoint:

  • enqueue: N * copy(host->device)
  • transform and concat:
    • kernel: K*N * kernel(device->device)
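As a back-of-the-envelope check of the counts above (purely illustrative arithmetic encoding the comment's cost model, not the actual implementations):

```cpp
#include <cassert>
#include <cstddef>

// Operation counts for the two preprocessing schemes described above,
// with K frames in the queue and N points per cloud. These encode the
// counts from the comment, not measurements.
struct Cost
{
  std::size_t host_to_host;
  std::size_t host_to_device;
  std::size_t device_ops;
};

Cost transfusion_cost(std::size_t K, std::size_t N)
{
  // enqueue: N host->host copies; transform+concat: K*N host->device
  // copies plus a K*N-element transform kernel.
  return {N, K * N, K * N};
}

Cost centerpoint_cost(std::size_t K, std::size_t N)
{
  // enqueue: N host->device copies; transform+concat: one K*N kernel.
  return {0, N, K * N};
}
```

The device-side work is identical; the proposed scheme removes the host->host copies and shrinks host->device traffic from K*N to N.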

knzo25 avatar May 13 '24 00:05 knzo25

@amadeuszsz After reading your kernels seriously this time, I realized that you went to great lengths to make the operations general. Usually, when writing kernels (CPU or GPU), you want to avoid all branches and loops that are not strictly needed. Could you specialize the kernels so that the fastest implementation is used at runtime? (maybe leaving your generic version for the case where no fast implementation is available)
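The dispatch could look something like this on the host side. This is a CPU sketch of the pattern only, not the actual CUDA kernels; the point-format tags are hypothetical:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Sketch of runtime dispatch to specialized implementations: each
// specialization is free of the generic version's branches and loops,
// while the generic path remains as a fallback. Formats are illustrative.
enum class PointFormat : std::uint8_t { kXYZI_Float32, kXYZIRC, kOther };

std::string select_preprocess_kernel(PointFormat format)
{
  switch (format) {
    case PointFormat::kXYZI_Float32:
      return "fast_xyzi_f32";  // branch-free specialized kernel
    case PointFormat::kXYZIRC:
      return "fast_xyzirc";    // branch-free specialized kernel
    default:
      return "generic";        // flexible but slower fallback
  }
}
```

The format would be detected once from the pointcloud fields at startup, so the per-point hot path never branches on layout.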

Also, since we control the driver, you usually would not need to pay much attention to swapping the endianness.

knzo25 avatar May 13 '24 03:05 knzo25

A while ago (how time flies), it was decided that we would migrate towards this way of using parameters: https://github.com/orgs/autowarefoundation/discussions/3371

While not all nodes have implemented this yet (in particular centerpoint), new nodes should comply with it. Can you implement it? (there are several old PRs if you need examples)

knzo25 avatar May 13 '24 04:05 knzo25

@amadeuszsz Reading the preprocessing -> inference -> postprocessing part, I see several syncs that are needed to check the error code and abort midway when needed. Are all of these necessary? As a reference, centerpoint does not have them. Would the speedup from removing them be worth it?

(I know that the one related to the optimization profile is important unless we have min voxels = 1)
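One way to cut the mid-pipeline syncs is to defer the error check to a single point at the end, skipping later stages once a failure is recorded. Below is a plain C++ sketch of that idea; CUDA calls are mimicked by integer status codes (0 for success, as with cudaGetLastError), and none of the names come from the package:

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Sketch: run preprocess -> inference -> postprocess without a host sync
// between healthy stages; accumulate a single status and check it once at
// the end. All names are illustrative.
struct DeferredPipeline
{
  int status = 0;                     // 0 == success
  std::vector<std::string> executed;  // stages that actually ran

  void run(const std::string & name, const std::function<int()> & stage)
  {
    if (status != 0) return;  // skip remaining stages after the first failure
    status = stage();
    executed.push_back(name);
  }
};
```

Whether this is safe here depends on whether any stage needs a valid result from the previous one before launch, which is exactly the question above.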

knzo25 avatar May 13 '24 05:05 knzo25