autoware.universe
feat(lidar_transfusion): add lidar_transfusion 3D detection package
Description
This PR adds a new lidar_transfusion package, which provides a 3D object detection component based on TransFusion [1].
https://github.com/autowarefoundation/autoware.universe/assets/37396312/0c2e3f64-b72f-4cd6-b72b-fa50525ea9b8
Related links
- [1] Xuyang Bai, Zeyu Hu, Xinge Zhu, Qingqiu Huang, Yilun Chen, Hongbo Fu and Chiew-Lan Tai. "TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers." arXiv preprint arXiv:2203.11496 (2022)
- [2] rosbag (TIER IV internal link)
- [3] helper files (TIER IV internal link)
- [4] TransFusion onnx model (TIER IV internal link)
Tests performed
- Model metrics:
```
------------- T4Metric results -------------
04/25 09:28:12 - mmengine - INFO - | class_name | mAP | AP@0.5 | AP@1.0 | AP@2.0 | AP@4.0 | error@trans_err | error@scale_err | error@orient_err | error@vel_err | error@attr_err |
04/25 09:28:12 - mmengine - INFO - |----------------------------------------------------------------------------------------------------------------------------------------------------|
04/25 09:28:12 - mmengine - INFO - | car | 0.788 | 0.609 | 0.798 | 0.859 | 0.884 | 0.26 | 0.159 | 0.103 | 0.779 | 1 |
04/25 09:28:12 - mmengine - INFO - | truck | 0.59 | 0.315 | 0.577 | 0.7 | 0.766 | 0.379 | 0.149 | 0.0516 | 0.77 | 1 |
04/25 09:28:12 - mmengine - INFO - | bus | 0.554 | 0.347 | 0.532 | 0.628 | 0.709 | 0.363 | 0.137 | 0.0762 | 1.69 | 1 |
04/25 09:28:12 - mmengine - INFO - | bicycle | 0.236 | 0.192 | 0.241 | 0.251 | 0.261 | 0.283 | 0.248 | 0.27 | 0.839 | 1 |
04/25 09:28:12 - mmengine - INFO - | pedestrian | 0.628 | 0.57 | 0.613 | 0.642 | 0.688 | 0.253 | 0.281 | 0.424 | 0.626 | 1 |
04/25 09:28:12 - mmengine - INFO - | Total mAP: 0.559
```
- ROS 2 node performance with RTX 4090:
| package | pre-process [ms] | inference [ms] | post-process [ms] | total [ms] |
| --- | --- | --- | --- | --- |
| lidar_transfusion | 1.52 ± 0.14 | 2.29 ± 0.37 | 0.20 ± 0.15 | 4.07 ± 0.44 |
| lidar_centerpoint | - | - | - | 5.53 ± 0.55 |
Notes for reviewers
The package is best tested with a rosbag file. If you need data, you can use this rosbag [2] and copy the helper files [3] to the launch directory of the lidar_transfusion package before building. The default path for the onnx model is ~/autoware_data/lidar_transfusion/transfusion.onnx. The model awaits deployment; temporarily, please use the attached link [4].
To start, run the following commands:

```shell
# terminal 1
ros2 launch lidar_transfusion data_pipeline.launch.py
# terminal 2
ros2 bag play trailer_yaw_ticket_data_jpntaxi7/01d060d9-8d25-45e0-b45e-9fd7201ac27b_2023-05-26-11-30-03_p0900_3.db3 --loop --clock
# terminal 3
rviz2 --ros-args -p use_sim_time:=True
# terminal 4 (for convenience you can add <param name="use_sim_time" value="true"/> to the node in the xml file)
ros2 launch lidar_transfusion lidar_transfusion.launch.xml
```
Interface changes
Effects on system behavior
Pre-review checklist for the PR author
The PR author must check the checkboxes below when creating the PR.
- [x] I've confirmed the contribution guidelines.
- [x] The PR follows the pull request guidelines.
In-review checklist for the PR reviewers
The PR reviewers must check the checkboxes below before approval.
- [ ] The PR follows the pull request guidelines.
- [ ] The PR has been properly tested.
- [ ] The PR has been reviewed by the code owners.
Post-review checklist for the PR author
The PR author must check the checkboxes below before merging.
- [ ] There are no open discussions or they are tracked via tickets.
- [ ] The PR is ready for merge.
After all checkboxes are checked, anyone who has write access can merge the PR.
Comments:
- When the number of voxels falls outside the range expected by the optimization profile, the output is 200 NaN objects (200 being the maximum number of objects). This can happen when the top lidar is missing on the taxi, for example. We should consider lowering the value, and/or aborting the inference and adding an error message.
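A guard along these lines could abort the inference early. This is a minimal sketch; the function name and bounds are illustrative, not the package's actual API:

```cpp
#include <cstddef>
#include <iostream>

// Hypothetical bounds taken from the TensorRT optimization profile.
constexpr std::size_t kMinVoxels = 1;
constexpr std::size_t kMaxVoxels = 30000;  // illustrative value only

// Returns true when the voxel count is inside the profile range, so the
// caller can skip inference (instead of producing NaN objects) otherwise.
bool isVoxelCountValid(std::size_t num_voxels)
{
  if (num_voxels < kMinVoxels || num_voxels > kMaxVoxels) {
    std::cerr << "Voxel count " << num_voxels
              << " outside optimization profile range [" << kMinVoxels
              << ", " << kMaxVoxels << "], skipping inference\n";
    return false;
  }
  return true;
}
```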
- We need to add transfusion to the perception-related launchers. I did it in my local environment, so I will either pass the changes to the PR's author, commit them myself, or explain the procedure to him.
- In my environment, I sporadically get CUDA memory errors. They happen inside the first kernel of the preprocessing step.
- Not confirmed, but the stream used in the preprocessing may not be the one we are expecting (since adding device synchronizations instead of streams changed the behavior)
@scepter914 The PR can still be reviewed to some degree, but I think it is better to wait until the errors disappear :pray:
@knzo25 I think you wanted to mention @scepter914
Apologies :bow:
Thank you @knzo25 for your review. Let me investigate the kernels first to find the mentioned leak. Regarding the NaN output, as you suggested we will:
- change the optimization profile,
- handle cases where the number of voxels after preprocessing is insufficient for the optimization profile.
@knzo25 The recent fix solves the issue with CUDA memory and NaNs. Please check whether the issue disappeared on your machine as well.
@amadeuszsz
I apologize for the inconvenience, but would you fix the DCO? We need to pass CI, including DCO, for the merge (this requirement is so inconvenient...).
@amadeuszsz I added transfusion to the other universe launchers. Just for the duration of the review, I set transfusion as the default model. However, before merging we should return it to centerpoint, since, although it works, it has not been properly evaluated in all the respective places.
@amadeuszsz The web team has now uploaded the model and the param files! :rocket:
- https://awf.ml.dev.web.auto/perception/models/transfusion/v1/transfusion.onnx
- https://awf.ml.dev.web.auto/perception/models/transfusion/v1/lidar_transfusion.param.yaml
- https://awf.ml.dev.web.auto/perception/models/transfusion/v1/detection_class_remapper.param.yaml
@scepter914 To keep consistency, we will rename the config v1/lidar_transfusion.param.yaml -> v1/transfusion.param.yaml. Please wait for the deployment.
@amadeuszsz While doing another task that was asked of me, I was looking at the whole perception stack and saw the preprocessing being done on the CPU in centerpoint. I kind of forgot you told me you were already doing it on the GPU for transfusion, so I ended up doing double work and implemented it myself there as well :bow:
When I sent the PR I remembered, and after comparing the implementations, while quite similar, they have different overall costs, so I wonder if you can change the implementation here so it has around the same number of operations:
Let K be the total number of frames used for inference and N the number of points in a pointcloud. Assuming that the queue is full, the cost of the preprocessing before the voxelization proper would be as follows:

transfusion:
- enqueue: N × copy (host->host)
- transform and concat:
  - move to gpu: K × N × copy (host->device)
  - kernel: K × N × transform (device->device)

(proposed) centerpoint:
- enqueue: N × copy (host->device)
- transform and concat:
  - kernel: K × N × kernel (device->device)
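The proposed single-pass transform-and-concat could look roughly like the following. This is a host-side C++ sketch for illustration only; the real implementation would be a CUDA kernel, and all names here are hypothetical:

```cpp
#include <array>
#include <cstddef>
#include <vector>

// A point as stored in the concatenated input buffer.
struct Point { float x, y, z, intensity; };

// Hypothetical fused step: instead of copying each cached cloud and then
// transforming it in a second pass, each point is transformed by its
// frame's 4x4 affine matrix (row-major) while being appended to the
// single output buffer: one pass over the K*N points in total.
std::vector<Point> transformAndConcat(
  const std::vector<std::vector<Point>> & frames,         // K cached clouds
  const std::vector<std::array<float, 16>> & transforms)  // one 4x4 per frame
{
  std::vector<Point> out;
  for (std::size_t k = 0; k < frames.size(); ++k) {
    const auto & m = transforms[k];
    for (const auto & p : frames[k]) {
      Point q;
      q.x = m[0] * p.x + m[1] * p.y + m[2] * p.z + m[3];
      q.y = m[4] * p.x + m[5] * p.y + m[6] * p.z + m[7];
      q.z = m[8] * p.x + m[9] * p.y + m[10] * p.z + m[11];
      q.intensity = p.intensity;
      out.push_back(q);
    }
  }
  return out;
}
```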
@amadeuszsz After reading your kernels seriously this time, I realized that you went to great lengths to make the operations general. Usually, when writing kernels (CPU or GPU), you want to avoid all the branches and loops that are not strictly needed. Could you specialize the kernels so that the fastest implementation is used at runtime? (maybe leaving your generic version for the case where no fast implementation is available)
Also, since we control the driver, you usually would not need to pay much attention to reversing the endianness.
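The specialization could be dispatched roughly like this. A sketch with hypothetical names and stubbed bodies, not the package's actual code:

```cpp
#include <cstdint>

// Hypothetical description of the incoming PointCloud2 layout.
struct CloudLayout
{
  uint32_t point_step;    // bytes per point
  bool is_bigendian;
  bool has_xyzi_float32;  // x,y,z,intensity as contiguous float32
};

// Fast path: assumes the exact little-endian XYZI float32 layout, so the
// kernel body needs no per-field branches or byte-order handling.
int preprocessXyziFloat32(const CloudLayout &) { return 1; }  // stub

// Generic fallback: handles arbitrary field layouts and endianness, at
// the cost of extra branches and loops per point.
int preprocessGeneric(const CloudLayout &) { return 2; }  // stub

// Pick the fastest implementation available for this layout at runtime.
int preprocess(const CloudLayout & layout)
{
  if (!layout.is_bigendian && layout.has_xyzi_float32 &&
      layout.point_step == 16) {
    return preprocessXyziFloat32(layout);
  }
  return preprocessGeneric(layout);
}
```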
A while ago (how time flies), it was decided that we would migrate towards this way of using parameters: https://github.com/orgs/autowarefoundation/discussions/3371
While not all the nodes have implemented this (in particular centerpoint), new nodes should comply with it. Can you implement this? (there are several old PRs if you need examples)
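Under that scheme, the node declares its parameters without in-code defaults, and all values live in a single param file loaded by the launch file. A rough sketch; the parameter names below are illustrative, not the package's actual schema:

```yaml
# config/lidar_transfusion.param.yaml (illustrative parameter names)
/**:
  ros__parameters:
    trt_precision: fp16
    score_threshold: 0.2
    densification_num_past_frames: 1
```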
@amadeuszsz Reading the preprocessing -> inference -> postprocessing part, I see several syncs that are needed to check the error code and abort midway when needed. Are all of these necessary? As a reference, centerpoint does not have them. Would the speedup from removing them be worth it?
(I know that the one related to the optimization plan is important unless we have min voxels = 1)
@amadeuszsz The person who "resolves" the comments is the one who writes them. When they are addressed, please paste the commit in which it was done. If there are several changes, it is appreciated if you split them into different commits.
@amadeuszsz A comment just in case, for future reference. One of the good points of uniform initialization is that it is guaranteed to assign the correct zero to primitive types and nullptr to pointers when using empty curly braces {}.
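For reference, a minimal example of that guarantee (the type and members here are illustrative):

```cpp
// Illustrative aggregate with primitive members and a raw pointer.
struct DetectionBuffer
{
  int num_objects;
  float score_threshold;
  float * data;
};

// With empty braces, the aggregate is value-initialized: primitives get
// the correct zero and the pointer gets nullptr, so there is no risk of
// reading indeterminate values.
DetectionBuffer makeEmptyBuffer()
{
  return DetectionBuffer{};
}
```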
@knzo25
> While doing another task that was asked of me, I was looking at the whole perception stack and saw the preprocessing being done on the CPU in centerpoint. I kind of forgot you told me you were already doing it on the GPU for transfusion, so I ended up doing double work and implemented it myself there as well 🙇
> When I sent the PR I remembered, and after comparing the implementations, while quite similar, they have different overall costs, so I wonder if you can change the implementation here so it has around the same number of operations:
> Let K be the total number of frames used for inference and N the number of points in a pointcloud. Assuming that the queue is full, the cost of the preprocessing before the voxelization proper would be as follows:
> transfusion:
> - enqueue: N × copy (host->host)
> - transform and concat:
>   - move to gpu: K × N × copy (host->device)
>   - kernel: K × N × transform (device->device)
>
> (proposed) centerpoint:
> - enqueue: N × copy (host->device)
> - transform and concat:
>   - kernel: K × N × kernel (device->device)
I modified the code as you suggested. The preprocessing time decreased from 2.15 ± 0.55 to 1.21 ± 0.25 ms with two clouds in the cache (the performance differs from the PR's initial benchmark due to a different load). The breaking changes make it harder to keep the host processing (you predicted that before 😃), therefore I just removed it.
> After reading your kernels seriously this time, I realized that you went to great lengths to make the operations general. Usually, when writing kernels (CPU or GPU), you want to avoid all the branches and loops that are not strictly needed. Could you specialize the kernels so that the fastest implementation is used at runtime? (maybe leaving your generic version for the case where no fast implementation is available)
> Also, since we control the driver, you usually would not need to pay much attention to reversing the endianness.
Changed. The package is now not compatible with other point cloud formats. However, there is validation at initialization, and we throw an exception with appropriate logs.
https://github.com/autowarefoundation/autoware.universe/pull/6890/commits/98e31a99191553352ebc8f94a372f89e7ae348f8 This seems to be the last requested modification. Since it was a big change, apart from the code review, please test it again at runtime.
@YoshiRi Due to having touched the launchers, we also need a review from one of the maintainers of that package. Since it is only a few lines, could you make a short review of that part?