autoware.universe
feat(lidar_transfusion): add lidar_transfusion 3D detection package
Description
This PR adds a new lidar_transfusion package, which provides a 3D object detection component based on TransFusion [1].
https://github.com/autowarefoundation/autoware.universe/assets/37396312/0c2e3f64-b72f-4cd6-b72b-fa50525ea9b8
Related links
- [1] Xuyang Bai, Zeyu Hu, Xinge Zhu, Qingqiu Huang, Yilun Chen, Hongbo Fu and Chiew-Lan Tai. "TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers." arXiv preprint arXiv:2203.11496 (2022)
- [2] rosbag (TIER IV internal link)
- [3] helper files (TIER IV internal link)
- [4] TransFusion onnx model (TIER IV internal link)
Tests performed
- Model metrics:
```
------------- T4Metric results -------------
04/25 09:28:12 - mmengine - INFO - | class_name | mAP | AP@0.5 | AP@1.0 | AP@2.0 | AP@4.0 | error@trans_err | error@scale_err | error@orient_err | error@vel_err | error@attr_err |
04/25 09:28:12 - mmengine - INFO - |----------------------------------------------------------------------------------------------------------------------------------------------------|
04/25 09:28:12 - mmengine - INFO - | car | 0.788 | 0.609 | 0.798 | 0.859 | 0.884 | 0.26 | 0.159 | 0.103 | 0.779 | 1 |
04/25 09:28:12 - mmengine - INFO - | truck | 0.59 | 0.315 | 0.577 | 0.7 | 0.766 | 0.379 | 0.149 | 0.0516 | 0.77 | 1 |
04/25 09:28:12 - mmengine - INFO - | bus | 0.554 | 0.347 | 0.532 | 0.628 | 0.709 | 0.363 | 0.137 | 0.0762 | 1.69 | 1 |
04/25 09:28:12 - mmengine - INFO - | bicycle | 0.236 | 0.192 | 0.241 | 0.251 | 0.261 | 0.283 | 0.248 | 0.27 | 0.839 | 1 |
04/25 09:28:12 - mmengine - INFO - | pedestrian | 0.628 | 0.57 | 0.613 | 0.642 | 0.688 | 0.253 | 0.281 | 0.424 | 0.626 | 1 |
04/25 09:28:12 - mmengine - INFO - | Total mAP: 0.559
```
- ROS 2 node performance with RTX 4090:
| package | pre-process [ms] | inference [ms] | post-process [ms] | total [ms] |
| --- | --- | --- | --- | --- |
| lidar_transfusion | 1.52 ± 0.14 | 2.29 ± 0.37 | 0.20 ± 0.15 | 4.07 ± 0.44 |
| lidar_centerpoint | - | - | - | 5.53 ± 0.55 |
Notes for reviewers
The package is best tested with a rosbag file. If you need data, you can use this rosbag [2] and copy the helper files [3] to the launch directory of the lidar_transfusion package before building. The default path for the onnx model is ~/autoware_data/lidar_transfusion/transfusion.onnx. The model awaits deployment; temporarily, please use the attached link [4].
To start, run the following commands:

```shell
# terminal 1
ros2 launch lidar_transfusion data_pipeline.launch.py
# terminal 2
ros2 bag play trailer_yaw_ticket_data_jpntaxi7/01d060d9-8d25-45e0-b45e-9fd7201ac27b_2023-05-26-11-30-03_p0900_3.db3 --loop --clock
# terminal 3
rviz2 --ros-args -p use_sim_time:=True
# terminal 4 (for convenience you can add <param name="use_sim_time" value="true"/> to the node in the xml file)
ros2 launch lidar_transfusion lidar_transfusion.launch.xml
```
Interface changes
Effects on system behavior
Pre-review checklist for the PR author
The PR author must check the checkboxes below when creating the PR.
- [x] I've confirmed the contribution guidelines.
- [x] The PR follows the pull request guidelines.
In-review checklist for the PR reviewers
The PR reviewers must check the checkboxes below before approval.
- [ ] The PR follows the pull request guidelines.
- [ ] The PR has been properly tested.
- [ ] The PR has been reviewed by the code owners.
Post-review checklist for the PR author
The PR author must check the checkboxes below before merging.
- [ ] There are no open discussions or they are tracked via tickets.
- [ ] The PR is ready for merge.
After all checkboxes are checked, anyone who has write access can merge the PR.
Comments:
- When the number of voxels falls outside the range expected by the optimization profile, the output is 200 NaN objects (200 being the maximum number of objects). This can happen when the top lidar is missing on the taxi, for example. We should consider lowering the value, and/or aborting the inference and adding an error message.
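A guard along these lines could abort the inference early. This is a minimal sketch; the function name and bounds are illustrative, not the package's actual API:

```cpp
#include <cstddef>
#include <iostream>

// Hypothetical bounds taken from the TensorRT optimization profile.
constexpr std::size_t kMinVoxels = 1;
constexpr std::size_t kMaxVoxels = 30000;  // illustrative value only

// Returns true when the voxel count is inside the profile range, so the
// caller can skip inference (instead of producing NaN objects) otherwise.
bool isVoxelCountValid(std::size_t num_voxels)
{
  if (num_voxels < kMinVoxels || num_voxels > kMaxVoxels) {
    std::cerr << "Voxel count " << num_voxels
              << " outside optimization profile range [" << kMinVoxels
              << ", " << kMaxVoxels << "], skipping inference\n";
    return false;
  }
  return true;
}
```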
- We need to add transfusion to the perception-related launchers. I did it in my local environment, so I will either pass the changes to the PR's author, commit them myself, or explain the procedure to him.
- In my environment, I sporadically get CUDA memory errors. They happen inside the first kernel of the preprocessing step.
- Not confirmed, but the stream used in the preprocessing may not be the one we are expecting (since adding device synchronizations instead of streams changed the behavior)
@scepter914 The PR can still be reviewed to some degree, but I think it is better to wait until the errors disappear :pray:
@knzo25 I think you wanted to mention @scepter914
Apologies :bow:
Thank you @knzo25 for your review. Let me investigate the kernels first to find the mentioned leak. Regarding the NaN output, as you suggested we will:
- change the optimization profile,
- handle cases where the number of voxels after preprocessing is insufficient for the optimization profile.
@knzo25 The recent fix solves the issue with CUDA memory and NaNs. Please check whether the issue disappeared on your machine as well.
@amadeuszsz
I apologize for the inconvenience, but would you fix the DCO? We need to pass CI, including DCO, for the merge (this requirement is so inconvenient...).
@amadeuszsz I added transfusion to the other universe launchers. Just for the duration of the review, I set transfusion as the default model. However, before merging we should return it to centerpoint, since, although it works, it has not been properly evaluated in all the respective places.
@amadeuszsz The web team has now uploaded the model and the param files! :rocket:
- https://awf.ml.dev.web.auto/perception/models/transfusion/v1/transfusion.onnx
- https://awf.ml.dev.web.auto/perception/models/transfusion/v1/lidar_transfusion.param.yaml
- https://awf.ml.dev.web.auto/perception/models/transfusion/v1/detection_class_remapper.param.yaml
@scepter914 To keep consistency, we will rename the config v1/lidar_transfusion.param.yaml -> v1/transfusion.param.yaml. Please wait for the deployment.
@amadeuszsz While doing another task that was asked of me, I was looking at the whole perception stack and saw the preprocessing being done on the CPU in centerpoint. I kind of forgot you told me you were already doing it on the GPU for transfusion, so I ended up doing double work and implemented it myself there as well :bow:
When I sent the PR I remembered, and after comparing the implementations, while quite similar, they have different overall costs, so I wonder if you can change the implementation here so it has around the same number of operations:
Let K be the total number of frames used for inference and N the number of points in a pointcloud. Assuming that the queue is full, the cost of the preprocessing before the voxelization proper would be as follows:

transfusion:
- enqueue: N × copy (host->host)
- transform and concat:
  - move to gpu: K × N × copy (host->device)
  - kernel: K × N × transform (device->device)

(proposed) centerpoint:
- enqueue: N × copy (host->device)
- transform and concat:
  - kernel: K × N × kernel (device->device)
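The proposed single-pass transform-and-concat could look roughly like the following. This is a host-side C++ sketch for illustration only; the real implementation would be a CUDA kernel, and all names here are hypothetical:

```cpp
#include <array>
#include <cstddef>
#include <vector>

// A point as stored in the concatenated input buffer.
struct Point { float x, y, z, intensity; };

// Hypothetical fused step: instead of copying each cached cloud and then
// transforming it in a second pass, each point is transformed by its
// frame's 4x4 affine matrix (row-major) while being appended to the
// single output buffer: one pass over the K*N points in total.
std::vector<Point> transformAndConcat(
  const std::vector<std::vector<Point>> & frames,         // K cached clouds
  const std::vector<std::array<float, 16>> & transforms)  // one 4x4 per frame
{
  std::vector<Point> out;
  for (std::size_t k = 0; k < frames.size(); ++k) {
    const auto & m = transforms[k];
    for (const auto & p : frames[k]) {
      Point q;
      q.x = m[0] * p.x + m[1] * p.y + m[2] * p.z + m[3];
      q.y = m[4] * p.x + m[5] * p.y + m[6] * p.z + m[7];
      q.z = m[8] * p.x + m[9] * p.y + m[10] * p.z + m[11];
      q.intensity = p.intensity;
      out.push_back(q);
    }
  }
  return out;
}
```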
@amadeuszsz After reading your kernels seriously this time, I realized that you went to great lengths to make the operations general. Usually, when writing kernels (CPU or GPU), you want to avoid all the branches and loops that are not strictly needed. Could you specialize the kernels so that the fastest implementation is used at runtime? (maybe leaving your generic version for the case where no fast implementation is available)
Also, since we control the driver, you usually would not need to pay much attention to reversing the endianness.
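The specialization could be dispatched roughly like this. A sketch with hypothetical names and stubbed bodies, not the package's actual code:

```cpp
#include <cstdint>

// Hypothetical description of the incoming PointCloud2 layout.
struct CloudLayout
{
  uint32_t point_step;    // bytes per point
  bool is_bigendian;
  bool has_xyzi_float32;  // x,y,z,intensity as contiguous float32
};

// Fast path: assumes the exact little-endian XYZI float32 layout, so the
// kernel body needs no per-field branches or byte-order handling.
int preprocessXyziFloat32(const CloudLayout &) { return 1; }  // stub

// Generic fallback: handles arbitrary field layouts and endianness, at
// the cost of extra branches and loops per point.
int preprocessGeneric(const CloudLayout &) { return 2; }  // stub

// Pick the fastest implementation available for this layout at runtime.
int preprocess(const CloudLayout & layout)
{
  if (!layout.is_bigendian && layout.has_xyzi_float32 &&
      layout.point_step == 16) {
    return preprocessXyziFloat32(layout);
  }
  return preprocessGeneric(layout);
}
```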
A while ago (how time flies), it was decided that we would migrate towards this way of using parameters: https://github.com/orgs/autowarefoundation/discussions/3371
While not all the nodes have implemented this (in particular centerpoint), new nodes should comply with it. Can you implement this? (there are several old PRs if you need examples)
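Under that scheme, the node declares its parameters without in-code defaults, and all values live in a single param file loaded by the launch file. A rough sketch; the parameter names below are illustrative, not the package's actual schema:

```yaml
# config/lidar_transfusion.param.yaml (illustrative parameter names)
/**:
  ros__parameters:
    trt_precision: fp16
    score_threshold: 0.2
    densification_num_past_frames: 1
```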
@amadeuszsz Reading the preprocessing -> inference -> postprocessing part, I see several syncs that are needed to check the error code and abort midway when needed. Are all of these necessary? As a reference, centerpoint does not have them. Would the speedup from removing them be worth it?
(I know that the one related to the optimization plan is important unless we have min voxels = 1)
@amadeuszsz The person who "resolves" the comments is the one who writes them. When they are addressed, please paste the commit in which it was done. If there are several changes, it is appreciated if you split them into different commits.
@amadeuszsz A comment just in case, for future reference. One of the good points of uniform initialization is that it is guaranteed to assign the correct zero to primitive types and nullptr to pointers when using empty curly braces {}.
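For reference, a minimal example of that guarantee (the type and members here are illustrative):

```cpp
// Illustrative aggregate with primitive members and a raw pointer.
struct DetectionBuffer
{
  int num_objects;
  float score_threshold;
  float * data;
};

// With empty braces, the aggregate is value-initialized: primitives get
// the correct zero and the pointer gets nullptr, so there is no risk of
// reading indeterminate values.
DetectionBuffer makeEmptyBuffer()
{
  return DetectionBuffer{};
}
```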
@knzo25
> While doing another task that was asked of me, I was looking at the whole perception stack and saw the preprocessing being done on the CPU in centerpoint. I kind of forgot you told me you were already doing it on the GPU for transfusion, so I ended up doing double work and implemented it myself there as well 🙇
> When I sent the PR I remembered, and after comparing the implementations, while quite similar, they have different overall costs, so I wonder if you can change the implementation here so it has around the same number of operations:
> Let K be the total number of frames used for inference and N the number of points in a pointcloud. Assuming that the queue is full, the cost of the preprocessing before the voxelization proper would be as follows:
> transfusion:
> - enqueue: N × copy (host->host)
> - transform and concat:
>   - move to gpu: K × N × copy (host->device)
>   - kernel: K × N × transform (device->device)
>
> (proposed) centerpoint:
> - enqueue: N × copy (host->device)
> - transform and concat:
>   - kernel: K × N × kernel (device->device)
I modified the code as you suggested. The preprocessing time decreased from 2.15 ± 0.55 to 1.21 ± 0.25 ms with two clouds in the cache (the performance differs from the PR's initial benchmark due to a different load). The breaking changes make it harder to keep the host processing (you predicted that before 😃), therefore I just removed it.
> After reading your kernels seriously this time, I realized that you went to great lengths to make the operations general. Usually, when writing kernels (CPU or GPU), you want to avoid all the branches and loops that are not strictly needed. Could you specialize the kernels so that the fastest implementation is used at runtime? (maybe leaving your generic version for the case where no fast implementation is available)
> Also, since we control the driver, you usually would not need to pay much attention to reversing the endianness.
Changed. The package is now not compatible with other point cloud formats. However, there is validation at initialization, and we throw an exception with appropriate logs.
https://github.com/autowarefoundation/autoware.universe/pull/6890/commits/98e31a99191553352ebc8f94a372f89e7ae348f8 This seems to be the last requested modification. Since it was a big change, apart from the code review, please test it again at runtime.
@YoshiRi Due to having touched the launchers, we also need a review from one of the maintainers of that package. Since it is only a few lines, could you make a short review of that part?