GPU EuclideanClusterExtraction is still slower than CPU version.
Hi, I tested the GPU version of the cluster extraction, but it is still slower than the CPU version when processing a single PCD file that contains 196015 points. My setup: GPU: 2070 mobile; CPU: Intel® Core™ i7-10750H; CUDA: 12.0; PCL: 1.13.0.
The test script is "pcl-1.13.0/gpu/examples/segmentation/src/seg.cpp".
Any suggestions?
The output:
INFO: PointCloud_filtered still has 196015 Points
CPU Time taken: 1.15s
PointCloud representing the Cluster: 339 data points.
PointCloud representing the Cluster: 339 data points.
PointCloud representing the Cluster: 201 data points.
PointCloud representing the Cluster: 186 data points.
PointCloud representing the Cluster: 179 data points.
PointCloud representing the Cluster: 175 data points.
PointCloud representing the Cluster: 167 data points.
PointCloud representing the Cluster: 134 data points.
PointCloud representing the Cluster: 120 data points.
PointCloud representing the Cluster: 119 data points.
PointCloud representing the Cluster: 107 data points.
INFO: starting with the GPU version
GPU Time taken: 1.55s
INFO: stopped with the GPU version
PointCloud representing the Cluster: 119 data points.
PointCloud representing the Cluster: 186 data points.
PointCloud representing the Cluster: 134 data points.
PointCloud representing the Cluster: 201 data points.
PointCloud representing the Cluster: 339 data points.
PointCloud representing the Cluster: 167 data points.
PointCloud representing the Cluster: 107 data points.
PointCloud representing the Cluster: 120 data points.
PointCloud representing the Cluster: 179 data points.
PointCloud representing the Cluster: 175 data points.
PointCloud representing the Cluster: 339 data points.
Copying data to and from the GPU takes some time. Whether the GPU version is faster than the CPU version depends on how powerful your GPU is. Also, your CPU is pretty good, so it will be difficult for your GPU to beat that. You can try to change this number. If you make it bigger or smaller, the GPU version might get faster.
@mvieth Is that number the point size? If I understand correctly, below that value, extract still uses the CPU for point cloud clustering.
Not sure what you mean by "point size". It is not the total number of points in the point cloud. It is more like the number of immediate neighbours of the current point in the algorithm. If the 10 is replaced by a smaller number, then the algorithm will use the GPU more often for searching; if 10 is replaced by a larger number, then the CPU will be used more often for searching. But either way, CPU and GPU will both be used, unless you replace 10 by 0 or infinity.
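The dispatch rule described above can be illustrated with a minimal sketch. This is not PCL's code: the names `clusterWithThreshold`, `PathStats`, and `batch_threshold` are hypothetical, both search paths are replaced by the same brute-force host search, and the sketch only counts which path each search round would take. It assumes the threshold is compared against the number of points queued for searching in the current round.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

struct Point { float x, y, z; };

// Hypothetical bookkeeping: how many search rounds would take each path.
struct PathStats {
  std::size_t small_batches = 0;  // rounds that would be searched on the CPU
  std::size_t large_batches = 0;  // rounds that would be uploaded to the GPU
};

// Brute-force radius search; stands in for both the host and device search.
std::vector<std::size_t> radiusSearch(const std::vector<Point>& cloud,
                                      const Point& q, float radius) {
  assert(radius > 0.0f);
  std::vector<std::size_t> found;
  const float r2 = radius * radius;
  for (std::size_t i = 0; i < cloud.size(); ++i) {
    const float dx = cloud[i].x - q.x;
    const float dy = cloud[i].y - q.y;
    const float dz = cloud[i].z - q.z;
    if (dx * dx + dy * dy + dz * dz <= r2) found.push_back(i);
  }
  return found;
}

// Region growing where each round's query batch is classified by its size.
std::vector<std::vector<std::size_t>> clusterWithThreshold(
    const std::vector<Point>& cloud, float radius,
    std::size_t batch_threshold, PathStats& stats) {
  std::vector<bool> processed(cloud.size(), false);
  std::vector<std::vector<std::size_t>> clusters;
  for (std::size_t seed = 0; seed < cloud.size(); ++seed) {
    if (processed[seed]) continue;
    processed[seed] = true;
    std::vector<std::size_t> cluster{seed};
    std::vector<std::size_t> frontier{seed};  // points to search from this round
    while (!frontier.empty()) {
      // The dispatch decision: few queries -> CPU, many queries -> GPU.
      if (frontier.size() <= batch_threshold) ++stats.small_batches;
      else ++stats.large_batches;
      std::vector<std::size_t> next;
      for (std::size_t q : frontier) {
        for (std::size_t n : radiusSearch(cloud, cloud[q], radius)) {
          if (!processed[n]) {
            processed[n] = true;
            cluster.push_back(n);
            next.push_back(n);
          }
        }
      }
      frontier.swap(next);  // newly found points become the next query batch
    }
    clusters.push_back(std::move(cluster));
  }
  return clusters;
}
```

In this sketch, a threshold of 0 sends every round down the "GPU" path, and a threshold larger than any possible batch sends every round down the "CPU" path, which mirrors the two extremes mentioned above. Any value in between mixes the two, since every cluster starts from a single seed point.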
@mvieth I see. Thank you so much. I had misunderstood.
@mvieth Today I tested the PCL GPU clustering on the new Nvidia Orin 64 GB version, still using the default number 10. The result:
[INFO] [1713520204.334639066] [cloud_seg_node]: Cluster number: 17
CPU Time taken: 0.15s
GPU Time taken: 0.84s
[INFO] [1713520205.749163099] [cloud_seg_node]: Cluster number: 30
CPU Time taken: 0.15s
GPU Time taken: 0.76s
[INFO] [1713520207.071808625] [cloud_seg_node]: Cluster number: 30
CPU Time taken: 0.13s
GPU Time taken: 0.54s
[INFO] [1713520208.065284638] [cloud_seg_node]: Cluster number: 30
CPU Time taken: 0.13s
GPU Time taken: 0.63s
[INFO] [1713520209.168630773] [cloud_seg_node]: Cluster number: 30
The Nvidia Orin has an Nvidia Ampere GPU with 2048 cores and 64 Tensor cores, so I expected its GPU to be much better than its CPU. Our goal is to limit the clustering time to less than 0.1s, but the result is not what I expected. By modifying the number until a certain value, will the time used by the cluster with GPU be smaller than the time used with CPU? I still doubt it. I have not had time to figure it out yet because each compilation takes a long time. Have you ever succeeded before?
By modifying the number until a certain value, will the time used by the cluster with GPU be smaller than the time used with CPU?
I have not tried changing that number, and I can't predict what the effect will be with your point cloud, your parameters, and your Nvidia device.
I think one big disadvantage of the current GPU clustering implementation is that all the search results on the GPU have to be transferred back to the CPU memory (this happens inside economical_download). These results are highly redundant because several points in a cluster will all have each other as neighbours, so this might amount to several megabytes, perhaps even gigabytes for very large point clouds. A better implementation might process those results further on the GPU before copying them back to CPU memory.
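To get a feeling for the redundancy described above, here is a small host-only sketch. It does not model PCL's actual result buffers: `estimateRedundancy` and `TransferEstimate` are hypothetical names, and the sketch simply compares how many neighbour indices a batch of radius searches returns in total with how many distinct indices those results contain.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

struct Point { float x, y, z; };

// Hypothetical summary of one batch of radius searches.
struct TransferEstimate {
  std::size_t raw_indices;     // indices returned across all queries
  std::size_t unique_indices;  // distinct indices among them
};

// Runs a brute-force radius search for every query index and counts results.
TransferEstimate estimateRedundancy(const std::vector<Point>& cloud,
                                    const std::vector<std::size_t>& queries,
                                    float radius) {
  assert(radius > 0.0f);
  const float r2 = radius * radius;
  std::set<std::size_t> unique;
  std::size_t raw = 0;
  for (std::size_t q : queries) {
    for (std::size_t i = 0; i < cloud.size(); ++i) {
      const float dx = cloud[i].x - cloud[q].x;
      const float dy = cloud[i].y - cloud[q].y;
      const float dz = cloud[i].z - cloud[q].z;
      if (dx * dx + dy * dy + dz * dz <= r2) {
        ++raw;             // this index would be copied back for query q
        unique.insert(i);  // but it only carries new information once
      }
    }
  }
  return {raw, unique.size()};
}
```

For a tight group of n points that all lie within the search radius of each other, querying all n of them yields n * n raw indices but only n distinct ones, so the amount of data copied back grows quadratically with the size of such a group while the useful information grows linearly.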