cuPCL: CUDA filter demo, cuda-pcl is worse than pcl if MemCpyTime (4.6285 ms) is considered.

Hi everyone,

I added a time-span calculation to measure the time consumed by allocating the input and output memory.

This kind of memory allocation is needed before every CUDA function call.

The code is below.

  // Start timing: input allocation, host-to-device copy, output allocation.
  t1 = std::chrono::steady_clock::now();
  cudaMallocManaged(&input, sizeof(float) * 4 * nCount, cudaMemAttachHost);
  cudaStreamAttachMemAsync(stream, input);
  cudaMemcpyAsync(input, inputData, sizeof(float) * 4 * nCount, cudaMemcpyHostToDevice, stream);
  cudaStreamSynchronize(stream);

  float *output = NULL;
  cudaMallocManaged(&output, sizeof(float) * 4 * nCount, cudaMemAttachHost);
  cudaStreamAttachMemAsync(stream, output);
  cudaStreamSynchronize(stream);
  // Stop timing; time_span1 is the elapsed time in milliseconds.
  t2 = std::chrono::steady_clock::now();
  auto time_span1 = std::chrono::duration_cast<std::chrono::duration<double, std::ratio<1, 1000>>>(t2 - t1);
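
The snippet above times allocation and copy as one span, so it cannot tell which of the two dominates. A minimal sketch of timing each phase separately follows. It is host-only so it runs without a GPU: `elapsedMs`, `PhaseTimes`, and `timeAllocAndCopy` are hypothetical names, not part of cuPCL, and the `std::vector` resize and `std::copy` are stand-ins for `cudaMallocManaged` and `cudaMemcpyAsync` followed by `cudaStreamSynchronize`.

```cpp
#include <algorithm>
#include <chrono>
#include <vector>

// Hypothetical helper (not part of cuPCL): runs fn and returns its
// wall-clock duration in milliseconds.
template <typename Fn>
double elapsedMs(Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Per-phase timings, in milliseconds.
struct PhaseTimes {
    double allocMs;
    double copyMs;
};

// Times "allocation" and "copy" as two separate phases.
// Assumption: the vector resize and std::copy below only stand in for the
// CUDA allocation and host-to-device copy; in a real test each lambda would
// wrap the corresponding CUDA calls plus a stream synchronize.
PhaseTimes timeAllocAndCopy(const std::vector<float>& inputData,
                            std::vector<float>& buffer) {
    PhaseTimes t{};
    t.allocMs = elapsedMs([&] { buffer.resize(inputData.size()); });
    t.copyMs = elapsedMs([&] {
        std::copy(inputData.begin(), inputData.end(), buffer.begin());
    });
    return t;
}
```

Calling `timeAllocAndCopy(points, buffer)` with a vector of `4 * nCount` floats gives the two durations separately, which shows whether the allocation or the copy accounts for most of the measured span.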

And here is my test result: the MemCpy time is 4.6285 ms.

So, going by the real FPS of the passthrough filter, cuda-pcl (4.6285 + 0.456927 = 5.085427 ms) is not better than pcl (4.25133 ms).

So what is the best practice for programming with cuda-pcl?

Thanks.

Apr 22 '21 06:04 ZhenshengLee

In my case, it costs 0.000736 ms. But when I use VoxelGrid in cuda-pcl, it is worse than pcl.

Apr 22 '23 02:04 NJUSTzwh

In my case, it costs 0.000736 ms. But when I use VoxelGrid in cuda-pcl, it is worse than pcl.

@NJUSTzwh Hi,

Your timing looks crazy. Could you share more details about your test?

hardware info (CPU and GPU)
input info (point cloud size)

Aug 02 '23 06:08 ZhenshengLee

What machine did you test on?

Dec 27 '23 14:12 ZFcvYes

What machine did you test on?

The program was tested on a Jetson Xavier.

Dec 28 '23 02:12 ZhenshengLee

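What machine did you test on?

x86 PC with

Quadro P4000 GPU

3.2 GHz 12-core CPU

I ran cuNDT with this code on a Jetson Xavier NX (eMMC, 16 GB) at maximum power, using test_P.pcd and test_Q.pcd. Over 100 runs, the average time was about 110.5 ms. The Jetson Xavier AGX has 1.6 times the FP32 performance of the Jetson Xavier NX, but the run times do not differ by that ratio. Is this performance correct? Do you plan to publish performance comparisons for other Jetson machines? Looking forward to your reply, thanks.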

Dec 28 '23 08:12 ZFcvYes
