CUDA filter demo: cuda-pcl is slower than pcl if MemCpyTime (4.6285 ms) is considered.

Open ZhenshengLee opened this issue 3 years ago • 5 comments

Hi everyone,

I added a timing measurement around the input and output memory allocation to see how long it takes.

This kind of memory allocation is needed before every CUDA function call.

The code is below.

  // Start timing before any allocation.
  t1 = std::chrono::steady_clock::now();
  // Input buffer: managed memory, attached to the stream, then filled from the host.
  cudaMallocManaged(&input, sizeof(float) * 4 * nCount, cudaMemAttachHost);
  cudaStreamAttachMemAsync(stream, input);
  cudaMemcpyAsync(input, inputData, sizeof(float) * 4 * nCount, cudaMemcpyHostToDevice, stream);
  cudaStreamSynchronize(stream);

  // Output buffer: same size, allocated the same way.
  float *output = NULL;
  cudaMallocManaged(&output, sizeof(float) * 4 * nCount, cudaMemAttachHost);
  cudaStreamAttachMemAsync(stream, output);
  cudaStreamSynchronize(stream);
  t2 = std::chrono::steady_clock::now();
  // Elapsed time in milliseconds.
  auto time_span1 = std::chrono::duration_cast<std::chrono::duration<double, std::ratio<1, 1000>>>(t2 - t1);
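For anyone who wants to repeat this measurement, the timing part of the snippet above can be pulled into a small host-side helper. This is only a sketch in plain C++: `Millis` and `meanMillis` are hypothetical names, not part of cuda-pcl or the CUDA runtime, and the helper times whatever callable it is given.

```cpp
#include <cassert>
#include <chrono>
#include <ratio>
#include <type_traits>

// std::ratio<1, 1000> and std::milli name the same type, so the cast in the
// snippet above can be written with the shorter alias.
static_assert(std::is_same<std::ratio<1, 1000>, std::milli>::value,
              "std::milli is std::ratio<1, 1000>");

using Millis = std::chrono::duration<double, std::milli>;

// Hypothetical helper: calls `work` `runs` times and returns the mean
// wall-clock time per call in milliseconds.
template <typename Work>
double meanMillis(Work&& work, int runs) {
  assert(runs > 0);
  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < runs; ++i) {
    work();
  }
  const auto stop = std::chrono::steady_clock::now();
  return Millis(stop - start).count() / runs;
}
```

If the callable launches asynchronous GPU work, it should synchronize (for example with `cudaStreamSynchronize(stream)`, as in the snippet above) before returning; otherwise the helper only measures the time needed to enqueue the work.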

Here is my test result: the memory allocation and copy take 4.6285 ms.

image

So, counting the real end-to-end time of the passthrough filter, cuda-pcl (4.6285 + 0.456927 = 5.085427 ms) is not faster than pcl (4.25133 ms).
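The comparison can be written out as a short calculation. The sketch below uses only the three figures quoted in this post; `gpuTotalMs` and `setupBudgetMs` are hypothetical helper names introduced for illustration. It also computes how small the per-frame setup cost would have to be for the GPU path to break even with the CPU path.

```cpp
#include <cassert>

// End-to-end GPU time per frame: buffer setup (allocation + copy) plus the
// filter itself.
double gpuTotalMs(double setupMs, double filterMs) {
  assert(setupMs >= 0.0 && filterMs >= 0.0);
  return setupMs + filterMs;
}

// Largest per-frame setup cost at which the GPU path still matches the CPU
// path: anything above this makes the GPU path slower overall.
double setupBudgetMs(double cpuMs, double gpuFilterMs) {
  return cpuMs - gpuFilterMs;
}
```

With the numbers above, `gpuTotalMs(4.6285, 0.456927)` gives 5.085427 ms and `setupBudgetMs(4.25133, 0.456927)` gives 3.794403 ms, so the measured setup time of 4.6285 ms is about 0.83 ms over the break-even budget.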

So what is the best practice for programming with cuda-pcl?

Thanks.

ZhenshengLee avatar Apr 22 '21 06:04 ZhenshengLee

In my case, it costs 0.000736 ms. But when I use the VoxelGrid in cuda-pcl, it is slower than pcl.

NJUSTzwh avatar Apr 22 '23 02:04 NJUSTzwh

@NJUSTzwh Hi

Your timing looks surprisingly small. Could you share more details about your test?

  • hardware info (CPU and GPU)
  • input info (point cloud size)

ZhenshengLee avatar Aug 02 '23 06:08 ZhenshengLee

May I ask which machine you tested on?

ZFcvYes avatar Dec 27 '23 14:12 ZFcvYes

The program was tested on a Jetson Xavier.

ZhenshengLee avatar Dec 28 '23 02:12 ZhenshengLee

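x86 PC with

  • Quadro P4000 GPU
  • 3.2 GHz 12-core CPU

I used this code to run cuNDT on a Jetson Xavier NX (eMMC, 16 GB) at maximum power, with test_P.pcd and test_Q.pcd. Over 100 runs, the average time was about 110.5 ms. The FP32 performance of the Jetson Xavier AGX is 1.6 times that of the Jetson Xavier NX, but the runtime does not differ by that factor. Is this performance correct? Do you plan to publish performance comparisons for other Jetson machines? I look forward to your reply, thank you.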

ZFcvYes avatar Dec 28 '23 08:12 ZFcvYes