cuPCL
CUDA filter demo: cuda-pcl is slower than PCL once the memory copy time (4.6285 ms) is included.
Hi everyone,
I added a time-span calculation to measure how long the input and output memory allocation takes.
This kind of memory allocation is needed before every CUDA function call.
The code is below.
// Start of the measured region: allocation, stream attach, copy, and sync.
t1 = std::chrono::steady_clock::now();
// Input buffer: 4 floats per point, nCount points, in managed memory.
cudaMallocManaged(&input, sizeof(float) * 4 * nCount, cudaMemAttachHost);
cudaStreamAttachMemAsync(stream, input);
cudaMemcpyAsync(input, inputData, sizeof(float) * 4 * nCount, cudaMemcpyHostToDevice, stream);
cudaStreamSynchronize(stream);
// Output buffer of the same size, attached to the same stream.
float *output = NULL;
cudaMallocManaged(&output, sizeof(float) * 4 * nCount, cudaMemAttachHost);
cudaStreamAttachMemAsync(stream, output);
cudaStreamSynchronize(stream);
// End of the measured region; time_span1 is the elapsed time in milliseconds.
t2 = std::chrono::steady_clock::now();
auto time_span1 = std::chrono::duration_cast<std::chrono::duration<double, std::ratio<1, 1000>>>(t2 - t1);
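For comparison, the same kind of measurement can be wrapped in a small reusable helper. This is only a minimal sketch in plain C++: `mean_elapsed_ms` and `stand_in_workload` are hypothetical names, not part of cuPCL, and the stand-in simply sleeps in place of the allocate/copy/synchronize sequence so the example runs without a GPU. For real CUDA work, the callable would need to synchronize the stream before returning, as the snippet above does, otherwise only the launch cost is measured.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical helper (not part of cuPCL): runs `work` `repeats` times and
// returns the mean wall-clock time per run in milliseconds.
template <typename Work>
double mean_elapsed_ms(Work&& work, int repeats = 1) {
    assert(repeats > 0);
    using ms = std::chrono::duration<double, std::milli>;
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repeats; ++i) {
        work();  // for GPU work, this must block until the stream is idle
    }
    const auto stop = std::chrono::steady_clock::now();
    return ms(stop - start).count() / repeats;
}

// Stand-in for the allocate + copy + synchronize sequence, so the sketch
// runs on a machine without CUDA. It just sleeps for 2 ms.
inline void stand_in_workload() {
    std::this_thread::sleep_for(std::chrono::milliseconds(2));
}
```

Calling `mean_elapsed_ms(stand_in_workload, 100)` averages over 100 runs. Averaging over several runs, possibly after one untimed warm-up run, may give a steadier number than a single measurement, since the first CUDA call in a process typically also pays for context initialization.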
Here is my test result: the memory allocation and copy took 4.6285 ms.
So, measured by the real end-to-end time of the passthrough filter,
cuda-pcl (4.6285 + 0.456927 = 5.085427 ms) is no better than PCL (4.25133 ms).
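The arithmetic behind that comparison can be written out as a short sketch. The three constants are the numbers quoted in this post; the constant and function names are made up for illustration. The last function gives the break-even point: how small the allocation and copy step would have to be for cuda-pcl to match PCL on these figures.

```cpp
#include <cassert>

// Numbers quoted in the post, all in milliseconds.
constexpr double kSetupMs  = 4.6285;    // managed allocation + copy + sync
constexpr double kFilterMs = 0.456927;  // cuda-pcl passthrough filter alone
constexpr double kPclMs    = 4.25133;   // PCL passthrough filter

// End-to-end cuda-pcl cost when the setup is paid on every call.
constexpr double cuda_total_ms() { return kSetupMs + kFilterMs; }

// How much slower cuda-pcl is than PCL once setup is included.
constexpr double gap_ms() { return cuda_total_ms() - kPclMs; }

// Largest setup time at which cuda-pcl would still match PCL.
constexpr double break_even_setup_ms() { return kPclMs - kFilterMs; }

// With the quoted numbers, the end-to-end cuda-pcl time exceeds the PCL time.
static_assert(cuda_total_ms() > kPclMs, "cuda-pcl total exceeds PCL time");
```

With these inputs the total is 5.085427 ms, the gap is about 0.834 ms, and the setup step would need to drop below roughly 3.794 ms for the two to break even.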
So what is the best practice for programming with cuda-pcl?
Thanks.
In my case, it costs 0.000736 ms. But when I use VoxelGrid in cuda-pcl, it performs worse than PCL.
@NJUSTzwh Hi
Your timing looks crazy. Could you share more details about your test?
- hardware info (CPU and GPU)
- input info (point cloud size)
What machine did you test on?
The program was tested on a Jetson Xavier.
An x86 PC with:
- Quadro P4000 GPU
- 3.2 GHz 12-core CPU
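I ran cuNDT with this code on a Jetson Xavier NX (eMMC, 16 GB) at maximum power, using test_P.pcd and test_Q.pcd. Over 100 runs, the average time was about 110.5 ms. The FP32 performance of the Jetson Xavier AGX is 1.6 times that of the Jetson Xavier NX, but the run times do not differ by that factor. Is this performance correct? Do you plan to publish performance comparisons for other Jetson machines? Looking forward to your reply, thanks.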