DenseCRF
DenseCRF copied to clipboard
Fix the timing in CUDA implementation
The CUDA kernel is async and thus the timing is not measured correctly for GPU implementation. This PR moves the latency measurement to after memory copy since it's synchronized.
An explicit cudaDeviceSynchronize() could also be put before the measurement as an alternative to moving the measurement after the memory copy.