Fix the timing in CUDA implementation

Open TonyLianLong opened this issue 3 years ago • 0 comments

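CUDA kernel launches are asynchronous: the launch returns to the host before the kernel has finished, so a host-side timer stopped right after the launch does not measure the GPU implementation's latency correctly. This PR moves the end of the latency measurement to after the memory copy, since the copy is synchronized and only completes once the kernel is done.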

Alternatively, an explicit cudaDeviceSynchronize() call could be placed before the end of the measurement instead of moving the measurement to after the memory copy.
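
The difference between the two timings can be illustrated with a minimal sketch. It is not the code from this repository: the kernel `scale`, the helper `time_scale_ms`, and the buffer size are hypothetical names and values chosen for illustration, and error checking is omitted for brevity.

```cuda
#include <chrono>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration only: scales each element in place.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: host-side latency of one launch, in milliseconds.
// With wait_for_gpu == false the timer stops as soon as the launch returns,
// which may be before the kernel has finished running.
double time_scale_ms(float *d_data, int n, bool wait_for_gpu) {
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    auto start = std::chrono::steady_clock::now();
    scale<<<blocks, threads>>>(d_data, n, 2.0f);
    if (wait_for_gpu) cudaDeviceSynchronize();  // block until the kernel is done
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

int main() {
    const int n = 1 << 24;  // assumed problem size
    std::vector<float> host(n, 1.0f);
    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemcpy(d_data, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Warm-up launch so one-time initialization cost is not timed.
    time_scale_ms(d_data, n, true);

    double launch_only = time_scale_ms(d_data, n, false);
    cudaDeviceSynchronize();  // drain the pending kernel before the next timing
    double with_sync = time_scale_ms(d_data, n, true);

    // A blocking device-to-host copy also waits for earlier work on the stream.
    cudaMemcpy(host.data(), d_data, n * sizeof(float), cudaMemcpyDeviceToHost);

    std::printf("timer stopped after launch: %.3f ms\n", launch_only);
    std::printf("timer stopped after sync:   %.3f ms\n", with_sync);

    cudaFree(d_data);
    return 0;
}
```

On typical hardware the first number is expected to be much smaller than the second, because it mostly reflects launch overhead rather than kernel execution. Stopping the timer after the device-to-host copy, as this PR does, also captures the kernel's completion, but it additionally includes the transfer time, whereas the cudaDeviceSynchronize() variant does not.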

Dec 23 '22 20:12 TonyLianLong