Note: Performance of __ldg()

Open Robadob opened this issue 7 years ago • 0 comments

I tested your __ldg() implementation on a Titan X (Pascal) with CUDA 8.0 to load vec2 items. In my latency/memory-bound kernel, this made the entire kernel 2.5x slower than using __ldg() on the individual float components.
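For anyone wanting to reproduce the faster variant, here is a minimal sketch of the per-component approach. It uses only CUDA's built-in `__ldg(const float*)` overload. The `vec2` struct, the `ldg_components` helper, and the `sum_components` kernel are hypothetical names made up for illustration; they are not taken from the repo or from my actual kernel, and the library's generic load is left out because its header and signature are not shown here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical two-float type standing in for the "vec2" items (assumption).
struct vec2 {
    float x;
    float y;
};

// Load each float component separately through the read-only data cache,
// using CUDA's built-in __ldg(const float*) overload.
__device__ __forceinline__ vec2 ldg_components(const vec2* p) {
    vec2 v;
    v.x = __ldg(&p->x);
    v.y = __ldg(&p->y);
    return v;
}

// Illustrative kernel: reads one vec2 per thread and writes x + y.
__global__ void sum_components(const vec2* __restrict__ in,
                               float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        vec2 v = ldg_components(&in[i]);
        out[i] = v.x + v.y;
    }
}

int main() {
    const int n = 1 << 20;
    vec2* h_in = new vec2[n];
    float* h_out = new float[n];
    for (int i = 0; i < n; ++i) {
        h_in[i].x = static_cast<float>(i % 100);  // small integers, exact in float
        h_in[i].y = 1.0f;
    }

    vec2* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(vec2));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(vec2), cudaMemcpyHostToDevice);

    sum_components<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Compare against the same sum computed on the host.
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (h_out[i] != h_in[i].x + h_in[i].y) ++mismatches;
    }
    printf("mismatches: %d\n", mismatches);

    cudaFree(d_in);
    cudaFree(d_out);
    delete[] h_in;
    delete[] h_out;
    return mismatches != 0;
}
```

`__ldg()` needs compute capability 3.5 or higher, so build with a matching architecture flag, e.g. `nvcc -arch=sm_61` for a Pascal Titan X. This sketch only checks correctness; it does not measure the slowdown, which will depend on the kernel and its access pattern.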

Unusually, the profiler's PC sampling listed the stalls as memory dependency, but gave them the colour of a synchronisation stall.

I realise the repo has been untouched for 3 years and I'm not expecting any updates; I'm just leaving a note in case anyone else is going to use it blindly.

Dec 12 '17 15:12 Robadob
