generics
Note: Performance of `__ldg()`
I tested your `__ldg()` implementation on a Titan X (Pascal) with CUDA 8.0 to handle vec2 items. In my latency/memory-bound kernel, it slowed the entire kernel by 2.5x compared to calling the built-in `__ldg()` on the individual `float` components.
Unusually, the profiler's PC sampling listed the stalls as memory dependency, but showed them in the colour used for synchronisation stalls.
I realise the repo has been untouched for 3 years and I'm not expecting any updates; I'm just leaving a note in case anyone else is about to use it blindly.
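
For anyone wanting to try the per-component approach mentioned above, here is a minimal sketch. It is not the kernel from this note: the `vec2` struct, `ldg_components`, `sum_components` and `sum_vec2_on_device` are hypothetical names made up for illustration, and the only real API assumed is the built-in scalar `__ldg(const float*)` overload (compute capability 3.5 or later).

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical two-float item standing in for the "vec2" in the note.
struct vec2 {
    float x;
    float y;
};

// Load each float component separately through the read-only data cache,
// using the built-in scalar __ldg(const float*) overload.
__device__ inline vec2 ldg_components(const vec2* p) {
    vec2 v;
    v.x = __ldg(&p->x);
    v.y = __ldg(&p->y);
    return v;
}

// Hypothetical kernel: out[i] = in[i].x + in[i].y.
__global__ void sum_components(const vec2* __restrict__ in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        vec2 v = ldg_components(&in[i]);
        out[i] = v.x + v.y;
    }
}

// Host helper (hypothetical): copies the input to the device, runs the kernel,
// and copies the result back. Returns false if any CUDA call fails.
bool sum_vec2_on_device(const std::vector<vec2>& in, std::vector<float>& out) {
    const int n = static_cast<int>(in.size());
    out.assign(n, 0.0f);
    if (n == 0) return true;

    vec2* d_in = nullptr;
    float* d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(vec2)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }

    bool ok = cudaMemcpy(d_in, in.data(), n * sizeof(vec2),
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        const int threads = 256;
        const int blocks = (n + threads - 1) / threads;
        sum_components<<<blocks, threads>>>(d_in, d_out, n);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(out.data(), d_out, n * sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}
```

To compare against the library's generic load, swap the body of `ldg_components` for a single call on the whole struct and time both variants on your own kernel. The sketch makes no claim about which is faster on other hardware or CUDA versions.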