tdd11235813
Regarding the texture cache: do you request a 128-byte row per warp? A texture cache loves 2D access patterns, so I would expect better bandwidth when a warp requests...
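To make the "128-byte row per warp" question concrete, here is a minimal host-side C++ sketch of which element tile one warp covers for a given block width. It assumes each thread reads the element at its own (x, y) coordinate, a warp size of 32, and a power-of-two `blockDim.x`. The names `WarpFootprint` and `warp_footprint` are made up for illustration, and the sketch says nothing about how the texture cache stores data internally.

```cpp
#include <cassert>
#include <cstddef>

// Element tile touched by one 32-thread warp when every thread reads the
// element at its own (x, y) coordinate. Hypothetical helper type.
struct WarpFootprint {
    int cols;               // elements per row covered by the warp
    int rows;               // number of rows covered by the warp
    std::size_t row_bytes;  // contiguous bytes requested per row
};

// CUDA numbers the threads of a block x-fastest, so the block width alone
// decides whether a warp covers one long row or a small 2D tile.
// Assumption: block_dim_x is a power of two, so a warp never straddles a
// partial row.
WarpFootprint warp_footprint(int block_dim_x, std::size_t elem_size) {
    assert(block_dim_x > 0 && (block_dim_x & (block_dim_x - 1)) == 0);
    const int warp_size = 32;
    const int cols = block_dim_x < warp_size ? block_dim_x : warp_size;
    const int rows = warp_size / cols;
    return {cols, rows, cols * elem_size};
}
```

With 4-byte elements, `warp_footprint(32, 4)` describes a single 32-element row of 128 bytes, while `warp_footprint(8, 4)` describes an 8 x 4 tile with 32 bytes per row, which is the kind of 2D request the comment is asking about.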
Finally found some time to measure the metrics :) For the full measurement results, see the attachment. Hope it helps. Edit: the command used: ``` nvprof --csv --log-file k40.csv --metrics tex_cache_transactions,l2_read_transactions,dram_read_transactions,tex_cache_hit_rate,tex_cache_throughput ./cachebench-tex-loads...
ah yes, `tex_utilization` is a good point to check! ``` # from nvprof --query-metrics K40> tex_utilization: The utilization level of the texture cache relative to the peak utilization on a...
Quick reply: on the V100 I see Max(10) for `dram_utilization`, so it is bound by global memory, while on the P100 and K40 the tex cache reaches Max(10).
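The reading used here (the unit that reports the highest utilization level is the likely limiter) can be sketched in a few lines of host-side C++. This is only an illustration: it assumes the utilization values arrive as strings such as `Max (10)` or `Low (1)`, and the function names `utilization_level` and `most_utilized` are hypothetical, not part of nvprof.

```cpp
#include <string>
#include <utility>
#include <vector>

// Extracts the numeric level from a string such as "Max (10)" or "Low (1)".
// Returns -1 when no parenthesized non-negative integer is present.
int utilization_level(const std::string& text) {
    const std::size_t open = text.find('(');
    if (open == std::string::npos) return -1;
    const std::size_t close = text.find(')', open);
    if (close == std::string::npos || close == open + 1) return -1;
    int level = 0;
    for (std::size_t i = open + 1; i < close; ++i) {
        if (text[i] < '0' || text[i] > '9') return -1;
        level = level * 10 + (text[i] - '0');
    }
    return level;
}

// Returns the name of the metric with the highest level. The first metric
// wins on ties; an empty string is returned when nothing could be parsed.
std::string most_utilized(
    const std::vector<std::pair<std::string, std::string>>& metrics) {
    std::string best_name;
    int best_level = -1;
    for (const auto& metric : metrics) {
        const int level = utilization_level(metric.second);
        if (level > best_level) {
            best_level = level;
            best_name = metric.first;
        }
    }
    return best_name;
}
```

Fed with a `dram_utilization` entry at `Max (10)` and a lower `tex_utilization` entry, `most_utilized` returns `dram_utilization`. A single highest level is only a hint, so the stall reasons are still worth checking.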
OK, the Max values were reached in other instances, not for `int, bool=1, int=256, int=1, int=8192`. Here, every utilization metric reported Low except the tex utilization, which reported Mid....
It looks like there is not enough data to fully utilize the texture cache, as I cannot see any bottleneck in the case mentioned above. Attached are all metrics measured on the V100....
err sry, found it, the `Texture Function Unit Utilization` is the bottleneck! Edit: ``` tex_cache_hit_rate | Unified Cache Hit Rate | 100,00 % | 100,00 % | 100,00 % stall_texture...
Although the issue itself is a bit outdated, I just want to add that a few atomics such as atomicXor and atomicOr are still missing. A few more notes: - Device functions have to...
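Until such atomics are available, one common workaround is to emulate them with a compare-and-swap loop. The host-side C++ sketch below shows the idea with `std::atomic`. It is not the library's API: the names `cas_fetch_or` and `cas_fetch_xor` are hypothetical, and a device version would need the backend's own CAS primitive instead.

```cpp
#include <atomic>

// Emulates an atomic OR using only compare-and-swap. Returns the value
// stored before the update, matching the convention of CUDA's atomicOr.
unsigned int cas_fetch_or(std::atomic<unsigned int>& target, unsigned int mask) {
    unsigned int observed = target.load();
    // On failure, compare_exchange_weak refreshes `observed` with the
    // current value, so the desired value is recomputed on every retry.
    while (!target.compare_exchange_weak(observed, observed | mask)) {
    }
    return observed;
}

// Same pattern for XOR; only the combining operation changes.
unsigned int cas_fetch_xor(std::atomic<unsigned int>& target, unsigned int mask) {
    unsigned int observed = target.load();
    while (!target.compare_exchange_weak(observed, observed ^ mask)) {
    }
    return observed;
}
```

The loop retries only when another thread changed the value in between, so it stays correct under contention, though it may be slower than a native atomic.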
This is planned as a GPU student's final project for this year. I am currently preparing a plan, like: - benchmark code into /benchmarks - measuring mallocMC alloc + free performance -...
Agreed. I had assumed that the examples should work on different accelerators. But there should be some kind of advanced example that shows how to handle multiple accelerators. If someone uses...