tdd11235813 comments

Results: 38 comments of tdd11235813

Memory Performance on K80 + P100 + V100

Regarding the texture cache: do you request a 128-byte row per warp? A texture cache loves 2D access patterns, so I would expect better bandwidth when a warp requests...

Memory Performance on K80 + P100 + V100

Finally found some time to measure the metrics :) For the full measurement results, see the attachment. Hope it helps. Edit: the command used: ``` nvprof --csv --log-file k40.csv --metrics tex_cache_transactions,l2_read_transactions,dram_read_transactions,tex_cache_hit_rate,tex_cache_throughput ./cachebench-tex-loads...

Memory Performance on K80 + P100 + V100

ah yes, `tex_utilization` is a good point to check! ``` # from nvprof --query-metrics K40> tex_utilization: The utilization level of the texture cache relative to the peak utilization on a...

Memory Performance on K80 + P100 + V100

quick reply: on the V100 I see Max(10) for `dram_utilization`, so it is global-memory bound, while on the P100 and K40 the tex cache reaches Max(10).

Memory Performance on K80 + P100 + V100

ok, the Max values were reached in other instances, not for `int, bool=1, int=256, int=1, int=8192`. Here, every utilization metric reported Low, except the tex utilization, which reported Mid....

Memory Performance on K80 + P100 + V100

it looks like there is not enough data to fully utilize the texture cache, as I cannot see any bottleneck in the case mentioned above. Attached are all metrics measured on the V100....

Memory Performance on K80 + P100 + V100

err sry, found it, the `Texture Function Unit Utilization` is the bottleneck! Edit: ``` tex_cache_hit_rate | Unified Cache Hit Rate | 100,00 % | 100,00 % | 100,00 % stall_texture...

add atomic functions

although the issue itself is a bit outdated, I just want to add that a few atomics, like atomicXor and atomicOr, are still missing. A few more notes: - Device functions have to...

"Synthetic" Benchmark for PIConGPU

this is planned as a GPU student's final project for this year. I am currently preparing a plan, like: - benchmark code into /benchmarks - measuring mallocMC alloc + free performance -...

Examples with switches for all possible compilers?

agreed. I had assumed that the examples should work on different accelerators. But there should be a kind of advanced example that shows the handling of multiple accelerators. If someone uses...