`L1::no_allocate` is not safe for loading volatile data
For the PTX instruction `ld.global.nc.L1::no_allocate.L2::256B` you mentioned: on devices where global memory is cached in L1 by default, such as Volta through Blackwell (sm70+), it is equivalent to `ld.global.L1::no_allocate.L2::256B`.
`L1::no_allocate` means the memory instruction neither allocates an L1 cache line nor checks 'hit on miss' (i.e., whether the address falls on the same cache line as a previous cache-missed memory request). So accessing data in a stale L1 cache line will not fetch the line from L2 or DRAM, and cannot observe updated data from another SM.
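For concreteness, a minimal sketch of the two loads being compared, written as CUDA inline assembly (the wrapper names are mine, not from the repo):

```cuda
// Illustrative wrappers for the two forms. On sm70+ both lower to
// LDG.E.NA variants; neither guarantees cross-SM coherence.
__device__ __forceinline__ int ld_nc_na(const int* ptr) {
    int ret;
    // Read-only (non-coherent) path with L1 no-allocate and an L2 prefetch hint.
    asm volatile("ld.global.nc.L1::no_allocate.L2::256B.b32 %0, [%1];"
                 : "=r"(ret) : "l"(ptr));
    return ret;
}

__device__ __forceinline__ int ld_na(const int* ptr) {
    int ret;
    // The same load without the .nc qualifier.
    asm volatile("ld.global.L1::no_allocate.L2::256B.b32 %0, [%1];"
                 : "=r"(ret) : "l"(ptr));
    return ret;
}
```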
> For the PTX instruction `ld.global.nc.L1::no_allocate.L2::256B` you mentioned: on devices where global memory is cached in L1 by default, such as Volta through Blackwell (sm70+), it is equivalent to `ld.global.L1::no_allocate.L2::256B`.
Maybe not exactly? At the SASS level, it is. But for NVCC, `.nc` seems to give more opportunities to optimize; e.g., IIRC, normal loads cannot be automatically unrolled by NVCC.
> `L1::no_allocate` means the memory instruction neither allocates an L1 cache line nor checks 'hit on miss' (i.e., whether the address falls on the same cache line as a previous cache-missed memory request). So accessing data in a stale L1 cache line will not fetch the line from L2 or DRAM, and cannot observe updated data from another SM.
I am not sure about the hardware implementation, but the behavior is tested to be correct. See the README for how we found this:
> **Undefined-behavior PTX usage**
>
> For extreme performance, we discover and use an undefined-behavior PTX usage: using the read-only PTX `ld.global.nc.L1::no_allocate.L2::256B` to read volatile data. The PTX modifier `.nc` indicates that a non-coherent cache is used. But the correctness is tested to be guaranteed with `.L1::no_allocate` on Hopper architectures, and performance will be much better. The reason we guess may be: the non-coherent cache is unified with L1, and the L1 modifier is not just a hint but a strong option, so that correctness can be guaranteed by there being no dirty data in L1.
>
> Initially, because NVCC could not automatically unroll volatile read PTX, we tried using `__ldg` (i.e., `ld.nc`). Even compared to manually unrolled volatile reads, it was significantly faster (likely due to additional compiler optimizations). However, the results could be incorrect or dirty. After consulting the PTX documentation, we discovered that L1 and the non-coherent cache are unified on Hopper architectures. We speculated that `.L1::no_allocate` might resolve the issue, leading to this discovery.
>
> If you find kernels not working on some other platforms, you may add `DISABLE_AGGRESSIVE_PTX_INSTRS=1` to `setup.py` to disable this, or file an issue.
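A guarded read path like the one the README describes might look like this (a minimal sketch under my own naming; `DISABLE_AGGRESSIVE_PTX_INSTRS` is the flag from the README, the function name is illustrative):

```cuda
#ifndef DISABLE_AGGRESSIVE_PTX_INSTRS
__device__ __forceinline__ int ld_volatile_global(const int* ptr) {
    int ret;
    // Aggressive path: read-only, L1 no-allocate load (tested on Hopper).
    asm volatile("ld.global.nc.L1::no_allocate.L2::256B.b32 %0, [%1];"
                 : "=r"(ret) : "l"(ptr));
    return ret;
}
#else
__device__ __forceinline__ int ld_volatile_global(const int* ptr) {
    int ret;
    // Conservative fallback: a genuine volatile load, slower but with no
    // reliance on L1/non-coherent-cache unification.
    asm volatile("ld.volatile.global.b32 %0, [%1];"
                 : "=r"(ret) : "l"(ptr));
    return ret;
}
#endif
```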
Correct me if I am wrong :)
> Maybe not exactly? At the SASS level, it is. But for NVCC,
Yes. For the sm70+ ISA, the SASS for `ld.global.nc.L1::no_allocate` and `ld.global.L1::no_allocate` is `LDG.E.NA.[...].CONSTANT` and `LDG.E.NA[...]` respectively. They are the same in terms of memory coherence (cross-SM coherence is guaranteed for neither), but `.nc` asserts that the memory is read-only and that the address is not an alias of other pointers, so the compiler can reuse the data via registers and prune redundant memory instructions; performance may therefore be better with `.nc`.
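To illustrate the aliasing point, a hypothetical kernel (assuming `__ldg` lowers to `ld.global.nc` as usual):

```cuda
__global__ void scale_add(const float* x, float* y, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Two dereferences of the same read-only address:
    y[i]  = a * __ldg(&x[i]);  // ld.global.nc -> LDG.E.[...].CONSTANT
    y[i] += __ldg(&x[i]);      // second load may be pruned: .nc asserts
                               // x[i] cannot be modified by the y[i] store
    // With plain loads (y[i] = a * x[i]; y[i] += x[i];), the store to y[i]
    // could alias x[i], so the compiler must reload x[i] in between.
}
```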
> `.nc` seems to give more opportunities to optimize; e.g., IIRC, normal loads cannot be automatically unrolled by NVCC.
Whether a `while` or `for` loop gets unrolled usually depends on the loop stop condition and on resource limits (such as registers and barriers): if the stop condition can be evaluated at compile time and resources suffice, the loop can be unrolled, even with `asm volatile` segments or `ld.volatile.global` / `ld.global.cg` instructions inside the loop.
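A minimal sketch of this (my own helper, not from the repo), where the trip count is a template parameter known at compile time:

```cuda
// Hypothetical helper: the trip count kNumElems is a compile-time constant,
// so NVCC can fully unroll the loop even though the body is an inline-asm
// volatile load.
template <int kNumElems>
__device__ __forceinline__ void copy_volatile(const int* src, int* dst) {
    #pragma unroll
    for (int i = 0; i < kNumElems; ++i) {
        int v;
        asm volatile("ld.volatile.global.b32 %0, [%1];"
                     : "=r"(v) : "l"(src + i));
        dst[i] = v;
    }
}
```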
Regarding memory coherence, consider the 'producer-consumer' case: a producer warp on SM0 stores data to a buffer in global memory, and a consumer warp on SM1 loads the data from that buffer. Here `ld.global.nc.L1::no_allocate` is not safe. If the buffer's cache line was accessed by LDG instructions other than LDG.NA on SM1 before the producer warp's store, the L1 cache line is already allocated on SM1. LDG.NA means 'do not allocate a new L1 cache line and do not check hit on miss', but for requests that hit in L1 it still returns data from the L1 cache. So a consumer warp on SM1 loading the buffer via LDG.NA or LDG.NA.CONSTANT may get stale data; `ld.global.cg` (LDG.STRONG.GPU) should be used in such cases, as in the sketch below.
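A minimal sketch of that case (variable and function names are mine; the fences follow the usual CUDA publish/consume pattern):

```cuda
__device__ int buffer;  // written by SM0, read by SM1
__device__ int flag;    // publication flag

__device__ void producer() {             // runs on, say, SM0
    buffer = 42;
    __threadfence();                     // order the store before the flag
    atomicExch(&flag, 1);                // publish at device scope
}

__device__ int consumer() {              // runs on, say, SM1
    while (atomicAdd(&flag, 0) == 0) {}  // spin on the flag (bypasses L1)
    __threadfence();                     // order the flag read before the load
    int v;
    // Safe: ld.global.cg (LDG.STRONG.GPU) skips a possibly stale L1 line.
    asm volatile("ld.global.cg.b32 %0, [%1];"
                 : "=r"(v) : "l"(&buffer));
    return v;  // with ld.global.nc.L1::no_allocate this could be stale if
               // this SM touched buffer's cache line before the store
}
```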
Furthermore, on Ampere GPUs with a partitioned crossbar (where the L2 cache is also partitioned), such as A100, the two L2 partitions are kept coherent. DRAM is split into 8KB pages, and each page is associated with only one L2 partition. The SMs are likewise split into two groups, each directly connected to one L2 partition, so every SM has one near L2 partition (low latency) and one far L2 partition (across the L2 fabric, higher latency).
LDG.STRONG.GPU allocates an L2 cache line only in the L2 partition associated with the DRAM page, while LDG.NA allocates cache lines in both L2 partitions. So when the hot read-only data is no larger than half the L2 size, LDG or LDG.NA may be better, because all SMs can then access the data from their near L2 partition. If L2_size/2 < readonly_hot_data_size (e.g., more than 20MB on an A100 with its 40MB L2), LDG.STRONG.GPU may be better because there are no redundant L2 cache lines and the L2 hit rate can be higher. For L2 read-after-write-intensive cases, LDG.STRONG.GPU may also be better because LDG.NA incurs an additional L2 cache-line synchronization cost.
Hopper GPUs' L2 is also split into two partitions, but there is an L2 cache request coalescer (LRC), so L2 latency and bandwidth are almost the same for all SMs, and LDG.NA and LDG.STRONG.GPU usually behave the same except for the default L2 cache-line eviction priority. One caveat: for a 128B line previously accessed by an STG (any STG except STG.NA), an LDG.NA on the same SM touching that 128B line will still allocate a 32B L1 cache line, even though 'NA' means L1 no-allocate.
@LyricZhao
Got it, and thanks very much for your detailed explanation! I will fix the related code later (towards semantic correctness).