Padding between sub-tensors in blocked memory layout
Hello,
I am curious to know whether there is any way to define a memory layout that adds physical padding between the sub-tensors of the nChw16c layout. Assuming a padding of P elements, the stride between consecutive channel blocks would be H*W*16 + P instead of the typical H*W*16.
Hi @alexandrelimassantana, thank you for the question. oneDNN provides the submemory_desc API to create this kind of memory descriptor: the user creates a bigger memory descriptor and then derives a sub-descriptor with the desired strides from it, as sketched below. Could you please share what you are trying to achieve? Thanks.
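A minimal sketch of that approach, assuming hypothetical sizes; note that since the parent descriptor can only be enlarged along logical dimensions, the padding here is one extra row per channel block (W*16 elements) rather than a single cache line:

```cpp
#include <iostream>
#include "dnnl.hpp"

int main() {
    using namespace dnnl;

    // Hypothetical sizes: a 28x28 activation whose channel-block stride
    // (28*28 pixels, each a 16-float block) has a large power-of-two factor.
    const memory::dim N = 1, C = 64, H = 28, W = 28;

    // Parent descriptor with one extra row per channel block: the stride
    // between nChw16c channel blocks becomes 29*28*16 elements instead of
    // 28*28*16, breaking the power-of-two pattern.
    memory::desc parent_md({N, C, H + 1, W}, memory::data_type::f32,
            memory::format_tag::nChw16c);

    // Sub-descriptor covering only the logical HxW region; the extra row
    // per block remains as physical padding between the sub-tensors.
    memory::desc sub_md
            = parent_md.submemory_desc({N, C, H, W}, {0, 0, 0, 0});

    std::cout << "parent buffer: " << parent_md.get_size() << " bytes\n";
    return 0;
}
```

Since H is not a blocked dimension in nChw16c, the sub-dims and zero offsets here satisfy the blocking constraints of the parent descriptor.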
Hi @dzarukin, this is illuminating, I will take a look. Do you happen to know in which oneDNN version this feature was introduced? I do not think I saw it in the fork I work from (v1.7.4).
I will tell you the long story so that perhaps you can tell me whether what I am trying to study is feasible with the submemory_desc API. I am studying the incidence of cache conflict misses in the SIMD direct convolution. I found that activation tensors whose pixel count has a large power-of-two factor (e.g. 28x28 = 16*49) experience more cache misses than similarly shaped tensors (e.g. 27x27 or 29x29). I believe this is related to the stride between sub-tensors, which becomes a power-of-two multiple of the cache line size (the stride between pixels, a block of 16 floats, is exactly one cache line). Such strides would, in theory, cause a snowballing problem: uneven pressure on the cache sets, leading to early evictions, which stall the SIMD units and ultimately degrade performance. I want to check whether offsetting the stride by one cache line improves performance for layers with this characteristic, as it would break the power-of-two memory access pattern with minimal memory overhead (OC+IC data elements).
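To make the conflict argument concrete, here is a rough sketch of the arithmetic I have in mind, assuming a hypothetical 32 KiB, 8-way L1 cache with 64-byte lines (64 sets):

```cpp
#include <cstdio>
#include <set>

int main() {
    // Hypothetical L1 cache: 32 KiB, 8-way, 64-byte lines -> 64 sets.
    const long num_sets = 64;

    // One nChw16c pixel = 16 floats = 64 bytes = exactly one cache line,
    // so the channel-block stride measured in cache lines equals H*W.
    const long stride_pow2 = 28 * 28;       // 784 = 16*49 lines
    const long stride_padded = 28 * 28 + 1; // one extra line of padding

    for (long stride : {stride_pow2, stride_padded}) {
        std::set<long> sets_touched;
        // Visit the same pixel position across 16 consecutive channel blocks.
        for (long block = 0; block < 16; ++block)
            sets_touched.insert((block * stride) % num_sets);
        std::printf("stride %4ld lines -> %zu distinct cache sets\n",
                stride, sets_touched.size());
    }
    return 0;
}
```

With the power-of-two stride, 16 consecutive channel blocks land in only 4 distinct sets (784 mod 64 = 16), while a single line of padding spreads them across all 16.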
Hi @alexandrelimassantana,
The API has been available since v0, so it definitely should be in v1.7.
I believe you are referring to the effect of cache associativity in a set-associative cache. In general your observation is correct: power-of-two strides do not utilize caches well. However, depending on the memory footprint and data reuse, the problem may be smaller than it seems, since this effect mostly hurts memory-bound convolutions, and convolutions are usually compute-bound instead.
It also depends on which level of cache is targeted. To thrash the L2 cache, the data accessed by a single core needs a pretty significant stride (like 1K, 2K, or 4K). If the thread reads/writes data consecutively, the effect will not appear.
Most of the time this problem can be solved with a proper division of work between threads and/or a suitable data format. Hope it helps. Thank you.