
[QST] How are shared memory bank conflicts handled in depthwise conv?

Open yupatrick22 opened this issue 2 years ago • 6 comments

What is your question? Bank conflicts play an extremely important role in smem performance. How are they handled in depthwise conv? @Ethan-Yan27

yupatrick22 avatar Dec 07 '23 14:12 yupatrick22

The loaded elements are simply stored consecutively in smem. The bottleneck of depthwise conv is mainly in DRAM and L2, so no padding or swizzling techniques were applied.
If you are interested in how the smem-related operations are implemented, please refer to https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/conv/threadblock/depthwise_mma_core_with_lane_access_size.h#L847

https://github.com/NVIDIA/cutlass/blob/f4a021660162510572f90ea715b018cff9c0f12f/include/cutlass/transform/threadblock/regular_tile_access_iterator_pitch_linear_direct_conv.h

Ethan-Yan27 avatar Dec 08 '23 12:12 Ethan-Yan27

[screenshot of the source code]

From the source code, it looks like there are two different data-reuse strategies: kOptimized and kFixedStrideDilation. For kOptimized, as the code above shows, each thread first calculates the offset and then loads fragment A (of size tileP*tileQ) from smem, repeating this R*S times.

For kFixedStrideDilation, on the other hand, the input tile (i.e., all the dependent activations needed to compute fragment C) is first loaded into the register file, and then static loads (which the compiler can resolve at compile time) move data from the input tile into fragment A.

Why is kOptimized designed this way?

What would happen if kOptimized used the kFixedStrideDilation strategy? Would thread-local memory be used?

@Ethan-Yan27

yupatrick22 avatar Dec 09 '23 14:12 yupatrick22

Since sample 46 uses only the alpha-scaling epilogue, I think the kernel will write its output to tensor_d, but will it also write to tensor_c? @Ethan-Yan27

yupatrick22 avatar Dec 10 '23 12:12 yupatrick22

What would happen if kOptimized used the kFixedStrideDilation strategy? Would thread-local memory be used?

Right. If we applied a similar strategy, the kernel would probably hit register-spilling issues.

In general, for kFixedStrideDilation, the kernel keeps some inputs persistent in registers to squeeze out more performance, so large filters/strides/dilations are not recommended.

Since sample 46 uses only the alpha-scaling epilogue, I think the kernel will write its output to tensor_d, but will it also write to tensor_c?

No, it would not write to tensor_c. Because the epilogue scale operation is OnlyAlphaScaling, tensor_c is unused.

Ethan-Yan27 avatar Dec 12 '23 13:12 Ethan-Yan27

@yupatrick22 has your issue been resolved?

mnicely avatar Jan 02 '24 15:01 mnicely

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar Feb 01 '24 16:02 github-actions[bot]