Piggy
I was implementing the backward calculation for position weights and found it rather slow to use atomic_add directly on global memory. Here is a piece of benchmarking code. ``` cuda_source =...
Could anyone help? Many thanks!
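Since the benchmark above is truncated, here is a minimal sketch of the usual remedy for slow global-memory atomics: reduce partial sums inside each block (warp shuffles plus one shared-memory accumulator) so that only one `atomicAdd` per block touches global memory. The kernel and names (`accumulate_pos_grad`, `grad_out`, `grad_pos`, `n`) are hypothetical, not taken from the original code.

```cuda
#include <cuda_runtime.h>

// Hypothetical sketch: block-level reduction before a single global atomicAdd.
__global__ void accumulate_pos_grad(const float* __restrict__ grad_out,
                                    float* __restrict__ grad_pos,
                                    int n) {
    __shared__ float partial;          // one accumulator per block
    if (threadIdx.x == 0) partial = 0.0f;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? grad_out[i] : 0.0f;

    // Warp-level tree reduction: only one shared-memory atomic per warp.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    if ((threadIdx.x & 31) == 0) atomicAdd(&partial, v);
    __syncthreads();

    // One global atomicAdd per block instead of one per thread.
    if (threadIdx.x == 0) atomicAdd(grad_pos, partial);
}
```

This cuts global atomic traffic by a factor of the block size, which is typically where the slowdown described above comes from.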
It seems to me that mma instructions do not support fp32 for multiplicands A/B, per https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-data-types. So can I use ldmatrix alone to accelerate the copying from smem to registers?...
> The NVIDIA tensor core does not natively support A/B with fp32 inputs, so it is not possible.
>
> Alternatives are:
>
> 1. use the .tf32 version with reduced...
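As a sketch of the tf32 alternative mentioned in the quote: the CUDA `nvcuda::wmma` API accepts fp32 data tagged as `precision::tf32` on the m16n16k8 tile shape (sm_80 and later), trading mantissa bits for tensor-core throughput. The kernel name and assumption of a single 16x16x8 tile are illustrative only.

```cuda
#include <mma.h>
using namespace nvcuda;

// Hypothetical single-tile example: C = A * B with fp32 storage
// routed through tf32 tensor-core MMA (requires sm_80+).
__global__ void mma_tf32_tile(const float* A, const float* B, float* C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 8,
                   wmma::precision::tf32, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 8,
                   wmma::precision::tf32, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 8, float> c;

    wmma::fill_fragment(c, 0.0f);
    wmma::load_matrix_sync(a, A, 8);   // A is 16x8 row-major, ld = 8
    wmma::load_matrix_sync(b, B, 8);   // B is 8x16 col-major, ld = 8

    // Explicitly round fp32 -> tf32 in-register before the MMA.
    for (int i = 0; i < a.num_elements; ++i)
        a.x[i] = wmma::__float_to_tf32(a.x[i]);
    for (int i = 0; i < b.num_elements; ++i)
        b.x[i] = wmma::__float_to_tf32(b.x[i]);

    wmma::mma_sync(c, a, b, c);
    wmma::store_matrix_sync(C, c, 16, wmma::mem_row_major);
}
```

The accumulator stays full fp32; only the A/B inputs lose precision (tf32 keeps 10 mantissa bits), which is why the quote describes this as a "reduced" option.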