Piggy
I was implementing the backward calculation for position weights and found it rather slow to use atomic_add directly on global memory. Here is a piece of benchmarking code. ``` cuda_source =...
Could anyone help? Many thanks!
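Since the benchmark above is truncated, here is a minimal sketch of the usual remedy for slow global-memory atomics: reduce partial sums inside each block (warp shuffles plus one shared-memory accumulator) so that only one `atomicAdd` per block touches global memory. The kernel and names (`accumulate_pos_grad`, `grad_out`, `grad_pos`, `n`) are hypothetical, not taken from the original code.

```cuda
#include <cuda_runtime.h>

// Hypothetical sketch: block-level reduction before a single global atomicAdd.
__global__ void accumulate_pos_grad(const float* __restrict__ grad_out,
                                    float* __restrict__ grad_pos,
                                    int n) {
    __shared__ float partial;          // one accumulator per block
    if (threadIdx.x == 0) partial = 0.0f;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? grad_out[i] : 0.0f;

    // Warp-level tree reduction: only one shared-memory atomic per warp.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    if ((threadIdx.x & 31) == 0) atomicAdd(&partial, v);
    __syncthreads();

    // One global atomicAdd per block instead of one per thread.
    if (threadIdx.x == 0) atomicAdd(grad_pos, partial);
}
```

This cuts global atomic traffic by a factor of the block size, which is typically where the slowdown described above comes from.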
It seems to me that mma instructions do not support fp32 for multiplicands A/B, per https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-data-types. So can I use ldmatrix alone to accelerate the copying from smem to registers?...
> The NVIDIA tensor core does not natively support A/B with fp32 inputs, so it is not possible.
>
> Alternatives are:
>
> 1. use the .tf32 version with reduced...
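As a sketch of the tf32 alternative mentioned in the quote: the CUDA `nvcuda::wmma` API accepts fp32 data tagged as `precision::tf32` on the m16n16k8 tile shape (sm_80 and later), trading mantissa bits for tensor-core throughput. The kernel name and assumption of a single 16x16x8 tile are illustrative only.

```cuda
#include <mma.h>
using namespace nvcuda;

// Hypothetical single-tile example: C = A * B with fp32 storage
// routed through tf32 tensor-core MMA (requires sm_80+).
__global__ void mma_tf32_tile(const float* A, const float* B, float* C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 8,
                   wmma::precision::tf32, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 8,
                   wmma::precision::tf32, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 8, float> c;

    wmma::fill_fragment(c, 0.0f);
    wmma::load_matrix_sync(a, A, 8);   // A is 16x8 row-major, ld = 8
    wmma::load_matrix_sync(b, B, 8);   // B is 8x16 col-major, ld = 8

    // Explicitly round fp32 -> tf32 in-register before the MMA.
    for (int i = 0; i < a.num_elements; ++i)
        a.x[i] = wmma::__float_to_tf32(a.x[i]);
    for (int i = 0; i < b.num_elements; ++i)
        b.x[i] = wmma::__float_to_tf32(b.x[i]);

    wmma::mma_sync(c, a, b, c);
    wmma::store_matrix_sync(C, c, 16, wmma::mem_row_major);
}
```

The accumulator stays full fp32; only the A/B inputs lose precision (tf32 keeps 10 mantissa bits), which is why the quote describes this as a "reduced" option.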