Use bias epilogue in GPU affine operation if CUDA >= 10.1
https://github.com/marian-nmt/marian-dev/blob/master/src/graph/node_operators_binary.h#L256-L265 is rather inefficient in its bias application: it uses a second GEMM against a vector of 1s to broadcast and add the bias term.
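For reference, this is roughly the pattern being described (a minimal sketch with hypothetical names and column-major FP32 shapes, not marian's actual code): the bias is broadcast across columns by a rank-1 GEMM against a vector of ones.

```cpp
#include <cublas_v2.h>

// Hypothetical sketch of the bias-via-second-GEMM pattern.
// C is m x n (column-major), bias has length m, ones has n elements all 1.0f.
// The rank-1 update C += bias * ones^T broadcasts the bias to every column.
void addBiasViaGemm(cublasHandle_t handle,
                    float* C, const float* bias, const float* ones,
                    int m, int n) {
  const float alpha = 1.0f, beta = 1.0f;
  // Treat bias as an m x 1 matrix and ones as a 1 x n matrix:
  // C = alpha * bias * ones + beta * C
  cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
              m, n, 1,
              &alpha, bias, m,
              ones, 1,
              &beta, C, m);
}
```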
In CUDA 10.1 there is explicit bias support via the cublasLt epilogue https://docs.nvidia.com/cuda/cublas/index.html#cublasLtEpilogue_t, specifically CUBLASLT_EPILOGUE_BIAS = 4:
"Apply (broadcasted) bias from the bias vector. Bias vector length must match matrix D rows, and it must be packed (i.e., stride between vector elements is 1). Bias vector is broadcasted to all columns and added before applying the final postprocessing."
I'm not sure if it's there in prior versions.
Hat tip to @XapaJIaMnu.
Oh, good to know, will take a look.
Do we have something similar for MKL?
Regarding MKL, paging @sidkashyap-at-Intel
Regarding the CPU, I recall the current implementation was efficient. I tried several different options for adding the bias, and this one was fastest for the student models on a single core at the time of WNGT 19.
I had similar experiences on the GPU, though I did not try anything like what is proposed here. Unsurprisingly, this approach is particularly efficient in the backward step.
@emjotde yes, that makes sense. fbgemm also has a bias epilogue, but it didn't help.
This is a new feature in CUDA 10.1, which was released long after this code was written. I think it's worth investigating.
It may help more on a GPU than on a CPU, as GPUs have more cores and are therefore more bottlenecked by memory bandwidth.
@XapaJIaMnu absolutely.
I have added a code path for CUDA >= 11 that will use the cublasLt bias fused op when possible.
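Such a guard could look roughly like this (a hypothetical dispatch reusing the two sketches above; CUDA_VERSION comes from cuda.h):

```cpp
#include <cuda.h> // defines CUDA_VERSION, e.g. 11000 for CUDA 11.0

// Hypothetical dispatch: prefer the fused cublasLt bias epilogue when built
// against CUDA >= 11, otherwise fall back to plain GEMM plus the ones-vector
// bias GEMM. Assumes the two sketch functions above are in scope.
void affine(cublasLtHandle_t ltHandle, cublasHandle_t handle,
            const float* A, const float* B, const float* bias, const float* ones,
            float* D, int m, int n, int k,
            void* workspace, size_t workspaceSize, cudaStream_t stream) {
#if CUDA_VERSION >= 11000
  affineWithBiasEpilogue(ltHandle, A, B, bias, D, m, n, k,
                         workspace, workspaceSize, stream);
#else
  const float alpha = 1.0f, beta = 0.0f;
  cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
              &alpha, A, m, B, k, &beta, D, m);
  addBiasViaGemm(handle, D, bias, ones, m, n); // from the first sketch
#endif
}
```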
FYI - CUDA versions before 11 only support this op for int8.
Will make a new PR with this soon™