
Use bias epilogue in GPU affine operation if CUDA >= 10.1

Open · kpu opened this issue on Jul 17, 2020 · 10 comments

https://github.com/marian-nmt/marian-dev/blob/master/src/graph/node_operators_binary.h#L256-L265 is rather inefficient in its bias application: it runs a second GEMM against a vector of ones just to add the bias term (sketched below).
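For concreteness, here is a minimal sketch of that pattern; the function name, the row-major layout, and the explicit `ones` vector are illustrative, not marian's actual code:

```cpp
#include <cublas_v2.h>

// Hypothetical sketch of the two-GEMM bias pattern. A is m x k, B is k x n,
// bias has n elements, C is m x n, all row-major on the device. cuBLAS is
// column-major, so we compute C^T = B^T * A^T with swapped operands.
void affineWithOnesGemm(cublasHandle_t handle,
                        const float* A, const float* B,
                        const float* bias,
                        const float* ones, // m ones on the device
                        float* C, int m, int n, int k) {
  const float one = 1.0f, zero = 0.0f;
  // First GEMM: C = A * B
  cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
              n, m, k, &one, B, n, A, k, &zero, C, n);
  // Second GEMM: C += ones(m x 1) * bias(1 x n) -- a rank-1 GEMM whose
  // only purpose is to broadcast the bias row across all m rows of C.
  cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
              n, m, 1, &one, bias, n, ones, 1, &one, C, n);
}
```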

In CUDA 10.1 there is explicit bias support via the cuBLASLt epilogues (https://docs.nvidia.com/cuda/cublas/index.html#cublasLtEpilogue_t): `CUBLASLT_EPILOGUE_BIAS = 4`, documented as "Apply (broadcasted) bias from the bias vector. Bias vector length must match matrix D rows, and it must be packed (i.e., stride between vector elements is 1). Bias vector is broadcasted to all columns and added before applying the final postprocessing."

I'm not sure if it's there in prior versions.
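A rough sketch of what the fused path could look like (CUDA 11 API; the helper name, layout choices, and defaults are my assumptions, and error handling is omitted):

```cpp
#include <cublasLt.h>

// Hypothetical fused affine: C = A * B + bias in a single cublasLtMatmul
// call via CUBLASLT_EPILOGUE_BIAS. Same row-major shapes as above, so we
// again compute C^T = B^T * A^T; D then has n rows, matching the bias length.
void affineWithBiasEpilogue(cublasLtHandle_t ltHandle,
                            const float* A, const float* B,
                            const float* bias, float* C,
                            int m, int n, int k,
                            void* workspace, size_t workspaceSize,
                            cudaStream_t stream) {
  cublasLtMatmulDesc_t desc;
  cublasLtMatmulDescCreate(&desc, CUBLAS_COMPUTE_32F, CUDA_R_32F);

  // Request the bias epilogue and hand over the (packed) bias vector.
  cublasLtEpilogue_t epilogue = CUBLASLT_EPILOGUE_BIAS;
  cublasLtMatmulDescSetAttribute(desc, CUBLASLT_MATMUL_DESC_EPILOGUE,
                                 &epilogue, sizeof(epilogue));
  cublasLtMatmulDescSetAttribute(desc, CUBLASLT_MATMUL_DESC_BIAS_POINTER,
                                 &bias, sizeof(bias));

  // Column-major layouts for the swapped-operand trick.
  cublasLtMatrixLayout_t opADesc, opBDesc, cDesc;
  cublasLtMatrixLayoutCreate(&opADesc, CUDA_R_32F, n, k, n); // operand A = B
  cublasLtMatrixLayoutCreate(&opBDesc, CUDA_R_32F, k, m, k); // operand B = A
  cublasLtMatrixLayoutCreate(&cDesc,   CUDA_R_32F, n, m, n); // C and D

  const float alpha = 1.0f, beta = 0.0f;
  cublasLtMatmul(ltHandle, desc, &alpha,
                 B, opADesc, A, opBDesc, &beta,
                 C, cDesc, C, cDesc,
                 nullptr /* default algo heuristic */,
                 workspace, workspaceSize, stream);

  cublasLtMatrixLayoutDestroy(cDesc);
  cublasLtMatrixLayoutDestroy(opBDesc);
  cublasLtMatrixLayoutDestroy(opADesc);
  cublasLtMatmulDescDestroy(desc);
}
```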

Hat tip to @XapaJIaMnu.

kpu · Jul 17 '20 15:07

Oh, good to know, will take a look.

emjotde · Jul 17 '20 15:07

Do we have similar things for MKL?

emjotde · Jul 17 '20 16:07

Regarding MKL, paging @sidkashyap-at-Intel

kpu · Jul 17 '20 16:07

Regarding the CPU, I recall the current implementation being efficient. I benchmarked several different ways of adding the bias, and this one was the fastest for the student models on a single core at the time of WNGT 2019.

ykim362 · Jul 17 '20 22:07

I had similar experiences on the GPU, but I did not try anything like what is proposed here. Unsurprisingly, this is particularly efficient in the backward step.

emjotde · Jul 17 '20 23:07

@emjotde yes, that makes sense. fbgemm also has a bias epilogue, but it didn't help.

ykim362 · Jul 17 '20 23:07

This is a new feature in CUDA 10.1, which was released long after this code was written. I think it's worth investigating.

XapaJIaMnu · Jul 17 '20 23:07

It may help more on a GPU than on a CPU, as GPUs have more cores and are therefore more bottlenecked by memory bandwidth; fusing the bias into the GEMM epilogue saves an extra read-modify-write pass over the output matrix.

frankseide · Jul 17 '20 23:07

@XapaJIaMnu absolutely.

emjotde · Jul 17 '20 23:07

I have added a code path for CUDA >= 11 that will use the cublasLt bias fused op when possible.

FYI - CUDA versions before 11 only support this fused op for int8.

Will make a new PR with this soon™
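A minimal illustration of how such a compile-time guard could look, reusing the hypothetical helpers sketched earlier in the thread (not the actual PR code):

```cpp
#include <cuda.h> // defines CUDA_VERSION, e.g. 11000 for CUDA 11.0

#if CUDA_VERSION >= 11000
  // Fused path: bias is applied inside the GEMM epilogue.
  affineWithBiasEpilogue(ltHandle, A, B, bias, C, m, n, k,
                         workspace, workspaceSize, stream);
#else
  // Pre-11 cuBLASLt only supports the bias epilogue for int8,
  // so keep the second-GEMM bias for float here.
  affineWithOnesGemm(handle, A, B, bias, ones, C, m, n, k);
#endif
```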

rhenry-nv · Nov 05 '20 01:11